Dictionary and Motivation
Dictionary problem
- A dictionary is an abstract data type (ADT).
- This means it maintains a set of items, each with a key.
- The available operations are:
- insert(item): add item to the set (keys must be unique; inserting an existing key overrides the old item)
- delete(item): remove item from set
- search(key): return item with key if it exists
Balanced BSTs solve this in $O(\lg n)$ time per operation. The goal is to make each operation take $O(1)$ time.
(Remember that BSTs etc. are data structures, which can be seen as implementations of an ADT.)
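The three ADT operations above map directly onto Python's built-in dict; a minimal sketch (the keys and items here are illustrative, not from the notes):

```python
# Python's built-in dict implements the dictionary ADT.
d = {}

# insert(item): add an item keyed by its key;
# inserting an existing key overrides the old item.
d["alice"] = {"name": "alice", "balance": 100}
d["alice"] = {"name": "alice", "balance": 250}  # overrides the previous item

# search(key): return the item with that key if it exists (else None).
item = d.get("alice")

# delete(item): remove the item with that key from the set.
del d["alice"]

print(item)
```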
Motivation
Dictionaries are perhaps the most popular data structure in CS
- built into most modern programming languages (Python, Perl, Ruby, JavaScript, Java, C++, C#, . . . )
- e.g. best docdist code: word counts & inner product
- implement databases: (DB HASH in Berkeley DB)
- English word → definition (literal dict.)
- English words: for spelling correction
- word → all webpages containing that word
- username → account object
- compilers & interpreters: names → variables
- network routers: IP address → wire
- network server: port number → socket/app.
- virtual memory: virtual address → physical
Prehashing and Hashing
How do we solve the dictionary problem, i.e., bring each operation down to $O(1)$ time?
Simple Approach: Direct Access Table
This means items would need to be stored in an array, indexed by key (random access). But this has two problems:
- keys must be nonnegative integers (or using two arrays, integers)
- if the keys have a large range, then a large table is needed: e.g., if one key is $2^{256}$, the table must have $2^{256}$ slots even when no other keys are stored.
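A minimal direct-access-table sketch; it assumes keys are small nonnegative integers (problem 1), and its space is proportional to the key range rather than the number of items (problem 2):

```python
class DirectAccessTable:
    def __init__(self, key_range):
        # One slot per possible key: space grows with the key range,
        # not with the number of items stored.
        self.table = [None] * key_range

    def insert(self, key, item):
        self.table[key] = item      # overwrites any existing item

    def search(self, key):
        return self.table[key]      # O(1) random access

    def delete(self, key):
        self.table[key] = None

t = DirectAccessTable(100)
t.insert(42, "answer")
```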
Solution
Solution to 1: “prehash” keys to integers
- In theory, possible because keys are finite ⇒ set of keys is countable
- In Python: hash(object) (actually "hash" is a misnomer here; it should be "prehash"), where object can be a number, string, tuple, etc., or an object implementing __hash__ (default = id = memory address)
- In theory, x = y ⇔ hash(x) = hash(y)
- Python applies some heuristics for practicality: for example, hash('\0B') = 64 = hash('\0\0C')
- Object’s key should not change while in table (else cannot find it anymore)
- No mutable objects like lists
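The prehashing rules above can be observed directly: hash() accepts immutable built-ins but rejects mutable ones like lists.

```python
# Equal objects must prehash equally.
print(hash(1) == hash(1))        # True

# Numbers, strings, and tuples of hashables are all hashable.
print(hash((1, "a")) == hash((1, "a")))   # True

# Mutable objects like lists are not hashable: their key could
# change while in the table, so the item could never be found again.
try:
    hash([1, 2, 3])
except TypeError:
    print("lists are not hashable")
```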
Solution to 2: hashing
- Reduce universe U of all keys (say, integers) down to reasonable size m for table
- idea: m ≈ n = number of keys stored in dictionary
- hash function $h: U → \{0, 1, \ldots, m − 1\}$
- two keys $k_i, k_j ∈ K$ collide if $h(k_i) = h(k_j)$
Hash Function: a function that converts a given large key (say, a big number) into a small, practical integer value. The mapped integer value is used as an index into the hash table.
How do we deal with collisions? We will see two ways
- Chaining
- Open addressing: in Lec10
Chaining
Linked list of colliding elements in each slot of table
- Search must go through whole list T[h(key)]
- Worst case: all n keys hash to same slot ⇒ Θ(n) per operation
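A chaining sketch: each slot of the table holds a Python list (the "chain") of (key, item) pairs hashing to that slot. Search scans the whole chain T[h(key)], matching the worst case noted above.

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _chain(self, key):
        return self.slots[hash(key) % self.m]  # chain T[h(key)]

    def insert(self, key, item):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, item)   # overwrite existing key
                return
        chain.append((key, item))

    def search(self, key):
        # Worst case: every key is in this one chain -> Theta(n) scan.
        for k, item in self._chain(key):
            if k == key:
                return item
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                return

t = ChainedHashTable(m=4)
t.insert("x", 1)
t.insert("y", 2)
```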
Simple uniform hashing
An assumption (cheating): Each key is equally likely to be hashed to any slot of table, independent of where other keys are hashed.
let n = number of keys stored in table
m = number of slots in table
load factor α = n/m = expected number of keys per slot = expected length of a chain
Performance:
This implies that expected running time for search is $Θ(1+α)$ — the 1 comes from applying the hash function and random access to the slot whereas the $α$ comes from searching the list. This is equal to $O(1)$ if $α = O(1)$, i.e., $m = Ω(n)$.
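A quick empirical sketch of the load factor: hashing n random keys into m slots always gives an average chain length of exactly α = n/m (the hash function only decides how evenly the keys spread, not the average).

```python
import random

n, m = 10_000, 1_000
counts = [0] * m
for _ in range(n):
    # Random float keys stand in for "simple uniform hashing".
    counts[hash(random.random()) % m] += 1

alpha = n / m
avg_chain = sum(counts) / m   # total keys / slots = n/m
print(avg_chain, alpha)
```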
Hash Function
Following are three methods to achieve the above performance
Division Method:
$$h(k) = k \bmod m$$
This is practical when m is prime but not too close to a power of 2 or 10 (otherwise the hash depends only on the low-order bits/digits).
But it is inconvenient to find a prime number, and division is slow.
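A division-method sketch; m = 701 is an illustrative choice of a prime not too close to a power of 2 or 10 (not prescribed by the notes). Keys that are congruent mod m land in the same slot:

```python
m = 701  # a prime, not too close to a power of 2 or 10

def h_division(k, m=m):
    return k % m

# 3, 704, and 1405 are all congruent mod 701, so they collide.
keys = [3, 704, 1405]
print([h_division(k) for k in keys])   # [3, 3, 3]
```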
Multiplication Method:
$$h(k) = [(a \cdot k) \bmod 2^w] \gg (w − r)$$
where a is random, k is w bits, and $m = 2^r$.
This is practical when a is odd, $2^{w−1} < a < 2^w$, and a is not too close to $2^{w−1}$ or $2^w$.
Multiplication and bit extraction are faster than division.
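A multiplication-method sketch for w = 64-bit keys and $m = 2^r$ slots. The particular multiplier a is an assumption (any odd 64-bit constant strictly between $2^{63}$ and $2^{64}$, away from both endpoints, would do):

```python
w, r = 64, 10                  # table of m = 2**10 = 1024 slots
a = 0x9E3779B97F4A7C15         # an odd 64-bit constant with 2**63 < a < 2**64

def h_multiplication(k):
    # Multiply, keep the low w bits, then extract the top r of those bits.
    return ((a * k) % (1 << w)) >> (w - r)

print(h_multiplication(12345))
```

Only multiplication and bit operations are used, which is why this is faster than the division method.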
Universal Hashing:
For example: $h(k) = [(ak+b) \bmod p] \bmod m$ where a and b are random $∈ \{0, 1, \ldots, p−1\}$, and p is a large prime ($> |U|$).
This implies that for any worst-case keys $k_1 \ne k_2$ (the probability taken over the random choice of a and b, i.e., of h): $$\Pr_{a,b}(\text{event } X_{k_1 k_2}) = \Pr_{a,b}\{h(k_1)=h(k_2)\} = \frac 1m$$
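A universal-hashing sketch that checks the collision bound empirically. The constants are assumptions: $p = 2^{31}-1$ (a prime larger than the small key universe used here), m = 100, and a drawn nonzero as in the standard universal family:

```python
import random

p, m = 2_147_483_647, 100    # p = 2**31 - 1, a prime > |U| for small int keys

def make_hash():
    # A fresh random (a, b) defines one hash function from the family.
    a = random.randrange(1, p)   # nonzero a, per the standard construction
    b = random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

# For fixed distinct keys, a collision happens with probability ~ 1/m
# over the random choice of (a, b).
k1, k2 = 17, 42
trials = 20_000
collisions = 0
for _ in range(trials):
    h = make_hash()
    if h(k1) == h(k2):
        collisions += 1

print(collisions / trials)   # should be close to 1/m = 0.01
```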