MIT6006 Lec09 Hashing II

2018-11-19

Table Resizing and Amortization

In ‘Hashing I’, we know that we will create a hash table with chaining to save the data. The size of the table is m, which is smaller than n, the number of keys(data). But then one problem comes, how to choose m? the size of the table.

want $m=\theta(n)$ at all times.
but don’t know how large n will get at creation
if m is too small ⇒ slow; m too big ⇒ wasteful

The idea is to start small (constant) and grow (or shrink) as necessary.

Rehashing
When n reaches m, grow the table.

If grow the table by m+=1
- need to rebuild every step
- n inserts cost $Θ(1+2+ \cdots +n) = Θ(n2)$
If grow the table by m ∗ =2
- just to rebuild at insertion $2^i$
- n inserts cost $Θ(1+2+4+8+ \cdots +n)$ where n is really the next power of $2=Θ(n)$ (This is called table doubling)

Pseudo code
Insert:

# m is the size of the table, n is the number of the elements in that table
function Insert(x, A, m, n)
    if (n = m) then
        Create new array A2 with size 2m;
        for i ← 1 to m do
            A2[i] ← A[i];
        end
        Replace array A with A2;
        m ← 2m;
    end
    n ← n + 1;
    A[n] ← x;

Delete:

function Delete(x, A, m, n)
    x ← A[n];
    n ← n − 1;
    if (n ≤ m/4) and (m ≥ 4) then
        Create new array A2 with size m/2;
        for i ← 1 to n do
            A2[i] ← A[i];
        end
        Replace array A with A2;
        m ← m/2;
    end

Amortization

operation has amortized cost $T(n)$ if $k$ operations cost $≤ k · T(n)$
“T(n) amortized” roughly means T(n) “on average”, but averaged over all operations.
e.g. inserting into a hash table takes O(1) amortized time.

String Matching and Karp-Rabin

Given two strings s and t, does s occur as a substring of t?

Simple Algorithm
any(s == t[i : i + len(s)] for i in range(len(t) − len(s)))
— O(|s|) time for each substring comparison
⇒ O(|s| · (|t| − |s|)) time
= O(|s| · |t|)
potentially quadratic

Karp-Rabin Algorithm

Compare h(s) == h(t[i : i + len(s)])
If hash values match, likely so do strings
- can check s == t[i : i + len(s)] to be sure ∼ cost $O(|s|)$
- if yes, found match — done
- if no, happened with probability < $\frac 1{|s|}$ ⇒ expected cost is $O(1)$ per i.
need suitable hash function.
expected time = O(|s| + |t| · cost(h)).
- naively h(x) costs |x|
- we’ll achieve O(1)!
- idea: t[i : i + len(s)] ≈ t[i + 1 : i + 1 + len(s)]

So using Rabin Karp algorithm needs to calculate hash values for strings first. The hash function suggested by Rabin and Karp calculates an integer value. The integer value for a string is numeric value of a string.

Rolling Hash

The idea is to compute the hash value of a substring using the hash value of previous substring if you use the right hash function. A polynomial with the string characters as coefficients works well.

Let the following be the hash function for string c:
$$H(c) = c[0]a^{n-1} + c[1]a^{n-2} + \cdots + c[n-1]$$
All the operations will be done modulo a prime number so that we don’t have to deal with large integers.

List an example for calculating the hash value for a string:
$$H(S[i+1 … i+n]) = S[i+1]a^{n-1} + S[i+2]a^{n-2} + … + S[i+n]$$
where n is the length of the string that we want to match
By adding and substracting $S[i]a^n$, get $$a(S[i]a^{n-1} + S[i+1]a^{n-2} + S[i+2]a^{n-3} + … + S[i+n-1]) + S[i+n] = aH(S[i … i+n-1]) - S[i]a^n + S[i+n]$$

Reference:
http://web.cse.ohio-state.edu/~rademacher.10/Sp16_2331/datastructII.pdf
https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/
https://www.infoarena.ro/blog/rolling-hash
https://brilliant.org/wiki/rabin-karp-algorithm/