Table Resizing and Amortization
In ‘Hashing I’, we know that we will create a hash table with chaining to save the data. The size of the table is m
, which is smaller than n
, the number of keys(data). But then one problem comes, how to choose m
? the size of the table.
- want $m=\theta(n)$ at all times.
- but don’t know how large n will get at creation
- if m is too small ⇒ slow; m too big ⇒ wasteful
The idea is to start small (constant) and grow (or shrink) as necessary.
Rehashing
When n reaches m, grow the table.
- If grow the table by
m+=1
- need to rebuild every step
- n inserts cost $Θ(1+2+ \cdots +n) = Θ(n2)$
- If grow the table by
m ∗ =2
- just to rebuild at insertion $2^i$
- n inserts cost $Θ(1+2+4+8+ \cdots +n)$ where n is really the next power of $2=Θ(n)$ (This is called table doubling)
Pseudo code
Insert:1
2
3
4
5
6
7
8
9
10
11
12# m is the size of the table, n is the number of the elements in that table
function Insert(x, A, m, n)
if (n = m) then
Create new array A2 with size 2m;
for i ← 1 to m do
A2[i] ← A[i];
end
Replace array A with A2;
m ← 2m;
end
n ← n + 1;
A[n] ← x;
Delete:1
2
3
4
5
6
7
8
9
10
11function Delete(x, A, m, n)
x ← A[n];
n ← n − 1;
if (n ≤ m/4) and (m ≥ 4) then
Create new array A2 with size m/2;
for i ← 1 to n do
A2[i] ← A[i];
end
Replace array A with A2;
m ← m/2;
end
Amortization
- operation has amortized cost $T(n)$ if $k$ operations cost $≤ k · T(n)$
- “T(n) amortized” roughly means T(n) “on average”, but averaged over all operations.
- e.g. inserting into a hash table takes O(1) amortized time.
String Matching and Karp-Rabin
Given two strings s and t, does s occur as a substring of t?
Simple Algorithm
any(s == t[i : i + len(s)] for i in range(len(t) − len(s)))
— O(|s|) time for each substring comparison
⇒ O(|s| · (|t| − |s|)) time
= O(|s| · |t|)
potentially quadratic
Karp-Rabin Algorithm
- Compare h(s) == h(t[i : i + len(s)])
- If hash values match, likely so do strings
- can check s == t[i : i + len(s)] to be sure ∼ cost $O(|s|)$
- if yes, found match — done
- if no, happened with probability < $\frac 1{|s|}$ ⇒ expected cost is $O(1)$ per i.
- need suitable hash function.
- expected time = O(|s| + |t| · cost(h)).
- naively h(x) costs |x|
- we’ll achieve O(1)!
- idea: t[i : i + len(s)] ≈ t[i + 1 : i + 1 + len(s)]
So using Rabin Karp algorithm needs to calculate hash values for strings first. The hash function suggested by Rabin and Karp calculates an integer value. The integer value for a string is numeric value of a string.
Rolling Hash
The idea is to compute the hash value of a substring using the hash value of previous substring if you use the right hash function. A polynomial with the string characters as coefficients works well.
Let the following be the hash function for string c:
$$H(c) = c[0]a^{n-1} + c[1]a^{n-2} + \cdots + c[n-1]$$
All the operations will be done modulo a prime number so that we don’t have to deal with large integers.
List an example for calculating the hash value for a string:
$$H(S[i+1 … i+n]) = S[i+1]a^{n-1} + S[i+2]a^{n-2} + … + S[i+n]$$
where n is the length of the string that we want to match
By adding and substracting $S[i]a^n$, get $$a(S[i]a^{n-1} + S[i+1]a^{n-2} + S[i+2]a^{n-3} + … + S[i+n-1]) + S[i+n] = aH(S[i … i+n-1]) - S[i]a^n + S[i+n]$$
Reference:
http://web.cse.ohio-state.edu/~rademacher.10/Sp16_2331/datastructII.pdf
https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/
https://www.infoarena.ro/blog/rolling-hash
https://brilliant.org/wiki/rabin-karp-algorithm/