Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

Question

I learned the notation of double hashing [(hash1(key) + i * hash2(key)) % TABLE_SIZE] couple days ago. There is a part I couldn't understand after thinking about it and searching for answer for days.

Why don't we discard the [hash1(key)] part from the double hashing function and simply make it as [(i * hash2(key)) % TABLE_SIZE]?

I couldn't find any downside of doing this, except all the hashcodes would start from 0 (when i = 0). The main purpose of using double hashing, avoiding clusters, can still be achieved.

It will be super thankful if anyone can help:D

templatetypedef · Accepted Answer

A quick summary of this answer:

There is a practical performance hit to your modified version, though it's not very large.
I think this is due to there not being as many different probe sequences as in regular double hashing, leading to some extra collisions due to the birthday paradox.

Now, the actual answer. :-)

Let's start off with some empirical analysis. What happens if you switch from the "standard" version of double hashing to the variant of double hashing that you've proposed?

I wrote a C++ program that generates uniformly-random choices of h₁ and h₂ values for each of the elements. It then inserts them into two different double-hashing tables, one using the normal approach and one using the variant. It repeats this process multiple times and reports the average number of probes required across each insertion. Here's what I found:

#include <iostream>
#include <vector>
#include <random>
#include <utility>
using namespace std;

/* Table size is picked to be a prime number. */
const size_t kTableSize = 5003;

/* Load factor for the hash table. */
const double kLoadFactor = 0.9;

/* Number of rounds to use. */
const size_t kNumRounds = 100000;

/* Random generator. */
static mt19937 generator;

/* Creates and returns an empty double hashing table. */
auto emptyTable(const size_t numSlots) {
  return vector<bool>(numSlots, false);
}

/* Simulation of double hashing. Each vector entry represents an item to store.
 * The first element of the pair is the value of h1(x), and the second element
 * of the pair is the value of h2(x).
 */
auto hashCodes(const size_t numItems) {
  vector<pair<size_t, size_t>> result;
  
  uniform_int_distribution<size_t> first(0, kTableSize - 1), second(1, kTableSize - 1);
  
  for (size_t i = 0; i < numItems; i++) {
    result.push_back({ first(generator), second(generator) });
  }
  
  return result;
}

/* Returns the probe location to use given a number of steps taken so far.
 * If modified is true, we ignore h1.
 */
size_t locationOf(size_t tableSize, size_t numProbes, size_t h1, size_t h2, bool modified) {
  size_t result = (numProbes == 0 || !modified)? h1 : 0;
  result += h2 * numProbes;
  return result % tableSize;
}

/* Performs a double-hashing insert, returning the number of probes required to 
 * settle the element into its place.
 */
size_t insert(vector<bool>& table, size_t h1, size_t h2, bool modified) {
  size_t numProbes = 0;
  while (table[locationOf(table.size(), numProbes, h1, h2, modified)]) {
    numProbes++;
  }
  
  table[locationOf(table.size(), numProbes, h1, h2, modified)] = true;
  return numProbes + 1; // Count the original location as one probe
}

int main() {
  size_t normalProbes = 0, variantProbes = 0;

  for (size_t round = 0; round < kNumRounds; round++) {
    auto normalTable  = emptyTable(kTableSize);
    auto variantTable = emptyTable(kTableSize);

    /* Insert a collection of items into the table. */
    for (auto [h1, h2]: hashCodes(kTableSize * kLoadFactor)) {
      normalProbes  += insert(normalTable, h1, h2, false);
      variantProbes += insert(variantTable, h1, h2, true);
    }
  }
  
  cout << "Normal probes:  " << normalProbes << endl;
  cout << "Variant probes: " << variantProbes << endl;
}

Output:

Normal probes:  1150241942
Variant probes: 1214644088

So, empirically, it looks like the modified approach leads to about 5% more probes being needed to place all the elements. The question, then, is why this is.

I do not have a fully-developed theoretical explanation as to why the modified version is slower, but I do have a reasonable guess as to what's going on. Intuitively, double hashing works by assigning each element that's inserted a random probe sequence, which is some permutation of the table slots. It's not a truly-random permutation, since not all permutations can be achieved, but it's random enough for some definition of "random enough" (see, for example, Guibas and Szemendi's "The Analysis of Double Hashing).

Let's think about what happens when we do an insertion. How many times, on expectation, will we need to look at the probe sequence beyond just h₁? The first item has 0 probability of needing to look at h₂. The second item has 1/T probability, since it hits the first element with probability 1/T. The second item has 2/T probability, since it has a 2/T chance of hitting the first two items. More generally, using linearity of expectation, we can show that the the expected number of times an item will be in a spot that's already taken is given by

1/T + 2/T + 3/T + 4/T + ... + (n-1)/T

=(1+2+3+...+(n-1))/T

= n(n-1)/2T

Now, let's imagine that the load factor on our hash table is α, meaning that αT = n. Then the expected number of collisions works out to

αT(αT - 1) / 2T

≈ α²T / 2.

In other words, we should expect to see a pretty decent number of times where we need to inspect h₂ when using double hashing.

Now, what happens when we look at the probe sequence? The number of different probe sequences using traditional double hashing is T(T-1), where T is the number of slots in the table. This is because there are T possible choices of h₁(x) and T-1 choices for h₂(x).

The math behind the birthday paradox says that once approximately √(2T(T-1)) ≈ T√2 items have been inserted into the table, we have a 50% chance that two of them will end up having the same probe sequence assigned. The good news here is that it's not possible to insert T√2 items into a T-slot hash table - that's more elements than slots! - and so there's a fairly low probability that we see elements that get assigned the same probe sequences. That means that the conflicts we get in the table are mostly due to collisions between elements that have different probe sequences, but coincidentally end up landing smack on top of one another.

On the other hand, let's think about your variation. Technically speaking, there are still T(T-1) distinct probe sequences. However, I'm going to argue that there are "effectively" more like only T-1 distinct probe sequences. The reason for this is that

probe sequences don't really matter unless you have a collision when you do an insertion, and
once there's a collision after an insertion, the probe sequence for a given element is determined purely by its h₂ value.

This is not a rigorous argument - it's more of an intuition - but it shows that we have less variation in how our permutations get chosen.

Because there are only T-1 different probe sequences to pick from once we've had a collision, the birthday paradox says that we need to see about √(2T) collisions before we find two items with identical values of h₂. And indeed, given that we see, on expectation, α²T/2 items that need to have h₂ inspected, this means that we have a very good chance of finding items whose positions will be determined by the exact same sequence of h₂ values. This means that we have a new source of collisions compared with "classical" double hashing: collisions from h₂ values overlapping one another.

Now, even if we do have collisions with h₂ values, it's not a huge deal. After all, it'll only take a few extra probes to skip past items placed with the same h₂ values before we get to new slots. But I think that this might be the source of the extra probes seen during the modified version.

Hope this helps!

Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

Tags:

algorithm

data-structures

hashcode

hash

Y.Wang

1 Answers

templatetypedef

Recent Activity

Donate For Us

Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

Tags:

algorithm

data-structures

hashcode

hash

Y.Wang

1 Answers

templatetypedef

Related questions

Recent Activity

Donate For Us