Hash Table

Hash Table#

A hash table is a collection of items which are stored in such a way as to make it easy to find them later. In order to do this, we will need to know even more about where the items might be when we go to look for them in the collection. If every item is where it should be, then the search can use a single comparison to discover the presence of an item. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0.

The mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function will take any item in the collection and return an integer in the range of slot names, between 0 and m-1.

Ratio of number of items in a hash table to size of hash table is called load factor.

Hash Functions#

Given a collection of items, a hash function that maps each item into a unique slot is referred to as a perfect hash function. If we know the items and the collection will never change, then it is possible to construct a perfect hash function (refer to the exercises for more about perfect hash functions).

Unfortunately, given an arbitrary collection of items, there is no systematic way to construct a perfect hash function. Luckily, we do not need the hash function to be perfect to still gain performance efficiency.

One way to always have a perfect hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. This guarantees that each item will have a unique slot. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large.

A good hash function has following properties:

It minimizes the number of collisions
It is easy to compute.
It evenly distributes the items in the hash table.

The folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value.

Example

The phone number 436-555-4601, we would take the digits and divide them into groups of 2 (43,65,55,46,01). After the addition, 43+65+55+46+01, we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case 210 % 11 is 1, so the phone number 436-555-4601 hashes to slot 1.

Another numerical technique for constructing a hash function is called the mid-square method. We first square the item, and then extract some portion of the resulting digits.

Example

If the item were 44, we would first compute 44^2=1,936. By extracting the middle two digits, 93, and performing the remainder step, we get 5 (93 % 11) so the number 44 hashes to slot 5.

Collision Resolution#

When two items hash to the same slot it is called collision, we must have a systematic method for placing the second item in the hash table. This process is called collision resolution. If the hash function is perfect, collisions will never occur. However, since this is often not possible, collision resolution becomes a very important part of hashing.

The general name for this process of looking for another slot after a collision is rehashing. In general, 𝑟𝑒ℎ𝑎𝑠ℎ(𝑝𝑜𝑠)=(𝑝𝑜𝑠+𝑠𝑘𝑖𝑝)%𝑠𝑖𝑧𝑒𝑜𝑓𝑡𝑎𝑏𝑙𝑒. It is important to note that the size of the “skip” must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number.

One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty, we may need to go back to the first slot (circularly) to cover the entire hash table. This collision resolution process is referred to as open addressing in that it tries to find the next open slot or address in the hash table. By systematically visiting each slot one at a time, we are performing an open addressing technique called linear probing. A disadvantage to linear probing is the tendency for clustering; items become clustered in the table.
A variation of the linear probing idea is called quadratic probing. Instead of using a constant “skip” value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is h, the successive values are ℎ+1, ℎ+4, ℎ+9, ℎ+16, and so on. In general, the i will be i^2 𝑟𝑒ℎ𝑎𝑠ℎ(𝑝𝑜𝑠)=(ℎ+𝑖^2). In other words, quadratic probing uses a skip consisting of successive perfect squares.
An alternative method for handling the collision problem is to allow each slot to hold a reference to a collection (or chain) of items. Chaining allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table.

Hash Table Abstract Data Type#

The hash table abstract data type is defined as follows. The structure is an unordered collection of associations between a key and a data value. The keys in a hash table are all unique so that there is a one-to-one relationship between a key and a value.

Hash Table Operations#

HashTable() Create a new, empty hash table. It returns an empty hash table collection.
put(key,val) Add a new key-value pair to the hash table. If the key is already in the hash table then replace the old value with the new value.
get(key) Given a key, return the value stored in the hash table or None otherwise.
delete(key) Delete the key-value pair from the hash table.
length() Return the number of key-value pairs stored in the hash table.
in Return True for a statement of the form key in hash table, if the given key is in the hash table, False otherwise.

Implementing a Hash Table in Python#

An abstract data type (ADT) when given a physical implementation then we refer to the implementation as a data structure.

In any object-oriented programming language, the implementation of choice for an abstract data type such as a Hash Table is the creation of a new class. The Hash Table operations are implemented as methods. Further, to implement a Hash Table, which is a associative data type where you can store key–data pairs. The key is used to look up the associated data value.

In Python, the Dictionary data types represent the implementation of hash tables. The Keys in the dictionary satisfy the following requirements. The keys of the dictionary are hashable i.e. the are generated by hashing function which generates unique result for each unique value supplied to the hash function.

class HashTable:

    def __init__(self):
        self.size = 11
        self.num_items = 0
        self.keys = [None] * self.size
        self.values = [None] * self.size
    
    @staticmethod
    def hash(value, size):
        return value % size
    
    @staticmethod
    def rehash(old_hash, skip, size):
        return (old_hash + skip) % size
    
    def put(self, key, value):
        self.num_items += 1
        
        hash_value = self.hash(key, self.size)

        if self.keys[hash_value] is None or self.keys[hash_value] == key:
           self.keys[hash_value] = key
           self.values[hash_value] = value
           return
        
        rehash_value = self.rehash(hash_value, 1, self.size)
        
        while self.keys[rehash_value] != None and self.keys[rehash_value] != key:
            rehash_value = self.rehash(rehash_value, 1, self.size)
        
        if self.keys[rehash_value] == None or self.keys[rehash_value] == key:
           self.keys[rehash_value] = key
           self.values[rehash_value] = value
           return
    
    def get(self, key):
        hash_value = self.hash(key, self.size)

        if self.keys[hash_value] == key:
           return self.values[hash_value]
        
        rehash_value = self.rehash(hash_value, 1, self.size)
        
        while self.keys[rehash_value] != None and self.keys[rehash_value] != key:
            rehash_value = self.rehash(rehash_value, 1, self.size)
        
        return self.values[rehash_value]
    
    def delete(self, key):
        self.num_items -= 1
        
        hash_value = self.hash(key, self.size)

        if self.keys[hash_value] is None or self.keys[hash_value] == key:
           self.keys[hash_value] = None
           self.values[hash_value] = None
           return
        
        rehash_value = self.rehash(hash_value, 1, self.size)
        
        while self.keys[rehash_value] != None and self.keys[rehash_value] != key:
            rehash_value = self.rehash(rehash_value, 1, self.size)
        
        if self.keys[rehash_value] == None or self.keys[rehash_value] == key:
           self.keys[rehash_value] = None
           self.values[rehash_value] = None
           return
    
    def length(self):
        return self.num_items

# Create an emplty hash table
h = HashTable()

# Put few items with keys and values
h.put(54, "cat")
h.put(26, "dog")
h.put(93, "lion")
h.put(17, "tiger")
h.put(77, "bird")
h.put(44, "goat")

# Give the value with key 26
print(h.get(26))

# Give the value with key 44
print(h.get(44))

# Give length of the hash table
print(h.length())

# Put few more items with keys and values
h.put(31, "cow")
h.put(55, "pig")
h.put(20, "chicken")

# Delete value with key 44 
h.delete(44)

# Give length of the hash table
print(h.length())

dog
goat
6
8

Bonus Material#

Modern Python Dictionaries - A confluence of a dozen great ideas
- By Raymond Hettinger
Python Hash Sets Explained & Demonstrated - Computerphile
- By Computerphile