Introduction

Hashing
A hash function takes data (like a string, or a file’s contents) and outputs a hash, a fixed-size string or number. For example, here’s the MD5 hash (MD5 is a common hash function) for a file simply containing “cake”:
DF7CE038E2FA96EDF39206F898DF134D
And here’s the hash for the same file after it was edited to be “cakes”:
0E9091167610558FDAE6F69BD6716771
Notice the hash is completely different, even though the files were similar. Here's the hash for a long film I have on my hard drive:
664f67364296d08f31aec6fea4e9b83f
The hash is the same length as my other hashes, but this time it represents a much bigger file—461Mb.
We can think of a hash as a "fingerprint." We can trust that a given file will always have the same hash, but we can't go from the hash back to the original file. Sometimes we have to worry about multiple files having the same hash value, which is called a hash collision.
Some uses for hashing:
- Dictionaries. Suppose we want a list-like data structure with constant-time lookups, but we want to look up values based on arbitrary "keys," not just sequential "indices." We could allocate a list, and use a hash function to translate keys into list indices. That's the basic idea behind a dictionary!
- Preventing man-in-the-middle attacks. Ever notice those things that say "hash" or "md5" or "sha1" on download sites? The site is telling you, "We hashed this file on our end and got this result. When you finish the download, try hashing the file and confirming you get the same result. If not, your internet service provider or someone else might have injected malware or tracking software into your download!"
Hashing is the most common example of a space-time tradeoff. Instead of linearly searching an array every time to determine if an element is present, which takes O(n) time, we can traverse the array once and hash all the elements into a hash table. Determining if the element is present is a simple matter of hashing the element and seeing if it exists in the hash table, which is O(1) on average.
In the case of hash collisions, there are a number of collision resolution techniques that can be used. You will unlikely be asked about details of collision resolution techniques in interviews:
- Separate chaining - A linked list is used for each value, so that it stores all the collided items.
- Open addressing - All entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.
Hashing (hash function)
- in a hash table, a new index is processed using the keys, and, the element corresponding to that key is stored in the index → this process is called hashing
- hashing is a technique of mapping a large set of arbitrary data to tabular indexes using a hash function, it is a method for representing dictionaries for large datasets
- linear search and binary search perform lookups/search with time complexity of
O(n)
and O(log(n)
) respectively, as the size of the dataset increases, these complexities also become significantly high which is not acceptable.
- we needed a technique that does not depend on the size of data, so hashing allows lookups to occur in constant time i.e.
O(1)
- let k be a key and h(x) be a hash function.
- h(k) will give us a new index to store the element linked with k