In this article in the Joins Explained series I will look at the hash join. Understanding how the hash join works is a bit awkward, since it involves a hash function and that is a concept that humans consciously don’t use.
A hash function maps a possibly complex input value to a much simpler output value. Usually the output value allows very fast access to the data belonging to the input (e.g. by using it as an array index). If the function can be computed efficiently this gives a quick access path to the original data even in comparison to a search supported by an index.
We look again at an example involving the CD retailer used in the first parts of the series. This time you walk around in the store and glance at the CDs hoping to find a bargain. You have a preference for a certain style of music and certain artists. And although it would be difficult to enumerate all your favorite artists you easily recognize them if you spot the CD. This sort of imitates the fast hash lookup.
So you flick through the shelves and look for albums you might want to buy. For each album you quickly reach a conclusion depending on whether you like the artist or not. Having memorized your favorite artists is the only prerequisite for this to work. If you get a match then you take the album if all other constraints (like the price) are satisfied.
The order in which you scan the shelves is not really important and you simply finish when you have worked through every single shelf.
This algorithm also has a startup penalty by forcing you to build a hash table of one of the input sets before the actual join can start. Compared to the sort merge join this is easier as it involves only one sweep through the data and therefore the computational complexity is O(n) while sorting won’t be better than O(n log n). So in general you can build the hash faster than you would be able to do the sorting.
The hash also needs to be computed for only one input set and this could be the smaller one.
The results of a hash function are generally not comparable except for equality.
Building the hash table is not only faster than sorting the data, but it also has to be performed for only one of the input sets. This makes the hash join often the preferred choice over the sort merge join.
Generally speaking the hash join wins when you expect a large result set and the nested loop join wins when you expect a small result set. One of the most dominant problems in this area is the planner getting the estimates wrong and taking the other path.
The hash join can only be used if you use equality as join relation.
Building the hash needs temporary space and increasing
The hash join is also a good candidate for parallel processing. Again, this is not implemented in PostgreSQL as of today.