Joins Explained: Sort Merge Joins

This is the second article in the Joins Explained series. After looking into the nested loop join last time, the focus will be on the sort merge join this time.

Let’s continue to look at the example of a CD retailer used in the first article of the series. This time the store manager wants to reconcile the inventory. He has a list of all the albums that should be in the store. The manager decides to print the inventory list ordered by artist, since this is the way that the CDs are arranged in the shelves. Then he walks through the shop, collects all misplaced CDs and makes sure that all items are placed at the correct place according to the sorting criterion.

When he has arranged all CDs he picks up the inventory list, walks to the first shelf and compares the number of the first CD in stock to the number on his list. The second item on the list is the CD that is located just next to the first one. It must be that way, since the inventory list and the items in the shelves are ordered the same way. So the process is a simple comparison of the next item of the first set to the next item of the second set. He finishes when he reaches the end of his list.

Analysis

Both inputs must be sorted by the same criterion before the manager can start with the actual inventory. In case of the inventory list this is easy, because the list is already in the correct order. In case of the CDs in the store it is much harder, because he has to find all misplaced items and put them into the correct place.

There will be a delay before the first result can be provided, since the comparison can only start when all items are ordered.

Assuming that the sorting is done, each item will only be looked at once.

In general the output will be ordered the same way as the input when the job is done by one person. But it would be possible to split the job between different people and have each person work on one partition. In this case the first result would be delivered by the first person done and the output appears unordered.

Conclusion

Most of the resource consumption of the sort merge join is determined by the sorting of the two input sets. This makes it a good choice for small to medium data sets where the sort is reasonable fast and the output will be a significant part of the input.

The performance of the join starts to suffer when the sort operation has to use temporary space on disk to perform the sort. In this case it helps to increase the configuration setting work_mem.

The sorting is a cheap operation if the input is already sorted. In some cases the already sorted output may even save a sort operation in the following operation.

There is no designated driving set as both inputs are handled the same way during the join. So there is no need to have an index to support the join. On the other hand a missing index might be the reason why the planner won’t choose a nested loop to do the join.

The sort merge join is a candidate for parallel processing. This is currently not implemented in PostgreSQL but it might be a feature of a future version.