Joins Explained: Outer Joins

In the first parts of the series I tried to explain a bit of what the database system actually does when it executes the different join types. I presented details of the three widespread implementions: the Nested Loop, the Sort Merge Join and the Hash Join. In this article I will show some implications of using outer joins.

Generally the database system is free to choose whatever algorithm it thinks is adequate to answer the query. One of the situations where you put an additional constraint on the planner is when you specify an outer join between two data sets. If you are new to SQL you can find a description and examples of outer joins at Wikipedia.

If you look into the previous parts of the series you will see that the nested loop join and the hash join will use one of the data sets as a driver to look into the other data set. In this case there might be items in the second set that won’t be visited. With inner joins this is perfectly alright as the inner join is supposed to only provide the matching items as result.

The direction of the join suddenly matters when an outer join is specified. The database system is forced to take that data set as the driver (or choose a different algorithm) because otherwise it will be unable to deliver the items that have no matching counterpart. This may lead to situations where the planner might discard nested loop or hash joins as candidates and choose a sort merge join.

The following example might clarify the impact of an outer join. I will use two tables for this test. The table airport contains data for more than 3000 airports around the world. The table country has less than 250 rows and is referenced from airport by means of a foreign key country_id.

We start by doing a simple inner join:

stm@postgres=# explain select * from airport join country using (country_id);
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=7.54..141.03 rows=3642 width=101)
   Hash Cond: (airport.country_id = country.country_id)
   ->  Seq Scan on airport  (cost=0.00..83.42 rows=3642 width=82)
   ->  Hash  (cost=4.46..4.46 rows=246 width=23)
         ->  Seq Scan on country  (cost=0.00..4.46 rows=246 width=23)
(5 rows)

The planner decides to build a hash map on country and then do a sequential scan on airport to perform the hash lookups. The size of the tables differ significantly and the planner expects more than just a couple of rows as a result so the choice is reasonable.

Now we try a left outer join:

<pre><code>stm@postgres=# explain select * from airport left outer join country using (country_id);
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Left Join  (cost=7.54..141.03 rows=3642 width=101)
   Hash Cond: (airport.country_id = country.country_id)
   ->  Seq Scan on airport  (cost=0.00..83.42 rows=3642 width=82)
   ->  Hash  (cost=4.46..4.46 rows=246 width=23)
         ->  Seq Scan on country  (cost=0.00..4.46 rows=246 width=23)
(5 rows)

You can easily see that the same plan is chosen and even the calculated costs are identical to the first plan. Only the last step is labeled differently. This outer join does not change the plan because the driving data set is the same that the planner prefers for the inner join. So we don’t loose anything in this case.

The final test replaces the left outer join with a right outer join:

<pre><code>stm@postgres=# explain select * from airport right outer join country using (country_id);
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Merge Left Join  (cost=298.85..372.60 rows=3642 width=101)
   Merge Cond: (country.country_id = airport.country_id)
   ->  Index Scan using country_pk on country  (cost=0.00..18.50 rows=246 width=23)
   ->  Sort  (cost=298.85..307.96 rows=3642 width=82)
         Sort Key: airport.country_id
         ->  Seq Scan on airport  (cost=0.00..83.42 rows=3642 width=82)
(6 rows)

Now there is a difference. In this case the planner would have to use the country table as driver. Since the foreign key column is not indexed it would lead to an expensive nested loop. The second possibility would be a hash join using a hash map of the airport table. But here the planner has chosen to use a sort merge join instead. The higher costs show that the planner considers this statement to be slower compared to the previous tests. Running the query on my test database shows an execution time of 29.4ms for the left outer join compared to 45.0ms for the right outer join. In this case the planner simply has no choice to do better as we specifically requested this form of outer join.

Don’t be confused by the left join expression in the plan by the way. The planner simply decided to swap the sequence of tables here.

In addition to the left and right outer join there is also a full outer join available. This one is a combination as it returns items from both sets that have no matching counterpart. During testing the full outer join also lead to a sort merge join being used.

Conclusion

Choose outer joins wisely and only when the data requires them. Otherwise you might force the planner to reject better execution plans.