Understanding JOINs in SQL

Understanding JOINs in SQL

by boxplotanalytics    Sep 1, 2019 2:01 am  

Joins in SQL

Joins are one of the most important (if not THE most important) concepts in SQL. If you take the time to solidly understand how joins work, you’ll be in an excellent place for writing queries. So, let’s dive in!

Join Definitions

Joining tables in SQL is a way of combining them. It is directly comparable to a VLOOKUP in Excel, so if you are familiar wtih VLOOKUPs you already understand how joins work. For example, let’s say you have a table that contains orders placed by your customers (each row in that table represents a unique order). Then you have a second table that is only items that have been returned by your customers (each row in that table represents a unique order too, but not all orders are returned so the table has fewer rows).

Orders Table

order_idproductcost_per_unittotal_amountprofitshipping_method
1A5705Air
2B1012097Air
3C15150135Air
4A52010Air
5B1011096Air
6C15305Air
7A5459Air
8B1015064Air
9C1521063Air
10A56544Air
11B1012076Air
12C153018Air
13A52019Ground
14B1010059Ground
15C153014Ground
16A55515Ground
17B10101Ground
18C1510528Ground
19A57537Ground
20B105041Ground
21C15159Ground
22A53022Ground
23B10120117Ground
24C15195166Ground
25A5201Ground

Returns Table

idamount_returnedReason
1510Broken
2115No Reason Given
2485Unsatisfied
715Unsatisfied
1220Unsatisfied
9130Unsatisfied
1710No Reason Given
2115No Reason Given
1870Broken
1710Broken

You can’t just copy and paste these tables next to each other to figure out what was returned, because the rows won’t match up. You need a way of comparing each row in the orders table to the returns table to see if a match exists, and then bring back the information from the returns table that you’re interested in. Enter joins!

Let’s take a look at join syntax first:


SELECT *
FROM table_1 INNER JOIN table_2
ON table_1.id = table_2.id
	

Notice the ON statement, which is required for every join. You need to pick a column from each table that you want SQL to compare in order to join them together, and that’s what the ON statement is specifying. This example says compare the id column in table_1 to the id column in table_2 when trying to match up rows between the two tables during the join.

How do you choose which colunns to use for the ON statement? Generally you want to pick a column that uniquely identifies each row, like an id number. The two columns don’t have to have the exact same name, but they need to be the same type (for example, integer or text) and represent the same thing. The order_id columnn from the orders table above is the same thing as the id column in the returns table – they both represent the id of the order.

There are four types of joins: left, right, inner and outer. The right join functionally does the same thing as the left join, so many SQL programs are choosing not even to support it anymore. For that reason, we won’t be covering it in this article.

As usual, we’ll start with a basic example. Say you have a database with the following two tables in it:

nutrition1

idnamecalciumserving_sizeweight
1apples101 ea138
2asparagus221/2 cup90
3avocado191 ea173
4bamboo shoots101 cup131
5banana71 ea114
6beets91/2 cup85
7blackberries461 cup144
8Blueberries raw91 cup145
9broccoli2051 spear180
10brussels sprouts561 cup156

nutrition2

idpercent_waterenergyproteinfood_type
184800.3FRUITS
292222.3VEGETABLES AND LEGUMES
3733054FRUITS
494252.3VEGETABLES AND LEGUMES
5741051.2FRUITS
1190942.4FRUITS
1288310.7VEGETABLES AND LEGUMES
1392151.2VEGETABLES AND LEGUMES
149560.3VEGETABLES AND LEGUMES
1581490.8FRUITS

The nutrition1 table contains information about foods, and the nutrition2 table contains different information about foods. We’re going to join on the id column (this will be our ON statement) because it’s the unique identifier for each row in both of the tables. So, take note that nutrition1 has id numbers 1-10, but nutrition2 has id numbers 1-5 and then skips to 11-15.

Inner Joins

When you inner join two different tables together, SQL will compare the columns you specify in the ON statement and only keep rows that those columns have in common. Looking at our nutrition tables, the only id numbers that the two have in common are 1 through 5. So when we run this query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

we’ll get this result:

idnamecalciumserving_sizeweightidpercent_waterenergyproteinfood_type
1apples101 ea138184800.3FRUITS
2asparagus221/2 cup90292222.3VEGETABLES AND LEGUMES
3avocado191 ea1733733054FRUITS
4bamboo shoots101 cup131494252.3VEGETABLES AND LEGUMES
5banana71 ea1145741051.2FRUITS

Cool! Now we have more information than we previously did about apples, asparagus, etc. Notice that the result displays 1-5 from nutrition1, lined up next to those columns 1-5 from nutrition2, and everything else is gone. No other rows appear in the result because inner joins only keep what the two tables have in common, and nothing else.

Left Joins

When you perform a left join, SQL keeps everything from the left table, no matter what, and then only pulls in information from the right table that matches the left based on the columns you specify in the ON statement. So when we run this query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

we’ll get this result:

idnamecalciumserving_sizeweightidpercent_waterenergyproteinfood_type
1apples101 ea138184800.3FRUITS
2asparagus221/2 cup90292222.3VEGETABLES AND LEGUMES
3avocado191 ea1733733054FRUITS
4bamboo shoots101 cup131494252.3VEGETABLES AND LEGUMES
5banana71 ea1145741051.2FRUITS
6beets91/2 cup85NULLNULLNULLNULLNULL
7blackberries461 cup144NULLNULLNULLNULLNULL
8Blueberries raw91 cup145NULLNULLNULLNULLNULL
9broccoli2051 spear180NULLNULLNULLNULLNULL
10brussels sprouts561 cup156NULLNULLNULLNULLNULL

For this result, we have the entire nutrition 1 table (id numbers 1-10) and lined up next to it the rows from the nutrition2 table that matched base on the id (just 1-5). Notice the NULL values. The nutrition2 table didn’t have rows for id numnbers 6-10, but we had to keep those rows because the definition of a left join tells us to keep everything from the LEFT table, so the nutrition2columns for id numbers 6-10 get filled in with NULLs.

Outer Joins

Finally, let’s take a look at what an outer join does. Here’s the query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

and the result:

idnamecalciumserving_sizeweightidpercent_waterenergyproteinfood_type
1apples101 ea138184800.3FRUITS
2asparagus221/2 cup90292222.3VEGETABLES AND LEGUMES
3avocado191 ea1733733054FRUITS
4bamboo shoots101 cup131494252.3VEGETABLES AND LEGUMES
5banana71 ea1145741051.2FRUITS
6beets91/2 cup85NULLNULLNULLNULLNULL
7blackberries461 cup144NULLNULLNULLNULLNULL
8Blueberries raw91 cup145NULLNULLNULLNULLNULL
9broccoli2051 spear180NULLNULLNULLNULLNULL
10brussels sprouts561 cup156NULLNULLNULLNULLNULL
NULLNULLNULLNULLNULL1190942.4FRUITS
NULLNULLNULLNULLNULL1288310.7VEGETABLES AND LEGUMES
NULLNULLNULLNULLNULL1392151.2VEGETABLES AND LEGUMES
NULLNULLNULLNULLNULL149560.3VEGETABLES AND LEGUMES
NULLNULLNULLNULLNULL1581490.8FRUITS

Outer joins keep everything, no matter what, and line up rows that match. Here, we have the entire nutrition1 table, and next to it the nutrition2 table but broken up so that it matches the id numbers of nutrition1 properly.

Takeaways

Understanding the definitions of the joins and how each fundamentally works is very important in SQL. Usually, you’ll be working with so much data that you have to join the tables together in SQL and can’t just pull everything into Excel and do a VLOOKUP. Excel can only handle so many rows (a little over a million) and so many columns (a little over 16,000). That sounds like a lot but SQL databases can hold millions and with more and more data being generated and available every day, there’s a good chance that even at a small company you’ll be dealing with more information than can fit into Excel. Sometimes too, your data will fit into Excel but the VLOOKUP will be so computationally taxing that Excel will simply crash or go so slow it isn’t worth it. SQL is built for stuff like this, so it will compute joins much faster. Check out my understanding SQL blog post for a more in-depth discussino of what SQL is and when to use it.


Need help applying these concepts to your data?

Chat with us about options.

Schedule a Meeting   


Continue to make data-driven decisions.

Sign up for our email guides that contains relevant tips, software tricks, and news from the data world.

*We never spam you or sell your information.

* indicates required

"Understanding Multiple JOINs in SQL"

"Python Resources"