by boxplot    Mar 8, 2018 10:49 am  

Joins in SQL

Joins are one of the most important (if not THE most important) concepts in SQL. If you take the time to solidly understand how joins work, you’ll be in an excellent place for writing queries. So, let’s dive in!

Join Definitions

Joining tables in SQL is a way of combining them. It is directly comparable to a VLOOKUP in Excel, so if you are familiar wtih VLOOKUPs you already understand how joins work. For example, let’s say you have a table that contains orders placed by your customers (each row in that table represents a unique order). Then you have a second table that is only items that have been returned by your customers (each row in that table represents a unique order too, but not all orders are returned so the table has fewer rows).

Orders Table

order_id product cost_per_unit total_amount profit shipping_method
1 A 5 70 5 Air
2 B 10 120 97 Air
3 C 15 150 135 Air
4 A 5 20 10 Air
5 B 10 110 96 Air
6 C 15 30 5 Air
7 A 5 45 9 Air
8 B 10 150 64 Air
9 C 15 210 63 Air
10 A 5 65 44 Air
11 B 10 120 76 Air
12 C 15 30 18 Air
13 A 5 20 19 Ground
14 B 10 100 59 Ground
15 C 15 30 14 Ground
16 A 5 55 15 Ground
17 B 10 10 1 Ground
18 C 15 105 28 Ground
19 A 5 75 37 Ground
20 B 10 50 41 Ground
21 C 15 15 9 Ground
22 A 5 30 22 Ground
23 B 10 120 117 Ground
24 C 15 195 166 Ground
25 A 5 20 1 Ground

Returns Table

id amount_returned Reason
15 10 Broken
2 115 No Reason Given
24 85 Unsatisfied
7 15 Unsatisfied
12 20 Unsatisfied
9 130 Unsatisfied
17 10 No Reason Given
21 15 No Reason Given
18 70 Broken
17 10 Broken

You can’t just copy and paste these tables next to each other to figure out what was returned, because the rows won’t match up. You need a way of comparing each row in the orders table to the returns table to see if a match exists, and then bring back the information from the returns table that you’re interested in. Enter joins!

Let’s take a look at join syntax first:


SELECT *
FROM table_1 INNER JOIN table_2
ON table_1.id = table_2.id
	

Notice the ON statement, which is required for every join. You need to pick a column from each table that you want SQL to compare in order to join them together, and that’s what the ON statement is specifying. This example says compare the id column in table_1 to the id column in table_2 when trying to match up rows between the two tables during the join.

How do you choose which colunns to use for the ON statement? Generally you want to pick a column that uniquely identifies each row, like an id number. The two columns don’t have to have the exact same name, but they need to be the same type (for example, integer or text) and represent the same thing. The order_id columnn from the orders table above is the same thing as the id column in the returns table – they both represent the id of the order.

There are four types of joins: left, right, inner and outer. The right join functionally does the same thing as the left join, so many SQL programs are choosing not even to support it anymore. For that reason, we won’t be covering it in this article.

As usual, we’ll start with a basic example. Say you have a database with the following two tables in it:

nutrition1

id name calcium serving_size weight
1 apples 10 1 ea 138
2 asparagus 22 1/2 cup 90
3 avocado 19 1 ea 173
4 bamboo shoots 10 1 cup 131
5 banana 7 1 ea 114
6 beets 9 1/2 cup 85
7 blackberries 46 1 cup 144
8 Blueberries raw 9 1 cup 145
9 broccoli 205 1 spear 180
10 brussels sprouts 56 1 cup 156

nutrition2

id percent_water energy protein food_type
1 84 80 0.3 FRUITS
2 92 22 2.3 VEGETABLES AND LEGUMES
3 73 305 4 FRUITS
4 94 25 2.3 VEGETABLES AND LEGUMES
5 74 105 1.2 FRUITS
11 90 94 2.4 FRUITS
12 88 31 0.7 VEGETABLES AND LEGUMES
13 92 15 1.2 VEGETABLES AND LEGUMES
14 95 6 0.3 VEGETABLES AND LEGUMES
15 81 49 0.8 FRUITS

The nutrition1 table contains information about foods, and the nutrition2 table contains different information about foods. We’re going to join on the id column (this will be our ON statement) because it’s the unique identifier for each row in both of the tables. So, take note that nutrition1 has id numbers 1-10, but nutrition2 has id numbers 1-5 and then skips to 11-15.

Inner Joins

When you inner join two different tables together, SQL will compare the columns you specify in the ON statement and only keep rows that those columns have in common. Looking at our nutrition tables, the only id numbers that the two have in common are 1 through 5. So when we run this query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

we’ll get this result:

id name calcium serving_size weight id percent_water energy protein food_type
1 apples 10 1 ea 138 1 84 80 0.3 FRUITS
2 asparagus 22 1/2 cup 90 2 92 22 2.3 VEGETABLES AND LEGUMES
3 avocado 19 1 ea 173 3 73 305 4 FRUITS
4 bamboo shoots 10 1 cup 131 4 94 25 2.3 VEGETABLES AND LEGUMES
5 banana 7 1 ea 114 5 74 105 1.2 FRUITS

Cool! Now we have more information than we previously did about apples, asparagus, etc. Notice that the result displays 1-5 from nutrition1, lined up next to those columns 1-5 from nutrition2, and everything else is gone. No other rows appear in the result because inner joins only keep what the two tables have in common, and nothing else.

Left Joins

When you perform a left join, SQL keeps everything from the left table, no matter what, and then only pulls in information from the right table that matches the left based on the columns you specify in the ON statement. So when we run this query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

we’ll get this result:

id name calcium serving_size weight id percent_water energy protein food_type
1 apples 10 1 ea 138 1 84 80 0.3 FRUITS
2 asparagus 22 1/2 cup 90 2 92 22 2.3 VEGETABLES AND LEGUMES
3 avocado 19 1 ea 173 3 73 305 4 FRUITS
4 bamboo shoots 10 1 cup 131 4 94 25 2.3 VEGETABLES AND LEGUMES
5 banana 7 1 ea 114 5 74 105 1.2 FRUITS
6 beets 9 1/2 cup 85 NULL NULL NULL NULL NULL
7 blackberries 46 1 cup 144 NULL NULL NULL NULL NULL
8 Blueberries raw 9 1 cup 145 NULL NULL NULL NULL NULL
9 broccoli 205 1 spear 180 NULL NULL NULL NULL NULL
10 brussels sprouts 56 1 cup 156 NULL NULL NULL NULL NULL

For this result, we have the entire nutrition 1 table (id numbers 1-10) and lined up next to it the rows from the nutrition2 table that matched base on the id (just 1-5). Notice the NULL values. The nutrition2 table didn’t have rows for id numnbers 6-10, but we had to keep those rows because the definition of a left join tells us to keep everything from the LEFT table, so the nutrition2columns for id numbers 6-10 get filled in with NULLs.

Outer Joins

Finally, let’s take a look at what an outer join does. Here’s the query:


SELECT *
FROM nutrition1 INNER JOIN nutrition2
ON nutrition1.id = nutrition2.id

and the result:

id name calcium serving_size weight id percent_water energy protein food_type
1 apples 10 1 ea 138 1 84 80 0.3 FRUITS
2 asparagus 22 1/2 cup 90 2 92 22 2.3 VEGETABLES AND LEGUMES
3 avocado 19 1 ea 173 3 73 305 4 FRUITS
4 bamboo shoots 10 1 cup 131 4 94 25 2.3 VEGETABLES AND LEGUMES
5 banana 7 1 ea 114 5 74 105 1.2 FRUITS
6 beets 9 1/2 cup 85 NULL NULL NULL NULL NULL
7 blackberries 46 1 cup 144 NULL NULL NULL NULL NULL
8 Blueberries raw 9 1 cup 145 NULL NULL NULL NULL NULL
9 broccoli 205 1 spear 180 NULL NULL NULL NULL NULL
10 brussels sprouts 56 1 cup 156 NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL 11 90 94 2.4 FRUITS
NULL NULL NULL NULL NULL 12 88 31 0.7 VEGETABLES AND LEGUMES
NULL NULL NULL NULL NULL 13 92 15 1.2 VEGETABLES AND LEGUMES
NULL NULL NULL NULL NULL 14 95 6 0.3 VEGETABLES AND LEGUMES
NULL NULL NULL NULL NULL 15 81 49 0.8 FRUITS

Outer joins keep everything, no matter what, and line up rows that match. Here, we have the entire nutrition1 table, and next to it the nutrition2 table but broken up so that it matches the id numbers of nutrition1 properly.

Takeaways

Understanding the definitions of the joins and how each fundamentally works is very important in SQL. Usually, you’ll be working with so much data that you have to join the tables together in SQL and can’t just pull everything into Excel and do a VLOOKUP. Excel can only handle so many rows (a little over a million) and so many columns (a little over 16,000). That sounds like a lot but SQL databases can hold millions and with more and more data being generated and available every day, there’s a good chance that even at a small company you’ll be dealing with more information than can fit into Excel. Sometimes too, your data will fit into Excel but the VLOOKUP will be so computationally taxing that Excel will simply crash or go so slow it isn’t worth it. SQL is built for stuff like this, so it will compute joins much faster. Check out my understanding SQL blog post for a more in-depth discussino of what SQL is and when to use it.


Need help applying these concepts to your data? Ask your department or organization to subscribe to our DataChat program.

Subscribe  

Book A Single DataChat  


Continue to make data-driven decisions.

Sign up for our email guides that contains relevant tips, software tricks, and news from the data world.

*We never spam you or sell your information.

* indicates required

"What is SQL?"

"Using DB Browser for SQLite"


0 Shares
Share
Tweet
Share