Common SQL JOIN pitfalls
In this post, we will be using the ecommerce data that we used in the previous article to explore the most common SQL pitfalls.
Let’s say your marketing team is going to provide you with a new dataset on the inventory stock levels for each of your products for sale on your ecommerce website. Before they do, you wanted to get more familiar with what products are on the website and what field you could use to potentially join on to other datasets.
Identifying Duplicate Records
First let’s find how many product names and product SKUs are on our website and whether either one of those fields is unique.
There are 2273 products and SKUs. So does that mean we have 2,273 unique product SKUs?
Let’s find the count of unique SKUs alone.
Hmmm. We have 1,909 distinct SKUs which does not match 2,273 as we expected.
Next, we need to determine which products have more than one SKU and which SKUs have more than one product.
Product name is not unique(expected for variants).
The query results show that some products have more than one SKU. One product name (e.g. T-Shirt) can have multiple product variants like color, size, etc.. It is expected that one product has many SKUs.
It looks like there are quite a few SKUs that have more than one product name. Several of the product names appear to be closely related with a few misspellings (e.g. Waterproof Gear Bag vs Waterpoof Gear Bag).
SKU is not unique which indicates data quality issues.
Let’s see why this could be an issue
Pitfall: Non-unique key
A SKU is designed to uniquely identify one product and will be the basis of our join condition when we join against other tables. Having a non-unique key can cause serious data issues as we’ll now explore.
Let’s understand the concept for just one SKU.
Let’s write a query to identify all the product names for the SKU ‘GGOEGPJC019099’
Looking at the query results, there is a special character in one name and a slightly different name for another.
Joining website data against our product inventory list
Let’s see the impact of joining on a dataset with multiple products for a single SKU. First let’s explore the product inventory dataset our team gave us to see if this SKU is unique there.
SELECT * FROM `data-to-insights.ecommerce.products`
WHERE SKU = ‘GGOEGPJC019099’
Just one record returned so this SKU is unique in the product inventory dataset.
Join pitfall: Unintentional many-to-one SKU relationship
Next, let’s join the inventory dataset against our website product names and SKUs so we can have the inventory stock level associated with each product for sale on the website.
Next, let’s run a query that will show the total stock level for each item in inventory.
The dog frisbee is not properly showing a stock level of 154. Instead it is triple counting(154 X3). This is called unintentional cross join.
Join pitfall solution: Use distinct SKUs before joining
What are the options to solve our triple counting dilemma? First we need to only select distinct SKUs from the website before joining on other datasets.
Write a query to return the count of distinct productSKU from `data-to-insights.ecommerce.all_sessions_raw`
There are 1909 distinct SKUs.
Joining the data
Now we’re ready to join against our product inventory dataset.
I am doing a LEFT JOIN to include all records from the website table regardless of whether there is a match on a product inventory SKU record.
Many inventory SKU values are NULL.
Let’s see how many SKUs are missing from our product inventory set.
819 products are missing (SKU IS NULL) from our product inventory dataset. There are several reasons for this:
- Some SKUs could be digital products that we dont store in inventory
- Old products we sold in past website orders are no longer offered in current inventory
- Legitimate missing data from inventory and should be tracked
Now, let’s see if there are any products are in the product inventory dataset but missing from the website?
There are two product SKUs missing from the website dataset.
The first product in the query results is a new product (no orders, no sentimentScore). The website dataset is past order transactions by customers brand new products which have never been sold and won’t show up in web analytics until they’re viewed or purchased.
What if you wanted one query that listed all products missing from either the website or inventory?
We can use a FULL JOIN.
We have our 819 + 2 = 821 product SKUs
LEFT JOIN + RIGHT JOIN = FULL JOIN which returns all records from both tables regardless of matching join keys. We then filter out where we have mismatches on either side.
We’ve successfully troubleshooted and solved some serious SQL join pitfalls by identifying duplicate records and knowing when to use each type of JOIN.