Exploring ecommerce data with SQL in Google BigQuery
Open BigQuery Console
Once you open BigQuery, click on the below direct link to bring in the public data-to-insights project into your BigQuery projects panel:
Lastly, ensure Legacy SQL is disabled by clicking:
- Compose Query
- Show Options
- Untick “Use Legacy SQL” (or leave it deselected)
- Hide Options
The schema for the data-to-insights ecommerce dataset is below.
Explore e-commerce data
Let’s start by analyzing the all_sessions table.
If you’ve given a dataset, the first thing that you have to do is really understand what fields are available for you to query and whether or not that dataset is of good quality.
Go through the schema and then see the details for the size of the dataset .
Click on the details tab to get info on the table:
The table has 21 million rows, that’s a decently big dataset.
All the Web Analytics associated with a visitor’s journey or interaction or experience with that site, is recorded in a single session. A session could include things like transactions or viewing different products, adding them into the cart and pretty much anything having to do with interacting with the Website.
The first question question to ask is what does an individual row represent?
See the preview to understand the data. Hover over to get the data type of each value.
Here, each row represents an unique visitor id.
Let’s see if there are any duplicate records in the dataset.
We don’t have any duplicate records in this dataset.
Doing this massive group by operation across 21 million rows is performance wise an expensive operation. So, you want to avoid doing massive group by unless we must and in this case, we have to.
Calculate the conversion rate
Let’s calculate the conversion rate for products with over 1000 units that have been added to cart or ordered that are not frisbees:
- How many unique times was the product part of an order?
- How many total units of the product were part of orders?
- Which product had the highest conversion rate?
The results show that people really like Google 25 ounce clear stainless steel bottle. So, a little bit more than a third of the times that it’s been viewed, it has been added to a cart.
Every row in our dataset is a view in the product so COUNT(*) gives the total number of product_views.
Product quantity indicates to us whether it was part of an incomplete. When I say an incomplete order that means it was added to the cart, but the person did not check out yet or it’s part of an order that has actually gone through completion and has revenue associated with it.
The COUNT(productQuantity) gives us the total number of potential orders.
The SUM(productQuantity) gives us the quantity of products that were added.
The conversion rate is potential orders divided by the total product views.
(COUNT(productQuantity) / COUNT(*)) AS conversion_rate
I convert all the product names to lower case using LOWER keyword (you can also use UPPER) and then use a WHERE clause to exclude products with the name “frisbee”.
Always use WHERE clause before aggregation and HAVING clause after aggregation.
I then group by product names and filter products with more than 1000 items.
Finally, I order the results by conversion rates in descending order.
Track the visitor checkout progress
Write a query that shows the eCommerceAction_type and the distinct count of fullVisitorId associated with each type.
Now, let’s calculate how many unique visitors are in the checkout process. We will be using eCommerceAction_type and fullVistorId columns.
The eCommerceAction_type column has the following mappings:
- Unknown = 0
- Click through of product lists = 1
- Product detail views = 2
- Add product(s) to cart = 3
- Remove product(s) from cart = 4
- Check out = 5
- Completed purchase = 6
- Refund of purchase = 7
- Checkout options = 8
I use DISTINCT keyword to select the unique fullVisitorId’s as unique visitors.
I use a CASE statement to create a new column called checkout_stage and to map each eCommerceAction_type to the corresponding label. Finally, I group and order by the ecommerceAction_type.
Track Abandoned Carts from high quality sessions
Let’s write a query that returns the unique session ids of those visitors who have added a product to their cart but never completed checkout (abandoned their shopping cart).
From our Schema, we know that the visit ID is an identifier for the session, this is only unique to the user. For a complete Unique ID, you should combine full visitor ID and visit ID into what we’re going to call unique session ID.
We also use a sessionQualityDim to measure the quality of a session. Basically, it’s an estimate of how close a particular session was to transacting from one to 100. SUM(productRevenue) gives the revenues of the transactions.
To know where a user is in the checkout process, we can use the e-commerce action type field. We are using a max function to make sure the user has added the product to the cart at least and not just looked at those pages.
I choose sessions that have sessionQualityDim more than 60, and group by unique session ID and sessionQualityDim.
Finally, I use a Having clause to filter sessions whose checkout progress value is three and transaction revenue is zero.