Exploring IRS data with SQL in Google BigQuery
In this post, we will perform an analysis on the IRS dataset.
We will be answering these two questions through our analysis:
- Largest employer per U.S. state per 2015 filing
- Calculate the `avg_income` (revenue – expenses) and sort the results by avg_income from highest to lowest. Calculate ‘avg_income’ for the years 2013, 2014, 2015 and 2016.
About the IRS dataset
Form 990 is used by the United States Internal Revenue Service to gather financial information about nonprofit/exempt organizations. This BigQuery dataset can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. For a complete description of data variables available in this dataset, see the IRS’s extract documentation: https://www.irs.gov/uac/soi-tax-stats-annual-extract-of-tax-exempt-organization-financial-data
Filtering using common table expression(CTE)
First, I include ein (employer identification number), name, state and noemplyeesw3cnt (number of employees) in the SELECT statement.
Then I join irs_990_2015 table on the we are going to JOIN on the organizational details (irs_990_ein) table. I will use USING clause to as a key to join the tables.
If it’s the exact same name of the field across each different datasets, you can use the USING clause and it’ll save you just a little bit of typing.
Then I create a partition by state and order by the number of employees in descending order.
(PARTITION BY state ORDER BY noemplyeesw3cnt DESC )
RANK() OVER (PARTITION BY state ORDER BY noemplyeesw3cnt DESC ) AS rank
Now, I rank the records over this particular partition that I created in parentheses using a RANK() OVER function.
Now, we have to filter these records for the top ranked ones.
I am going convert the above query output into a temporary table using a WITH clause. The WITH clause is called the common table expression.
Finally, I select all the records from this temporary table and filter only the records with rank 1. And then de-duplicate the records by using a group by clause inside the temporary table.
So it looks like A FAMILY AFFAIR BRIDGES INCORPORATED has 787,000 employees in California. And then we have CROSSBRIDGES, KAISER FOUNDATIONAL HOSPITALS in Oregon.
I start by calculating the average revenue and expenses for the year 2015. Now, we want to include 2013 and 2014 too. I am going to use a table wildcard for that.
The table_suffix is exactly what that wildcard is replacing. The great thing about table suffix, is you can now invoke that as a new field and potentially filter on it in your where clause as well.
I’m going to just execute the following query so you can see the value of table suffix.
The query failed because anytime you invoke something that’s not an aggregate function, you immediately need to have a group by. So you can group by the index of the column, or you can actually group by the alias in this particular case.
The filing year is shown as 15, 13, 12, etc. because we hard coded this prefix: irs_990_20*.
I am going to concatenate the prefix in there.
I am going to use a WHERE to clause to only include the values from 2013 to 2015.
The query returned zero records because I mentioned ‘2013’ in the WHERE clause instead of mentioning ’13’. So, we have to keep in mind what that wildcard is set as.
I wrap the above query into a common table expression and calculate the average income.
The income of these charities have been growing from 2013 to 2015 but declined in the year 2016.