Recursive CTE project
October, 2022
Summary
Outcome
Project Steps
Summary
The data from the UK Gender Pay Gap service contains a column called siccodes
which is based on the UK SIC code classification system. In another project in this portfolio, I created a local database of that data.
Unfortunately, the siccodes
column contains a variable length, comma-delimited list of SIC codes, making it impossible to join directly to any SIC code information.
As you can see, above, some companies have multiple SIC codes—and many have an invalid code “1.”
The solution to the problem is to create a recursive common table expression (CTE) to return one row for every SIC code (and omitting the bogus “1s”).
This project does that.
PostgreSQL
pgAdmin
Outcome
The final outcome of the recursive CTE is a view, vw_gpg_sic_codes
. The view joins gender_pay_gap
to siccodes
on the SIC code data.
With the view created, I use Python code to do some basic analysis and visualizations of the Gender Pay Gap data based on SIC code.
Project Steps
STEP ONE: Understand the basics of a recursive CTE.
A recursive CTE has two parts, an initial query, often called the “anchor” query, and the secondary query, often called the “recursive” query, which begins execution using the results of the anchor query and adds to it through recursion.
The execution can be graphically represented in a flowchart:
The two parts of the CTE form a UNION query. The recursive query’s FROM clause calls back to the CTE.
The following simple query demonstrates:
STEP TWO: Apply the concepts to the gender_pay_gap
table.
With a basic understanding of recursive CTEs, the following is the process for applying this to the siccodes
field:
- Convert the comma-delimited
siccodes
field to a string array:STRING_TO_ARRAY( siccodes, CHR(10) )
. The column we create with this SQL issiccode_array
. - Call back to the CTE (
gpg_siccode_recurs
) for each element insiccode_array
. This is done by comparing the length of the array to the counter column,counter
, and continuing as long as the number of elements in thesiccode_array
column are greater than the incrementing counter:ARRAY_LENGTH( y.siccode_array, 1 ) >= y.counter + 1
STEP THREE:
With the view created, I can now use Python to create some visualizations based on the SIC information.
My Python code for this project is available via GitHub.