Recursive CTE project

October, 2022
 

Summary
Outcome
Project Steps

Summary

The data from the UK Gender Pay Gap service contains a column called siccodes which is based on the UK SIC code classification system. In another project in this portfolio, I created a local database of that data.

Unfortunately, the siccodes column contains a variable length, comma-delimited list of SIC codes, making it impossible to join directly to any SIC code information.

As you can see, above, some companies have multiple SIC codes—and many have an invalid code “1.”

The solution to the problem is to create a recursive common table expression (CTE) to return one row for every SIC code (and omitting the bogus “1s”).

This project does that.

PostgreSQL
pgAdmin

Outcome

The final outcome of the recursive CTE is a view, vw_gpg_sic_codes. The view joins gender_pay_gap to siccodes on the SIC code data.

With the view created, I use Python code to do some basic analysis and visualizations of the Gender Pay Gap data based on SIC code.

Project Steps

STEP ONE: Understand the basics of a recursive CTE.

A recursive CTE has two parts, an initial query, often called the “anchor” query, and the secondary query, often called the “recursive” query, which begins execution using the results of the anchor query and adds to it through recursion.

The execution can be graphically represented in a flowchart:

The two parts of the CTE form a UNION query. The recursive query’s FROM clause calls back to the CTE.

The following simple query demonstrates:

(Click image to open a new tab and access the SQL code.)

STEP TWO: Apply the concepts to the gender_pay_gap table.

With a basic understanding of recursive CTEs, the following is the process for applying this to the siccodes field:

  • Convert the comma-delimited siccodes field to a string array: STRING_TO_ARRAY( siccodes, CHR(10) ). The column we create with this SQL is siccode_array.
  • Call back to the CTE (gpg_siccode_recurs) for each element in siccode_array. This is done by comparing the length of the array to the counter column, counter, and continuing as long as the number of elements in the siccode_array column are greater than the incrementing counter:
    ARRAY_LENGTH( y.siccode_array, 1 ) >= y.counter + 1
(Click image to open a new tab and access the SQL code.)

STEP THREE:

With the view created, I can now use Python to create some visualizations based on the SIC information.

My Python code for this project is available via GitHub.