UK Gender Pay Gap project

February 4, 2023
 

Summary
Outcome
Project steps

Summary

This project involved creating a local PostgreSQL database of data from the UK Gender Pay Gap service. Particularly, I wanted a local database so that I could demonstrate a recursive common table expression as a means of relating employer info from the service’s data with industry designations based on UK SIC codes.

In this project, I used Python to create a truncated version of the data and then used Excel to identify data types and create the SQL needed to build the database. I also used Excel to create CSV files for the SIC code tables. I had already installed PostgreSQL Server for another project, so that wasn’t part of this project. For the final part of the project, I used pgAdmin to connect to my PostgreSQL Server and import the data from the UK Gender Pay Gap service.

Python
Excel
PostgreSQL Server
pgAdmin

Outcome

The project provide me with in a PostgreSQL database of 9 tables and a view.

  • employersizes: A lookup table to permit sorting when reporting the employersize field.
  • gender_pay_gap: The data from the UK Gender Pay Gap service.
  • sicactivities: A list of commercial activities and their associated UK SIC codes.
  • sicclasses: A list of the items at the fourth level of SIC classification hierarchy.
  • siccodes: A list of all SIC codes used in the SIC classification hierarchy.
  • sicdivisions: A list of the items at the second level of SIC classification hierarchy.
  • sicgroups: A list of the items at the third level of the SIC classification hierarchy.
  • sicsections: A list of the items at the top level of the SIC classification hierarchy.
  • sicsubsclasses: A list of the items at the fifth, or bottom, level of the SIC classification hierarchy.
  • vw_gpg_sic_codes: A view that facilitates joining the gender_pay_gap table with the siccodes table. Based on a recursive CTE.
Project steps
  1. Download the data.

    The UK Gender Pay Gap data can be found here:
    https://gender-pay-gap.service.gov.uk/viewing/download

    I downloaded all six (as of February 4, 2023) files.

    The UK SIC data can be found here:
    https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list
     
  2. Using Python, create a truncated CSV file of the gender pay gap data.

    Technically, this step was unnecessary. The purpose of the step was to use Excel to identify data types and create SQL to create the gender pay gap table in pgAdmin. However, none of the original files from the UK Gender Pay Gap service was so large that Excel would have been bogged down. (On my laptop, Excel gets bogged down with files in excess of about 50 MB or so.)

    However, I had written Python code for another project which involved millions of rows of data, so I simply applied it here.
     
  3. Import the truncated CSV file into Excel.

    I used Power Query to control how the data types appeared. I chose to import all the columns as text, so that I could see the data in its original form and make my own judgments about data type.
     
  4. Write SQL for the gender pay gap table.

    I used Excel to expedite the creation process, minimize typos and similar. You can see this here.

    Of note, the gender pay gap data has columns of numeric data in which many rows contain empty values. I thought that the simplest method of importing the data into pgAdmin was to create the columns as text first, import the data, update the empty strings to null values, then alter the table to change the columns to numeric data. I did so as step six.

    Also of note, the gender pay gap data does not have a unique identifier that we can use across multiple years, so the CREATE TABLE sql creates a sequence object, declares a primary key (“GPGKey”) and uses the sequence object to fill in unique values.
     
  5. Create the gender pay gap table and import the CSVs.

    Run the CREATE TABLE sql in the query window and then use the pgAdmin Import/Export feature to import the CSV files, one at a time. During this import, with each CSV file, we need to remove the GPGKey, as it is not in the source data. (See notes under Step Four above.)
     
  6. Modify the gender pay gap table.

    See the note for step four, above: after completing the import for all six CSV files, we can update the empty strings to null and alter the table to change the numeric columns from data type varchar to data type numeric.
      
  7. Use Excel, to create CSV files for the UK SIC data.

    The UK SIC classification hierarchy required manipulation in order to properly normalize the data and create CSV files. All of the work is visible in this Excel file.
     
  8. Create the SIC tables and import the CSVs.

    This step is akin to Step Five: we run the SQL commands that create the seven SIC tables, then use pgAdmin’s Import/Export feature to import the CSV files created in Step Seven.
     
  9. Create the vw_gpg_sic_codes view.

    This step is covered as a separate project. See “Recursive CTE Project.”
     
  10. Run SQL to create and populate the employersizes table.

    This is a simple table, and we can use SQL that creates the table and fills in the values.