Tabular magic

Tables are a marvelous way of organizing data but they come at a cost, which is projecting multidimensional data onto a two-dimensional representation. Relational database management systems handle storage and retrieval of tabular data. The incantations to do this are uttered in structured query language—SQL—a powerful set of magical spells. For a programming language, it's quite readable.

SELECT custid, balance FROM orders;

But things can get quickly more complicated as anyone who uses SQL only occasionally is soon reminded. My most recent reminder came while preparing age pyramids from Census data. Age pyramids show the distribution of population, like this one.

US Age Pyramid

The source data was cut and pasted from U.S. Census Bureau, U.S. Department of Commerce. Age and Sex. American Community Survey, ACS 1-Year Estimates Subject Tables, Table S0101, 2023. Accessed on February 8, 2025.

age_groups male female
Under 5 years 9373156 8960541
5 to 9 years 10136158 9663272
10 to 14 years 10877744 10326135
15 to 19 years 11364472 10803918
20 to 24 years 11073959 10544424
25 to 29 years 11090992 10815714
30 to 34 years 11811352 11593704
35 to 39 years 11437479 11212620
40 to 44 years 11100697 11025788
45 to 49 years 9893835 9965395
50 to 54 years 10297208 10364733
55 to 59 years 9929586 10268922
60 to 64 years 10543076 11132960
65 to 69 years 9025813 10001148
70 to 74 years 7367430 8430427
75 to 79 years 5129494 6189257
80 to 84 years 3013517 4027902
85 years and over 2263405 3858663

That's easy enough, but what I was after was to do it for New England, its states and their counties. The Census county tables look like this

    "GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E","B01001_003M","B01001_004E","B01001_004M","B01001_005E","B01001_005M","B01001_006E","B01001_006M","B01001_007E","B01001_007M","B01001_008E","B01001_008M","B01001_009E","B01001_009M","B01001_010E","B01001_010M","B01001_011E","B01001_011M","B01001_012E","B01001_012M","B01001_013E","B01001_013M","B01001_014E","B01001_014M","B01001_015E","B01001_015M","B01001_016E","B01001_016M","B01001_017E","B01001_017M","B01001_018E","B01001_018M","B01001_019E","B01001_019M","B01001_020E","B01001_020M","B01001_021E","B01001_021M","B01001_022E","B01001_022M","B01001_023E","B01001_023M","B01001_024E","B01001_024M","B01001_025E","B01001_025M","B01001_026E","B01001_026M","B01001_027E","B01001_027M","B01001_028E","B01001_028M","B01001_029E","B01001_029M","B01001_030E","B01001_030M","B01001_031E","B01001_031M","B01001_032E","B01001_032M","B01001_033E","B01001_033M","B01001_034E","B01001_034M","B01001_035E","B01001_035M","B01001_036E","B01001_036M","B01001_037E","B01001_037M","B01001_038E","B01001_038M","B01001_039E","B01001_039M","B01001_040E","B01001_040M","B01001_041E","B01001_041M","B01001_042E","B01001_042M","B01001_043E","B01001_043M","B01001_044E","B01001_044M","B01001_045E","B01001_045M","B01001_046E","B01001_046M","B01001_047E","B01001_047M","B01001_048E","B01001_048M","B01001_049E","B01001_049M",
    "0500000US01001","Autauga County, Alabama","59285","*****","28669","263","1720","223","1855","279","2115","272","1306","84","575","97","214","157","336","124","1145","239","1930","154","1772","51","2264","239","1489","228","2001","239","1931","89","2133","236","743","177","1004","253","584","161","833","174","1098","213","980","148","338","104","303","116","30616","263","1710","204","1556","296","2223","306","1441","160","600","114","123","77","611","184","988","172","1841","41","2062","123","1744","278","2405","286","2016","114","1956","46","2231","267","758","185","1013","240","590","148","835","187","1519","203","1220","219","680","153","494","178",

This is a job for SQL, so I import it into my census database and turn for help to my buddy Claude

        I have postgres tables census.counties with the following schema
        
        geocoder=# \d census.counties
        Table "census.counties"
        Column   |            Type             | Collation | Nullable |                   Default                    
        ------------+-----------------------------+-----------+----------+----------------------------------------------
        gid        | integer                     |           | not null | nextval('census.counties_gid_seq'::regclass)
        statefp    | character varying(2)        |           |          | 
        countyfp   | character varying(3)        |           |          | 
        countyns   | character varying(8)        |           |          | 
        geoidfq    | character varying(14)       |           |          | 
        geoid      | character varying(5)        |           |          | 
        name       | character varying(100)      |           |          | 
        namelsad   | character varying(100)      |           |          | 
        stusps     | character varying(2)        |           |          | 
        state_name | character varying(100)      |           |          | 
        lsad       | character varying(2)        |           |          | 
        aland      | double precision            |           |          | 
        awater     | double precision            |           |          | 
        geom       | geometry(MultiPolygon,4269) |           |          | 
        Indexes:
        "counties_pkey" PRIMARY KEY, btree (gid)
        "counties_geoid_key" UNIQUE CONSTRAINT, btree (geoid)
        "counties_geom_idx" gist (geom)
        "counties_name_idx" btree (name)
        Referenced by:
        TABLE "census.variable_data" CONSTRAINT "variable_data_geoid_fkey" FOREIGN KEY (geoid) REFERENCES census.counties(geoid)
        
        and a table census.variable_data with
        
        geocoder=# \d census.variable_data
        Table "census.variable_data"
        Column     |          Type          | Collation | Nullable | Default 
        ---------------+------------------------+-----------+----------+---------
        geoid         | character varying(5)   |           | not null | 
        variable_name | character varying(50)  |           | not null | 
        value         | bigint                 |           |          | 
        name          | character varying(100) |           |          | 
        and variable_name has the following values
        female_62_to_64_years
        female_50_to_54_years
        female
        female_18_and_19_years
        male_60_and_61_years
        female_35_to_39_years
        female_80_to_84_years
        male_21_years
        female_under_5_years
        female_21_years
        male_15_to_17_years
        female_15_to_17_years
        male_85_years_and_over
        male_22_to_24_years
        male_35_to_39_years
        male
        female_65_and_66_years
        male_10_to_14_years
        male_40_to_44_years
        female_25_to_29_years
        female_10_to_14_years
        male_62_to_64_years
        male_70_to_74_years
        female_70_to_74_years
        total_population
        male_65_and_66_years
        female_67_to_69_years
        male_25_to_29_years
        male_75_to_79_years
        male_55_to_59_years
        female_22_to_24_years
        male_5_to_9_years
        male_50_to_54_years
        male_45_to_49_years
        male_20_years
        female_85_years_and_over
        male_18_and_19_years
        female_20_years
        female_75_to_79_years
        female_55_to_59_years
        female_30_to_34_years
        male_80_to_84_years
        male_30_to_34_years
        female_5_to_9_years
        male_under_5_years
        female_45_to_49_years
        female_40_to_44_years
        female_60_and_61_years
        male_67_to_69_years
        
        and I want a query to return this format 
        
        age_group   male_pop   female_pop
        "0-4"       1000       950
        "5-9"       950        900
        
        for each state with an associated stusps code 
        6-element Vector{String}:
        "MA"
        "CT"
        "NH"
        "RI"
        "VT"
        "ME"
        
        and for those six states as a group
    

This follows my usual query recipe, which is to describe in detail what I have and what I want.

The scripts come back. For the individual states

        WITH age_groups AS (
        SELECT 
        c.stusps,
        CASE 
        WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
        WHEN vd.variable_name LIKE '%5_to_9%' THEN '5-9'
        WHEN vd.variable_name LIKE '%10_to_14%' THEN '10-14'
        WHEN vd.variable_name LIKE '%15_to_17%' THEN '15-17'
        WHEN vd.variable_name LIKE '%18_and_19%' THEN '18-19'
        WHEN vd.variable_name LIKE '%20%' THEN '20'
        WHEN vd.variable_name LIKE '%21%' THEN '21'
        WHEN vd.variable_name LIKE '%22_to_24%' THEN '22-24'
        WHEN vd.variable_name LIKE '%25_to_29%' THEN '25-29'
        WHEN vd.variable_name LIKE '%30_to_34%' THEN '30-34'
        WHEN vd.variable_name LIKE '%35_to_39%' THEN '35-39'
        WHEN vd.variable_name LIKE '%40_to_44%' THEN '40-44'
        WHEN vd.variable_name LIKE '%45_to_49%' THEN '45-49'
        WHEN vd.variable_name LIKE '%50_to_54%' THEN '50-54'
        WHEN vd.variable_name LIKE '%55_to_59%' THEN '55-59'
        WHEN vd.variable_name LIKE '%60_and_61%' THEN '60-61'
        WHEN vd.variable_name LIKE '%62_to_64%' THEN '62-64'
        WHEN vd.variable_name LIKE '%65_and_66%' THEN '65-66'
        WHEN vd.variable_name LIKE '%67_to_69%' THEN '67-69'
        WHEN vd.variable_name LIKE '%70_to_74%' THEN '70-74'
        WHEN vd.variable_name LIKE '%75_to_79%' THEN '75-79'
        WHEN vd.variable_name LIKE '%80_to_84%' THEN '80-84'
        WHEN vd.variable_name LIKE '%85%' THEN '85+'
        END AS age_group,
        CASE 
        WHEN vd.variable_name LIKE 'male%' THEN 'male'
        WHEN vd.variable_name LIKE 'female%' THEN 'female'
        END AS gender,
        SUM(vd.value) as population
        FROM census.counties c
        JOIN census.variable_data vd ON c.geoid = vd.geoid
        WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
        AND vd.variable_name NOT IN ('male', 'female', 'total_population')
        GROUP BY c.stusps, age_group, gender
        )
        SELECT 
        stusps,
        age_group,
        MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
        MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
        FROM age_groups
        WHERE age_group IS NOT NULL
        GROUP BY stusps, age_group
        ORDER BY stusps, 
        CASE 
        WHEN age_group = '85+' THEN 85
        ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
        END;
    

and for the roll-up

        WITH age_groups AS (
        SELECT 
        CASE 
        WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
        -- [same CASE statement as above for age groups]
        END AS age_group,
        CASE 
        WHEN vd.variable_name LIKE 'male%' THEN 'male'
        WHEN vd.variable_name LIKE 'female%' THEN 'female'
        END AS gender,
        SUM(vd.value) as population
        FROM census.counties c
        JOIN census.variable_data vd ON c.geoid = vd.geoid
        WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
        AND vd.variable_name NOT IN ('male', 'female', 'total_population')
        GROUP BY age_group, gender
        )
        SELECT 
        age_group,
        MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
        MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
        FROM age_groups
        WHERE age_group IS NOT NULL
        GROUP BY age_group
        ORDER BY 
        CASE 
        WHEN age_group = '85+' THEN 85
        ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
        END;
    

It just worked. This is something that I could have figured out eventually through trial and error and recourse to manuals, but it would have taken hours instead of minutes. Being able to be able to pose the question depended on a fair amount of such experience and a firm grasp of the basics. But now I can confidently rely on being able to do such tricky joins without the overhang of syntax. Well worth the $200/per year.

New England Age Pyramid

I'm not sure why the census breaks out the 20- and 21-year olds separately, but I'm sure Claude can help be meld those into the adjoining cohorts.s