Tabular magic

Tables are a marvelous way of organizing data but they come at a cost, which is projecting multidimensional data onto a two-dimensional representation. Relational database management systems handle storage and retrieval of tabular data. The incantations to do this are uttered in structured query language—SQL—a powerful set of magical spells. For a programming language, it's quite readable.

SELECT custid, balance FROM orders;

But things can get quickly more complicated as anyone who uses SQL only occasionally is soon reminded. My most recent reminder came while preparing age pyramids from Census data. Age pyramids show the distribution of population, like this one.

The source data was cut and pasted from U.S. Census Bureau, U.S. Department of Commerce. Age and Sex. American Community Survey, ACS 1-Year Estimates Subject Tables, Table S0101, 2023. Accessed on February 8, 2025.

age_groups	male	female
Under 5 years	9373156	8960541
5 to 9 years	10136158	9663272
10 to 14 years	10877744	10326135
15 to 19 years	11364472	10803918
20 to 24 years	11073959	10544424
25 to 29 years	11090992	10815714
30 to 34 years	11811352	11593704
35 to 39 years	11437479	11212620
40 to 44 years	11100697	11025788
45 to 49 years	9893835	9965395
50 to 54 years	10297208	10364733
55 to 59 years	9929586	10268922
60 to 64 years	10543076	11132960
65 to 69 years	9025813	10001148
70 to 74 years	7367430	8430427
75 to 79 years	5129494	6189257
80 to 84 years	3013517	4027902
85 years and over	2263405	3858663

That's easy enough, but what I was after was to do it for New England, its states and their counties. The Census county tables look like this

    "GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E","B01001_003M","B01001_004E","B01001_004M","B01001_005E","B01001_005M","B01001_006E","B01001_006M","B01001_007E","B01001_007M","B01001_008E","B01001_008M","B01001_009E","B01001_009M","B01001_010E","B01001_010M","B01001_011E","B01001_011M","B01001_012E","B01001_012M","B01001_013E","B01001_013M","B01001_014E","B01001_014M","B01001_015E","B01001_015M","B01001_016E","B01001_016M","B01001_017E","B01001_017M","B01001_018E","B01001_018M","B01001_019E","B01001_019M","B01001_020E","B01001_020M","B01001_021E","B01001_021M","B01001_022E","B01001_022M","B01001_023E","B01001_023M","B01001_024E","B01001_024M","B01001_025E","B01001_025M","B01001_026E","B01001_026M","B01001_027E","B01001_027M","B01001_028E","B01001_028M","B01001_029E","B01001_029M","B01001_030E","B01001_030M","B01001_031E","B01001_031M","B01001_032E","B01001_032M","B01001_033E","B01001_033M","B01001_034E","B01001_034M","B01001_035E","B01001_035M","B01001_036E","B01001_036M","B01001_037E","B01001_037M","B01001_038E","B01001_038M","B01001_039E","B01001_039M","B01001_040E","B01001_040M","B01001_041E","B01001_041M","B01001_042E","B01001_042M","B01001_043E","B01001_043M","B01001_044E","B01001_044M","B01001_045E","B01001_045M","B01001_046E","B01001_046M","B01001_047E","B01001_047M","B01001_048E","B01001_048M","B01001_049E","B01001_049M",
    "0500000US01001","Autauga County, Alabama","59285","*****","28669","263","1720","223","1855","279","2115","272","1306","84","575","97","214","157","336","124","1145","239","1930","154","1772","51","2264","239","1489","228","2001","239","1931","89","2133","236","743","177","1004","253","584","161","833","174","1098","213","980","148","338","104","303","116","30616","263","1710","204","1556","296","2223","306","1441","160","600","114","123","77","611","184","988","172","1841","41","2062","123","1744","278","2405","286","2016","114","1956","46","2231","267","758","185","1013","240","590","148","835","187","1519","203","1220","219","680","153","494","178",

This is a job for SQL, so I import it into my census database and turn for help to my buddy Claude

        I have postgres tables census.counties with the following schema
        
        geocoder=# \d census.counties
        Table "census.counties"
        Column   |            Type             | Collation | Nullable |                   Default                    
        ------------+-----------------------------+-----------+----------+----------------------------------------------
        gid        | integer                     |           | not null | nextval('census.counties_gid_seq'::regclass)
        statefp    | character varying(2)        |           |          | 
        countyfp   | character varying(3)        |           |          | 
        countyns   | character varying(8)        |           |          | 
        geoidfq    | character varying(14)       |           |          | 
        geoid      | character varying(5)        |           |          | 
        name       | character varying(100)      |           |          | 
        namelsad   | character varying(100)      |           |          | 
        stusps     | character varying(2)        |           |          | 
        state_name | character varying(100)      |           |          | 
        lsad       | character varying(2)        |           |          | 
        aland      | double precision            |           |          | 
        awater     | double precision            |           |          | 
        geom       | geometry(MultiPolygon,4269) |           |          | 
        Indexes:
        "counties_pkey" PRIMARY KEY, btree (gid)
        "counties_geoid_key" UNIQUE CONSTRAINT, btree (geoid)
        "counties_geom_idx" gist (geom)
        "counties_name_idx" btree (name)
        Referenced by:
        TABLE "census.variable_data" CONSTRAINT "variable_data_geoid_fkey" FOREIGN KEY (geoid) REFERENCES census.counties(geoid)
        
        and a table census.variable_data with
        
        geocoder=# \d census.variable_data
        Table "census.variable_data"
        Column     |          Type          | Collation | Nullable | Default 
        ---------------+------------------------+-----------+----------+---------
        geoid         | character varying(5)   |           | not null | 
        variable_name | character varying(50)  |           | not null | 
        value         | bigint                 |           |          | 
        name          | character varying(100) |           |          | 
        and variable_name has the following values
        female_62_to_64_years
        female_50_to_54_years
        female
        female_18_and_19_years
        male_60_and_61_years
        female_35_to_39_years
        female_80_to_84_years
        male_21_years
        female_under_5_years
        female_21_years
        male_15_to_17_years
        female_15_to_17_years
        male_85_years_and_over
        male_22_to_24_years
        male_35_to_39_years
        male
        female_65_and_66_years
        male_10_to_14_years
        male_40_to_44_years
        female_25_to_29_years
        female_10_to_14_years
        male_62_to_64_years
        male_70_to_74_years
        female_70_to_74_years
        total_population
        male_65_and_66_years
        female_67_to_69_years
        male_25_to_29_years
        male_75_to_79_years
        male_55_to_59_years
        female_22_to_24_years
        male_5_to_9_years
        male_50_to_54_years
        male_45_to_49_years
        male_20_years
        female_85_years_and_over
        male_18_and_19_years
        female_20_years
        female_75_to_79_years
        female_55_to_59_years
        female_30_to_34_years
        male_80_to_84_years
        male_30_to_34_years
        female_5_to_9_years
        male_under_5_years
        female_45_to_49_years
        female_40_to_44_years
        female_60_and_61_years
        male_67_to_69_years
        
        and I want a query to return this format 
        
        age_group   male_pop   female_pop
        "0-4"       1000       950
        "5-9"       950        900
        
        for each state with an associated stusps code 
        6-element Vector{String}:
        "MA"
        "CT"
        "NH"
        "RI"
        "VT"
        "ME"
        
        and for those six states as a group

This follows my usual query recipe, which is to describe in detail what I have and what I want.

The scripts come back. For the individual states

        WITH age_groups AS (
        SELECT 
        c.stusps,
        CASE 
        WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
        WHEN vd.variable_name LIKE '%5_to_9%' THEN '5-9'
        WHEN vd.variable_name LIKE '%10_to_14%' THEN '10-14'
        WHEN vd.variable_name LIKE '%15_to_17%' THEN '15-17'
        WHEN vd.variable_name LIKE '%18_and_19%' THEN '18-19'
        WHEN vd.variable_name LIKE '%20%' THEN '20'
        WHEN vd.variable_name LIKE '%21%' THEN '21'
        WHEN vd.variable_name LIKE '%22_to_24%' THEN '22-24'
        WHEN vd.variable_name LIKE '%25_to_29%' THEN '25-29'
        WHEN vd.variable_name LIKE '%30_to_34%' THEN '30-34'
        WHEN vd.variable_name LIKE '%35_to_39%' THEN '35-39'
        WHEN vd.variable_name LIKE '%40_to_44%' THEN '40-44'
        WHEN vd.variable_name LIKE '%45_to_49%' THEN '45-49'
        WHEN vd.variable_name LIKE '%50_to_54%' THEN '50-54'
        WHEN vd.variable_name LIKE '%55_to_59%' THEN '55-59'
        WHEN vd.variable_name LIKE '%60_and_61%' THEN '60-61'
        WHEN vd.variable_name LIKE '%62_to_64%' THEN '62-64'
        WHEN vd.variable_name LIKE '%65_and_66%' THEN '65-66'
        WHEN vd.variable_name LIKE '%67_to_69%' THEN '67-69'
        WHEN vd.variable_name LIKE '%70_to_74%' THEN '70-74'
        WHEN vd.variable_name LIKE '%75_to_79%' THEN '75-79'
        WHEN vd.variable_name LIKE '%80_to_84%' THEN '80-84'
        WHEN vd.variable_name LIKE '%85%' THEN '85+'
        END AS age_group,
        CASE 
        WHEN vd.variable_name LIKE 'male%' THEN 'male'
        WHEN vd.variable_name LIKE 'female%' THEN 'female'
        END AS gender,
        SUM(vd.value) as population
        FROM census.counties c
        JOIN census.variable_data vd ON c.geoid = vd.geoid
        WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
        AND vd.variable_name NOT IN ('male', 'female', 'total_population')
        GROUP BY c.stusps, age_group, gender
        )
        SELECT 
        stusps,
        age_group,
        MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
        MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
        FROM age_groups
        WHERE age_group IS NOT NULL
        GROUP BY stusps, age_group
        ORDER BY stusps, 
        CASE 
        WHEN age_group = '85+' THEN 85
        ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
        END;

and for the roll-up

        WITH age_groups AS (
        SELECT 
        CASE 
        WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
        -- [same CASE statement as above for age groups]
        END AS age_group,
        CASE 
        WHEN vd.variable_name LIKE 'male%' THEN 'male'
        WHEN vd.variable_name LIKE 'female%' THEN 'female'
        END AS gender,
        SUM(vd.value) as population
        FROM census.counties c
        JOIN census.variable_data vd ON c.geoid = vd.geoid
        WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
        AND vd.variable_name NOT IN ('male', 'female', 'total_population')
        GROUP BY age_group, gender
        )
        SELECT 
        age_group,
        MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
        MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
        FROM age_groups
        WHERE age_group IS NOT NULL
        GROUP BY age_group
        ORDER BY 
        CASE 
        WHEN age_group = '85+' THEN 85
        ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
        END;

It just worked. This is something that I could have figured out eventually through trial and error and recourse to manuals, but it would have taken hours instead of minutes. Being able to be able to pose the question depended on a fair amount of such experience and a firm grasp of the basics. But now I can confidently rely on being able to do such tricky joins without the overhang of syntax. Well worth the $200/per year.

I'm not sure why the census breaks out the 20- and 21-year olds separately, but I'm sure Claude can help be meld those into the adjoining cohorts.s