Tables are a marvelous way of organizing data but they come at a cost, which is projecting multidimensional data onto a two-dimensional representation. Relational database management systems handle storage and retrieval of tabular data. The incantations to do this are uttered in structured query language—SQL—a powerful set of magical spells. For a programming language, it's quite readable.
SELECT custid, balance FROM orders;
But things can get quickly more complicated as anyone who uses SQL only occasionally is soon reminded. My most recent reminder came while preparing age pyramids from Census data. Age pyramids show the distribution of population, like this one.
The source data was cut and pasted from U.S. Census Bureau, U.S. Department of Commerce. Age and Sex. American Community Survey, ACS 1-Year Estimates Subject Tables, Table S0101, 2023. Accessed on February 8, 2025.
age_groups | male | female |
---|---|---|
Under 5 years | 9373156 | 8960541 |
5 to 9 years | 10136158 | 9663272 |
10 to 14 years | 10877744 | 10326135 |
15 to 19 years | 11364472 | 10803918 |
20 to 24 years | 11073959 | 10544424 |
25 to 29 years | 11090992 | 10815714 |
30 to 34 years | 11811352 | 11593704 |
35 to 39 years | 11437479 | 11212620 |
40 to 44 years | 11100697 | 11025788 |
45 to 49 years | 9893835 | 9965395 |
50 to 54 years | 10297208 | 10364733 |
55 to 59 years | 9929586 | 10268922 |
60 to 64 years | 10543076 | 11132960 |
65 to 69 years | 9025813 | 10001148 |
70 to 74 years | 7367430 | 8430427 |
75 to 79 years | 5129494 | 6189257 |
80 to 84 years | 3013517 | 4027902 |
85 years and over | 2263405 | 3858663 |
That's easy enough, but what I was after was to do it for New England, its states and their counties. The Census county tables look like this
"GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E","B01001_003M","B01001_004E","B01001_004M","B01001_005E","B01001_005M","B01001_006E","B01001_006M","B01001_007E","B01001_007M","B01001_008E","B01001_008M","B01001_009E","B01001_009M","B01001_010E","B01001_010M","B01001_011E","B01001_011M","B01001_012E","B01001_012M","B01001_013E","B01001_013M","B01001_014E","B01001_014M","B01001_015E","B01001_015M","B01001_016E","B01001_016M","B01001_017E","B01001_017M","B01001_018E","B01001_018M","B01001_019E","B01001_019M","B01001_020E","B01001_020M","B01001_021E","B01001_021M","B01001_022E","B01001_022M","B01001_023E","B01001_023M","B01001_024E","B01001_024M","B01001_025E","B01001_025M","B01001_026E","B01001_026M","B01001_027E","B01001_027M","B01001_028E","B01001_028M","B01001_029E","B01001_029M","B01001_030E","B01001_030M","B01001_031E","B01001_031M","B01001_032E","B01001_032M","B01001_033E","B01001_033M","B01001_034E","B01001_034M","B01001_035E","B01001_035M","B01001_036E","B01001_036M","B01001_037E","B01001_037M","B01001_038E","B01001_038M","B01001_039E","B01001_039M","B01001_040E","B01001_040M","B01001_041E","B01001_041M","B01001_042E","B01001_042M","B01001_043E","B01001_043M","B01001_044E","B01001_044M","B01001_045E","B01001_045M","B01001_046E","B01001_046M","B01001_047E","B01001_047M","B01001_048E","B01001_048M","B01001_049E","B01001_049M", "0500000US01001","Autauga County, Alabama","59285","*****","28669","263","1720","223","1855","279","2115","272","1306","84","575","97","214","157","336","124","1145","239","1930","154","1772","51","2264","239","1489","228","2001","239","1931","89","2133","236","743","177","1004","253","584","161","833","174","1098","213","980","148","338","104","303","116","30616","263","1710","204","1556","296","2223","306","1441","160","600","114","123","77","611","184","988","172","1841","41","2062","123","1744","278","2405","286","2016","114","1956","46","2231","267","758","185","1013","240","590","148","835","187","1519","203","1220","219","680","153","494","178",
This is a job for SQL, so I import it into my census database and turn for help to my buddy Claude
I have postgres tables census.counties with the following schema geocoder=# \d census.counties Table "census.counties" Column | Type | Collation | Nullable | Default ------------+-----------------------------+-----------+----------+---------------------------------------------- gid | integer | | not null | nextval('census.counties_gid_seq'::regclass) statefp | character varying(2) | | | countyfp | character varying(3) | | | countyns | character varying(8) | | | geoidfq | character varying(14) | | | geoid | character varying(5) | | | name | character varying(100) | | | namelsad | character varying(100) | | | stusps | character varying(2) | | | state_name | character varying(100) | | | lsad | character varying(2) | | | aland | double precision | | | awater | double precision | | | geom | geometry(MultiPolygon,4269) | | | Indexes: "counties_pkey" PRIMARY KEY, btree (gid) "counties_geoid_key" UNIQUE CONSTRAINT, btree (geoid) "counties_geom_idx" gist (geom) "counties_name_idx" btree (name) Referenced by: TABLE "census.variable_data" CONSTRAINT "variable_data_geoid_fkey" FOREIGN KEY (geoid) REFERENCES census.counties(geoid) and a table census.variable_data with geocoder=# \d census.variable_data Table "census.variable_data" Column | Type | Collation | Nullable | Default ---------------+------------------------+-----------+----------+--------- geoid | character varying(5) | | not null | variable_name | character varying(50) | | not null | value | bigint | | | name | character varying(100) | | | and variable_name has the following values female_62_to_64_years female_50_to_54_years female female_18_and_19_years male_60_and_61_years female_35_to_39_years female_80_to_84_years male_21_years female_under_5_years female_21_years male_15_to_17_years female_15_to_17_years male_85_years_and_over male_22_to_24_years male_35_to_39_years male female_65_and_66_years male_10_to_14_years male_40_to_44_years female_25_to_29_years female_10_to_14_years male_62_to_64_years male_70_to_74_years female_70_to_74_years total_population male_65_and_66_years female_67_to_69_years male_25_to_29_years male_75_to_79_years male_55_to_59_years female_22_to_24_years male_5_to_9_years male_50_to_54_years male_45_to_49_years male_20_years female_85_years_and_over male_18_and_19_years female_20_years female_75_to_79_years female_55_to_59_years female_30_to_34_years male_80_to_84_years male_30_to_34_years female_5_to_9_years male_under_5_years female_45_to_49_years female_40_to_44_years female_60_and_61_years male_67_to_69_years and I want a query to return this format age_group male_pop female_pop "0-4" 1000 950 "5-9" 950 900 for each state with an associated stusps code 6-element Vector{String}: "MA" "CT" "NH" "RI" "VT" "ME" and for those six states as a group
This follows my usual query recipe, which is to describe in detail what I have and what I want.
The scripts come back. For the individual states
WITH age_groups AS ( SELECT c.stusps, CASE WHEN vd.variable_name LIKE '%under_5%' THEN '0-4' WHEN vd.variable_name LIKE '%5_to_9%' THEN '5-9' WHEN vd.variable_name LIKE '%10_to_14%' THEN '10-14' WHEN vd.variable_name LIKE '%15_to_17%' THEN '15-17' WHEN vd.variable_name LIKE '%18_and_19%' THEN '18-19' WHEN vd.variable_name LIKE '%20%' THEN '20' WHEN vd.variable_name LIKE '%21%' THEN '21' WHEN vd.variable_name LIKE '%22_to_24%' THEN '22-24' WHEN vd.variable_name LIKE '%25_to_29%' THEN '25-29' WHEN vd.variable_name LIKE '%30_to_34%' THEN '30-34' WHEN vd.variable_name LIKE '%35_to_39%' THEN '35-39' WHEN vd.variable_name LIKE '%40_to_44%' THEN '40-44' WHEN vd.variable_name LIKE '%45_to_49%' THEN '45-49' WHEN vd.variable_name LIKE '%50_to_54%' THEN '50-54' WHEN vd.variable_name LIKE '%55_to_59%' THEN '55-59' WHEN vd.variable_name LIKE '%60_and_61%' THEN '60-61' WHEN vd.variable_name LIKE '%62_to_64%' THEN '62-64' WHEN vd.variable_name LIKE '%65_and_66%' THEN '65-66' WHEN vd.variable_name LIKE '%67_to_69%' THEN '67-69' WHEN vd.variable_name LIKE '%70_to_74%' THEN '70-74' WHEN vd.variable_name LIKE '%75_to_79%' THEN '75-79' WHEN vd.variable_name LIKE '%80_to_84%' THEN '80-84' WHEN vd.variable_name LIKE '%85%' THEN '85+' END AS age_group, CASE WHEN vd.variable_name LIKE 'male%' THEN 'male' WHEN vd.variable_name LIKE 'female%' THEN 'female' END AS gender, SUM(vd.value) as population FROM census.counties c JOIN census.variable_data vd ON c.geoid = vd.geoid WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME') AND vd.variable_name NOT IN ('male', 'female', 'total_population') GROUP BY c.stusps, age_group, gender ) SELECT stusps, age_group, MAX(CASE WHEN gender = 'male' THEN population END) as male_pop, MAX(CASE WHEN gender = 'female' THEN population END) as female_pop FROM age_groups WHERE age_group IS NOT NULL GROUP BY stusps, age_group ORDER BY stusps, CASE WHEN age_group = '85+' THEN 85 ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER) END;
and for the roll-up
WITH age_groups AS ( SELECT CASE WHEN vd.variable_name LIKE '%under_5%' THEN '0-4' -- [same CASE statement as above for age groups] END AS age_group, CASE WHEN vd.variable_name LIKE 'male%' THEN 'male' WHEN vd.variable_name LIKE 'female%' THEN 'female' END AS gender, SUM(vd.value) as population FROM census.counties c JOIN census.variable_data vd ON c.geoid = vd.geoid WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME') AND vd.variable_name NOT IN ('male', 'female', 'total_population') GROUP BY age_group, gender ) SELECT age_group, MAX(CASE WHEN gender = 'male' THEN population END) as male_pop, MAX(CASE WHEN gender = 'female' THEN population END) as female_pop FROM age_groups WHERE age_group IS NOT NULL GROUP BY age_group ORDER BY CASE WHEN age_group = '85+' THEN 85 ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER) END;
It just worked. This is something that I could have figured out eventually through trial and error and recourse to manuals, but it would have taken hours instead of minutes. Being able to be able to pose the question depended on a fair amount of such experience and a firm grasp of the basics. But now I can confidently rely on being able to do such tricky joins without the overhang of syntax. Well worth the $200/per year.
I'm not sure why the census breaks out the 20- and 21-year olds separately, but I'm sure Claude can help be meld those into the adjoining cohorts.s