Tables are a marvelous way of organizing data but they come at a cost, which is projecting multidimensional data onto a two-dimensional representation. Relational database management systems handle storage and retrieval of tabular data. The incantations to do this are uttered in structured query language—SQL—a powerful set of magical spells. For a programming language, it's quite readable.
SELECT custid, balance FROM orders;
But things can get quickly more complicated as anyone who uses SQL only occasionally is soon reminded. My most recent reminder came while preparing age pyramids from Census data. Age pyramids show the distribution of population, like this one.
The source data was cut and pasted from U.S. Census Bureau, U.S. Department of Commerce. Age and Sex. American Community Survey, ACS 1-Year Estimates Subject Tables, Table S0101, 2023. Accessed on February 8, 2025.
| age_groups | male | female |
|---|---|---|
| Under 5 years | 9373156 | 8960541 |
| 5 to 9 years | 10136158 | 9663272 |
| 10 to 14 years | 10877744 | 10326135 |
| 15 to 19 years | 11364472 | 10803918 |
| 20 to 24 years | 11073959 | 10544424 |
| 25 to 29 years | 11090992 | 10815714 |
| 30 to 34 years | 11811352 | 11593704 |
| 35 to 39 years | 11437479 | 11212620 |
| 40 to 44 years | 11100697 | 11025788 |
| 45 to 49 years | 9893835 | 9965395 |
| 50 to 54 years | 10297208 | 10364733 |
| 55 to 59 years | 9929586 | 10268922 |
| 60 to 64 years | 10543076 | 11132960 |
| 65 to 69 years | 9025813 | 10001148 |
| 70 to 74 years | 7367430 | 8430427 |
| 75 to 79 years | 5129494 | 6189257 |
| 80 to 84 years | 3013517 | 4027902 |
| 85 years and over | 2263405 | 3858663 |
That's easy enough, but what I was after was to do it for New England, its states and their counties. The Census county tables look like this
"GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E","B01001_003M","B01001_004E","B01001_004M","B01001_005E","B01001_005M","B01001_006E","B01001_006M","B01001_007E","B01001_007M","B01001_008E","B01001_008M","B01001_009E","B01001_009M","B01001_010E","B01001_010M","B01001_011E","B01001_011M","B01001_012E","B01001_012M","B01001_013E","B01001_013M","B01001_014E","B01001_014M","B01001_015E","B01001_015M","B01001_016E","B01001_016M","B01001_017E","B01001_017M","B01001_018E","B01001_018M","B01001_019E","B01001_019M","B01001_020E","B01001_020M","B01001_021E","B01001_021M","B01001_022E","B01001_022M","B01001_023E","B01001_023M","B01001_024E","B01001_024M","B01001_025E","B01001_025M","B01001_026E","B01001_026M","B01001_027E","B01001_027M","B01001_028E","B01001_028M","B01001_029E","B01001_029M","B01001_030E","B01001_030M","B01001_031E","B01001_031M","B01001_032E","B01001_032M","B01001_033E","B01001_033M","B01001_034E","B01001_034M","B01001_035E","B01001_035M","B01001_036E","B01001_036M","B01001_037E","B01001_037M","B01001_038E","B01001_038M","B01001_039E","B01001_039M","B01001_040E","B01001_040M","B01001_041E","B01001_041M","B01001_042E","B01001_042M","B01001_043E","B01001_043M","B01001_044E","B01001_044M","B01001_045E","B01001_045M","B01001_046E","B01001_046M","B01001_047E","B01001_047M","B01001_048E","B01001_048M","B01001_049E","B01001_049M",
"0500000US01001","Autauga County, Alabama","59285","*****","28669","263","1720","223","1855","279","2115","272","1306","84","575","97","214","157","336","124","1145","239","1930","154","1772","51","2264","239","1489","228","2001","239","1931","89","2133","236","743","177","1004","253","584","161","833","174","1098","213","980","148","338","104","303","116","30616","263","1710","204","1556","296","2223","306","1441","160","600","114","123","77","611","184","988","172","1841","41","2062","123","1744","278","2405","286","2016","114","1956","46","2231","267","758","185","1013","240","590","148","835","187","1519","203","1220","219","680","153","494","178",
This is a job for SQL, so I import it into my census database and turn for help to my buddy Claude
I have postgres tables census.counties with the following schema
geocoder=# \d census.counties
Table "census.counties"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+----------------------------------------------
gid | integer | | not null | nextval('census.counties_gid_seq'::regclass)
statefp | character varying(2) | | |
countyfp | character varying(3) | | |
countyns | character varying(8) | | |
geoidfq | character varying(14) | | |
geoid | character varying(5) | | |
name | character varying(100) | | |
namelsad | character varying(100) | | |
stusps | character varying(2) | | |
state_name | character varying(100) | | |
lsad | character varying(2) | | |
aland | double precision | | |
awater | double precision | | |
geom | geometry(MultiPolygon,4269) | | |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_geoid_key" UNIQUE CONSTRAINT, btree (geoid)
"counties_geom_idx" gist (geom)
"counties_name_idx" btree (name)
Referenced by:
TABLE "census.variable_data" CONSTRAINT "variable_data_geoid_fkey" FOREIGN KEY (geoid) REFERENCES census.counties(geoid)
and a table census.variable_data with
geocoder=# \d census.variable_data
Table "census.variable_data"
Column | Type | Collation | Nullable | Default
---------------+------------------------+-----------+----------+---------
geoid | character varying(5) | | not null |
variable_name | character varying(50) | | not null |
value | bigint | | |
name | character varying(100) | | |
and variable_name has the following values
female_62_to_64_years
female_50_to_54_years
female
female_18_and_19_years
male_60_and_61_years
female_35_to_39_years
female_80_to_84_years
male_21_years
female_under_5_years
female_21_years
male_15_to_17_years
female_15_to_17_years
male_85_years_and_over
male_22_to_24_years
male_35_to_39_years
male
female_65_and_66_years
male_10_to_14_years
male_40_to_44_years
female_25_to_29_years
female_10_to_14_years
male_62_to_64_years
male_70_to_74_years
female_70_to_74_years
total_population
male_65_and_66_years
female_67_to_69_years
male_25_to_29_years
male_75_to_79_years
male_55_to_59_years
female_22_to_24_years
male_5_to_9_years
male_50_to_54_years
male_45_to_49_years
male_20_years
female_85_years_and_over
male_18_and_19_years
female_20_years
female_75_to_79_years
female_55_to_59_years
female_30_to_34_years
male_80_to_84_years
male_30_to_34_years
female_5_to_9_years
male_under_5_years
female_45_to_49_years
female_40_to_44_years
female_60_and_61_years
male_67_to_69_years
and I want a query to return this format
age_group male_pop female_pop
"0-4" 1000 950
"5-9" 950 900
for each state with an associated stusps code
6-element Vector{String}:
"MA"
"CT"
"NH"
"RI"
"VT"
"ME"
and for those six states as a group
This follows my usual query recipe, which is to describe in detail what I have and what I want.
The scripts come back. For the individual states
WITH age_groups AS (
SELECT
c.stusps,
CASE
WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
WHEN vd.variable_name LIKE '%5_to_9%' THEN '5-9'
WHEN vd.variable_name LIKE '%10_to_14%' THEN '10-14'
WHEN vd.variable_name LIKE '%15_to_17%' THEN '15-17'
WHEN vd.variable_name LIKE '%18_and_19%' THEN '18-19'
WHEN vd.variable_name LIKE '%20%' THEN '20'
WHEN vd.variable_name LIKE '%21%' THEN '21'
WHEN vd.variable_name LIKE '%22_to_24%' THEN '22-24'
WHEN vd.variable_name LIKE '%25_to_29%' THEN '25-29'
WHEN vd.variable_name LIKE '%30_to_34%' THEN '30-34'
WHEN vd.variable_name LIKE '%35_to_39%' THEN '35-39'
WHEN vd.variable_name LIKE '%40_to_44%' THEN '40-44'
WHEN vd.variable_name LIKE '%45_to_49%' THEN '45-49'
WHEN vd.variable_name LIKE '%50_to_54%' THEN '50-54'
WHEN vd.variable_name LIKE '%55_to_59%' THEN '55-59'
WHEN vd.variable_name LIKE '%60_and_61%' THEN '60-61'
WHEN vd.variable_name LIKE '%62_to_64%' THEN '62-64'
WHEN vd.variable_name LIKE '%65_and_66%' THEN '65-66'
WHEN vd.variable_name LIKE '%67_to_69%' THEN '67-69'
WHEN vd.variable_name LIKE '%70_to_74%' THEN '70-74'
WHEN vd.variable_name LIKE '%75_to_79%' THEN '75-79'
WHEN vd.variable_name LIKE '%80_to_84%' THEN '80-84'
WHEN vd.variable_name LIKE '%85%' THEN '85+'
END AS age_group,
CASE
WHEN vd.variable_name LIKE 'male%' THEN 'male'
WHEN vd.variable_name LIKE 'female%' THEN 'female'
END AS gender,
SUM(vd.value) as population
FROM census.counties c
JOIN census.variable_data vd ON c.geoid = vd.geoid
WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
AND vd.variable_name NOT IN ('male', 'female', 'total_population')
GROUP BY c.stusps, age_group, gender
)
SELECT
stusps,
age_group,
MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
FROM age_groups
WHERE age_group IS NOT NULL
GROUP BY stusps, age_group
ORDER BY stusps,
CASE
WHEN age_group = '85+' THEN 85
ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
END;
and for the roll-up
WITH age_groups AS (
SELECT
CASE
WHEN vd.variable_name LIKE '%under_5%' THEN '0-4'
-- [same CASE statement as above for age groups]
END AS age_group,
CASE
WHEN vd.variable_name LIKE 'male%' THEN 'male'
WHEN vd.variable_name LIKE 'female%' THEN 'female'
END AS gender,
SUM(vd.value) as population
FROM census.counties c
JOIN census.variable_data vd ON c.geoid = vd.geoid
WHERE c.stusps IN ('MA', 'CT', 'NH', 'RI', 'VT', 'ME')
AND vd.variable_name NOT IN ('male', 'female', 'total_population')
GROUP BY age_group, gender
)
SELECT
age_group,
MAX(CASE WHEN gender = 'male' THEN population END) as male_pop,
MAX(CASE WHEN gender = 'female' THEN population END) as female_pop
FROM age_groups
WHERE age_group IS NOT NULL
GROUP BY age_group
ORDER BY
CASE
WHEN age_group = '85+' THEN 85
ELSE CAST(SPLIT_PART(age_group, '-', 1) AS INTEGER)
END;
It just worked. This is something that I could have figured out eventually through trial and error and recourse to manuals, but it would have taken hours instead of minutes. Being able to be able to pose the question depended on a fair amount of such experience and a firm grasp of the basics. But now I can confidently rely on being able to do such tricky joins without the overhang of syntax. Well worth the $200/per year.
I'm not sure why the census breaks out the 20- and 21-year olds separately, but I'm sure Claude can help be meld those into the adjoining cohorts.s