Significant Anomalies (?) in 'Income by Age' against 'Population by Age' ... Thoughts?

#1
Anomalies in 'Income by Age' against 'Population by Age'

The published statistics
Current Population Survey Tables for Personal Income
By Bureau of Labor Statistics and the Census Bureau
pinc01_1_2_1
People 15 Years Old and Over by Total Money Income in 2020
Male
All Races
U.S. Census Bureau, Population Division, 2020 Demographic Analysis (December 2020 release)
Table 1. Total U.S. Resident Population by Age, Sex, and Series: April 1, 2020
Introduction
Hello everyone!
My first post here.
Ha! There will be more :)

The project is 'to extract population numbers by age, income, and (ultimately) region'.

I have succeeded in merging 'US Income Distribution to Over 250k by RaceMale 01 (pinc11_1)' with 'pinc01_1_2_1'
... this, in order to extend 'pinc01_1_2_1' increments from 'Over 100k' to 'Over 250k'.
Also, the new 'pinc01_1_2_1' can now be queried by actual-age and income.

There are issues:
  • Dividing the 5-year increments by 5 is certainly crude for the age group 25 to 29, but likely not problematic for 30+.
  • Also, there is the question of how to use regional populations that are quoted with their local median income.
I'll post specific questions on these issues.
However, I first needed to align 'pinc01_1_2_1' to the published population stats.
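As a rough sketch of the 'divide the 5-year increments by 5' step mentioned above: the function name and the sample counts below are my own illustration, not part of the pinc tables, and the flat split inside each group is exactly the crude assumption being discussed.

```python
def split_five_year_groups(groups):
    """Spread each 5-year age group evenly over its five single ages.

    `groups` maps the first age of a group (e.g. 25 for 25-29) to its count.
    Assumption: the distribution inside every group is flat.
    """
    by_age = {}
    for start_age, count in groups.items():
        for age in range(start_age, start_age + 5):
            by_age[age] = count / 5
    return by_age


# Illustrative counts only (thousands); NOT taken from pinc01_1_2_1.
example = {25: 1000.0, 30: 1200.0}
single_ages = split_five_year_groups(example)
print(single_ages[27], single_ages[30])
```

The group totals are preserved by construction, so the split can always be summed back to the published 5-year figures.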

Anomalies in pinc01_1_2_1 (Table Image Attached)

It was quickly apparent that the income data for 15 to 24 year olds was irrelevant for my purposes - it merges too many school kids with workers.
With that data removed, and the totals updated ... I totaled the 'over 25 with income' = 103,521k

Census Population by Age (Table 1.) was loaded.
Series 'Low' was selected, because it best matched 'Population by State & County' total population.

The entire 'Series Low' was multiplied down to produce 'Over 25 Males' = 103,521k
The assumption being tested was that the number of people with income would follow the population pattern across the age groups.
Each 5 year group in 'Series Low' was then summed, to match the 5 year groups in 'pinc01_1_2_1'
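A minimal sketch of the two steps just described (one common multiplier, then summing into 5-year groups). The function names and the sample counts are hypothetical; only the procedure follows the post.

```python
def scale_to_total(counts, target_total):
    """Multiply every count by one common factor so the sum equals target_total."""
    factor = target_total / sum(counts.values())
    scaled = {age: n * factor for age, n in counts.items()}
    return scaled, factor


def sum_into_five_year_groups(by_age, first_age=25):
    """Sum single-age counts into 5-year groups keyed by the group's first age."""
    groups = {}
    for age, n in by_age.items():
        start = first_age + 5 * ((age - first_age) // 5)
        groups[start] = groups.get(start, 0.0) + n
    return groups


# Illustrative single-age counts (thousands) for ages 25..34; NOT census data.
census_by_age = {age: 100.0 for age in range(25, 35)}
scaled, factor = scale_to_total(census_by_age, target_total=900.0)
groups = sum_into_five_year_groups(scaled)
print(factor, groups)
```

Because a single factor is applied to every age, the shape of the age distribution is unchanged; only the overall level moves.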

The key year groups (25 - 44) were a very close match - (50 - 54) was also very close ... the rest were out by, I believe, a substantial amount.

I decided to test how close I could get the results.
Series Low was multiplied down to 'Over 25 Males' = 102,016k
This produced a very close correlation between population and the main 'year groups' in 'pinc01_1_2_1'

I then paused to reflect....
How could (45 to 49) be so far out?
Damnation! I nearly had a perfect series (25 - 54)

Have a look at the attached image....

(pinc01_1_2_1) against (Census Population).png
Note: This table is a numeric cut, from the full 'Series Low'.

Clearly, we are down 1,505k people, but (45 to 49) and (55 to 59) would have been much further out, and (65 to 75+) were anyway beyond hope.

However, the question doesn't relate to the multiplier being used.
From an engineer's perspective:
(45 to 49) & (55 to 59) are wild positive anomalies.
(60 to 64) would probably scrape in
(65 to 75+) can be scrapped (which is fine) ... They seem to have taken the brunt of 'losing 1,505k people'
... and also, due to dodgy historic records, there may anyway be reason to discount them.

What Isn't Clear To Me is:
How is it that these anomalies didn't show up during 'cross-reference testing'?

What to do?
The option has occurred to me, of running appropriate multipliers across the 'pinc01_1_2_1' ranges, so that they match the census.
At first glance, this doesn't sit right ... but I guess that I will need to do that anyway, for variable regional income medians (if that is even possible).

I stopped to look at other studies - one being a serious 'global wealth' study written by top professors; quoting US Adult totals that bore absolutely no relation to published census figures (out by over 40 million from 21+, and more from 18+).

It's disheartening; but I'm keeping my chin up, because I've got (25 to 44) in the bag ... Hahah!

I was thinking ... you chaps are experienced in this field ... what are your thoughts?
 
#2
After reflecting upon the previous day's work, it became clear that I must run a comparison of the 'pinc01_1_2_1' Total Population 25+ against all the census results.

The principle being that this would eliminate any differences due to income.

I used the State & County projections in place of the Series Low, as it is very similar, and it would be the projection to use, as it groups the population down to county level.

The signs of the differences have been changed from the previous table, so that they reflect the under- or over-reporting of the population by the pinc study.

Results
The standout result is that the pinc over 75 population is projected to be larger than all the census population projections.
907k greater than the State & County projections
... this, while the age group (55 to 59) is 914k less than the State & County projections.

Population Totals by Age Group - All Census Test.png

'pinc01_1_2_1' population by age & income was then examined - in particular, the range 45 to 49.
There were numerous entries that stood out as being unusually low, when compared to the other ages within each earning range of 2.5k.

For example:

Code:
60k to 62.5k : 142 382 354 386 310   337   346 288 270 136 107 121
62.5k to 65k :  43  65 159 135 116    86   115  87 133 121  74 114
65k to 67.5k :  42 261 294 232 224   216   206 196 194 127 112  92

Income brackets starting with x2.5k are less populated than brackets beginning with a whole number
... however, the 86 is low.
As is the 87 (55 to 59 age group)

There are 41 income brackets, so consequently, small errors can mount up.
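One way to put a number on 'unusually low' is to compare each entry in the 62.5k row with the mean of the rows on either side. This is only a sketch: the three rows are copied from the example above, the columns are numbered by position because the age labels are not shown in the excerpt, and the neighbour-mean comparison is my own choice of yardstick rather than anything from the pinc documentation.

```python
rows = {
    "60k to 62.5k": [142, 382, 354, 386, 310, 337, 346, 288, 270, 136, 107, 121],
    "62.5k to 65k": [43, 65, 159, 135, 116, 86, 115, 87, 133, 121, 74, 114],
    "65k to 67.5k": [42, 261, 294, 232, 224, 216, 206, 196, 194, 127, 112, 92],
}


def ratio_to_neighbours(middle, below, above):
    """Ratio of each middle-row count to the mean of the rows either side."""
    return [m / ((b + a) / 2) for m, b, a in zip(middle, below, above)]


ratios = ratio_to_neighbours(
    rows["62.5k to 65k"], rows["60k to 62.5k"], rows["65k to 67.5k"]
)
for column, r in enumerate(ratios, start=1):
    print(f"column {column:2d}: {r:.2f}")
```

On these three rows the 86 and the 87 come out at about 0.31 and 0.36 of their neighbours' mean, the second and third lowest ratios in the row; the 65 in the second column is lowest at about 0.20. A low ratio is only a prompt for a closer look, not proof of an error.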

Question
When there is cross-referenced evidence that 'under reporting' has occurred ... is it normal to balance the books; much as one might do when smoothing a curve?
 
#3
Raw data can be messy and sometimes politically charged.

I also requested data, for a year's births in the US (2002, about 2 million) to study the age at which we breed. The age ranges were inconsistent and required re-assembly of smaller bin totals to form equal sized bins.

It's pretty much as I expected: Breeding is very low at the onset of puberty 10-14, picks up and starts developing second children 15-19, and peaks with third child births 20-24, then tapers off 25-29, and declines rapidly 30-34+
 
#4
Thanks for that AngleWyrm ... those stats are contrary to what I assumed to be happening.
If I had to bet money, I would have put it on peaking at 25-29 (for 2002), and 20-24 (for 1982).

Perhaps a cross-reference study of 'births out of wedlock' might correlate, as we are generally informed that marriages are occurring more between 28-32 ... I briefly looked at the stats, but must delve deeper, once I have a handle on Age/Income.

If I understand you correctly ... your 'bin totals' methodology is akin to the pinc 41 x 12 data groups (income groups x age groups).

You stated "re-assembly of totals".
What I'm looking at, are small modifications to certain totals in the 492 bins, because some of the 12 x Age Range totals do not correlate to the projected population, nor to norms of 25-44 etc.

I see no other solution.

The pattern of people having income from 25 onward should be relatively stable ... a single (less than 1) multiplier to the population shows this to be correct for 25-44, 50-54, and 60-64.
There is no reason at all for this to vary dramatically for ages 45-49 and 55-59.

Adjusting those last two totals (to match the population projections & the norms of 25-54), by small adjustments across the 41 income ranges ... will produce a viable database for 25-64.
I'd then call it quits for 65+

Other than that; I take note of your opening statement.
:)
 
#5
Addendum: In my second post, the table was incorrectly titled 'Population Without Income'.
In fact the Title should have been 'Population Totals by Age Group'.

This has now been corrected.
 
#6
By re-assembly, I added together the data for age bands that were less than five years to get summary totals of those five years.

I would caution against altering observations to fit an expected pattern. It's ok to not yet have a complete understanding of the data presented, but it's bad science to change data to fit a model. Someone's gonna say "that's not what I saw" and the tomatoes start flying at the podium.

Maybe develop a function that estimates income given age, and overlay that on a chart of the data; in this way the idea of an expected pattern is explicitly expressed, and differences between the model and the data become clear. Then further investigations can be performed on those differences; maybe something happened during a period of time relevant to that age group. Maybe something happened during data collection. Maybe the model needs tuning.
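A minimal sketch of that idea, assuming the simplest possible model: a constant share of each census age group has income. The share, the function names, and all the numbers are made up for illustration; the point is only that the expected pattern is written down explicitly and the differences are reported rather than edited away.

```python
def residuals_against_model(observed, model):
    """Return (observed - expected, percent difference) for every group."""
    out = {}
    for group, obs in observed.items():
        expected = model(group)
        out[group] = (obs - expected, 100.0 * (obs - expected) / expected)
    return out


# Illustrative numbers only (thousands); NOT real census or pinc figures.
census = {"25-29": 1000.0, "30-34": 1100.0, "45-49": 900.0}
observed = {"25-29": 900.0, "30-34": 990.0, "45-49": 900.0}

SHARE_WITH_INCOME = 0.9  # hypothetical model parameter


def constant_share_model(group):
    """Expected with-income count under the constant-share assumption."""
    return SHARE_WITH_INCOME * census[group]


residuals = residuals_against_model(observed, constant_share_model)
for group, (diff, pct) in residuals.items():
    print(f"{group}: {diff:+.1f}k ({pct:+.1f}%)")
```

Groups with a large residual are the ones to investigate further; the observations themselves stay untouched.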

Your approach of cross-referencing different data sources to get different perspectives on the information is solid, and develops a notion of trust where the more perspectives agree the more trustworthy the data.
 
#7
"I would caution against altering observations to fit an expected pattern"
Yes; in fact, this was the crux of the thread.
I was wanting to learn of the normal protocol, in these situations.
Thank you for clarifying this.

I had considered possible time-relevant scenarios; but my latest comparative study above indicates that the problem stems from the data not being proportional to the distribution.

The technical supplement (138 pages) https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar21.pdf
States:
"Currently, we interview about 54,000 households monthly, scientifically selected on the basis of area of residence to represent the nation as a whole ... A two-stage ratio estimation procedure adjusts the sample population to the known distribution of the entire population".
I have read and re-read it, highlighting relevant passages (it is effectively a free statistical text book - everything down to the actual questions)
The project covers a vast array of circumstances (housing, race, education etc. etc......) that are understandably fuzzy, that can only be determined through the use of all the formulae (that are listed).
However, the 'all race' 'male & female' 'Income by Age' is the least complex of all the studies, as it avoids all the fuzziness of race, housing, education, and income source.

The sampling does not include the institutionalized population,
"consisting primarily of the population in correctional institutions and nursing homes (98 percent of the 4.0 million institutionalized people in the 2010 Census) ... work experience data are not collected for Armed Forces members".
Prison Population year-end 2019: 1,430,200
Nursing Home Population year-end 2019: 1,246,079
Military ~ 1,400,000

According to urban myth, the average age of soldiers is "just 19" ( :D )
My study is 25+
Clearly, I need to examine the stats that are available.
However, while the prison population is spread across all ages (stats to be examined), the nursing home population will largely be above 75
... these make the pinc population distribution even more disproportionate, and they don't explain the disparity in the 45-49 and 55-59 age groups.

I have a pinc technical contact address ... I feel that I must use it, and see if I can gain clarification.
What the outcome will be (if any), is to be seen.
 

noetsi

#8
Although practitioners commonly don't agree with researchers on this point, I think it is generally felt by the latter that you only change data if it is clearly in error. Certainly this is true with outliers.

Having worked for a state/federal agency for more than a decade, I doubt any real-world data set is measured without error. There are likely lots of errors.
 
#9
Thanks for your perspective noetsi.
I do appreciate that the results must contain errors ... but surely not the population age groups, which are fixed when a Census Population Series is chosen (Low, Mid, High, or State).

The clear concern here, is that the pinc total population per age group doesn't match any population distribution per age group.
- 'Matching' would presumably be the most basic requirement, as per the stated method.

Primarily; we have 3 stated population projections in the published Table 1, plus the State count (similar to the Series Low).
If we remove 4 million institutionalized citizens, less (say) 1 million under-25s, we would be looking at a match to 'Distribution Series Mid' - listed above in the table: Population Totals by Age Group.

Either way; if the published study is to match the population distribution, then surely the primary requirement would be to simply extract the age group totals ... and then project how many people from that age group, would have an income, where they fit into the income groups, and how they are housed (and who is receiving benefits etc.)

The chosen distribution could then be stated in the published stats.
The only variance would be 'those outside the universe' (pinc study terminology for the non-included)

We would know that there would be errors in the placement of each individual, but at the very least we would have the study population matching the projected population (as per the stated method).

Let us not forget, that the data comes from a monthly questionnaire (54,000 samples per month), AND the Population Census.
Therefore, all the data is a projection.
It can be calculated to any chosen total.

I.e., where the population totals are concerned, there is no need for any calculation, but for the removal of the non-included.
The calculation, would be to fit the data into the actual population (or at least our chosen population distribution).

Therefore, we shouldn't even be discussing errors in the primary population projections, because they should be the starting point ... almost copy & paste!

This would then enable perfect sub-division to state and county level (perfect to stated error correction, from decades of studies)

We are not dealing with different races ... but if we were, then those race totals combined should equal the All Race totals, that should equal the chosen population projection.
To me, this is fundamental calculation methodology, that applies to all fields of study.

From your experience, noetsi ... how do you see this calculation methodology applying in your field of expertise?

Note: This is a serious question, because I'm at a loss to understand the pinc totals, and I want to contact the pinc rep when I am best informed.
 

noetsi

#10
I am not sure why what you say would not be true. My point is that the real population, and even its distribution would not match estimates of it. The census numbers are not what the true population is at any point in time. Things that are defined, like age ranges, would seem to be true by definition since they only exist by definition. But inconsistencies can occur in reports or between reports. Different federal agencies define race differently I believe and this changes over time.
 
#11
Thanks for your reply noetsi, but it is all by the by now because I've solved the riddle of the pinc study!

... and I'm feeling so good ... relief, happy, and I have that sense of self-satisfaction that comes from cracking what is seemingly an impossible problem.

... and it is a solution that is a great share, because this problem of mis-matched totals, may be the norm in many studies (for reasons that we don't know).

The solution is so simple ... the key is knowing that the problem can be solved.
Sorry chaps, but I have to tell you how I found the solution ... Hahahah! It's payback for the hours and days I've spent getting nowhere :D

In fact, I had stated the solution in my last post (calculate to the correct total!) ... but I was morose at the time, and didn't appreciate that this is what 'I' should do (rather than the stat dev).

Instead, I created numerous tables, looking for a pattern.
It was getting out of hand, and I realised that I must accurately title, label, and highlight the data ... that was, and IS the key!

I would seriously advise anyone who is creating numerous test tables, to take the time to label them for 'at a glance' reference.
Not only is this good for the brain ... it enables you to see that the data is wrong (probably a copy and paste of a formula with a fixed reference). ;)

This morning, I took a fresh look, and decided to work on, comparing 'the pinc Total Population AND the With Income Population' TO the census count.

The reason being that I needed to understand what is the percentage of the (census) population, that has No Income, or Has Income.
I knew that this must be relatively stable after 30 ... perhaps higher 'below 30' (young adults living at home), and perhaps dropping slightly with age.
I also thought perhaps that the State Safety Net might kick in more after 65 (explaining the strange figures).

I began to think about multiplying the With Income Age Group Totals to match the census Age Group Totals ... and then I could apply a descending multiplier, to determine those who had income, and who didn't.
Yes! I actually toyed with this idea ... don't forget that I was lost.

Obviously, all the presented figures were wrong (hahahah there were more Over 75's than actually exist) o_O

... and then I saw it (the highlighting had paid off).
Can you see it (have a look)?
pinc_Total_Pop_and_Income_Pop_percent_of_Census.png

Have you got it?

I noticed that where the pinc Total Populations are high against the census count ... the percentage population Without Income is high.
... and 'at a glance, there seemed to be a pattern'.

I had the light-bulb moment!
Because I had prepared the data as percentages ... I could subtract the With Income Population % from the Total Population %
... and at a stroke, I would have the pinc Population Without Income % of the census count!!!!!!!!
I.e. the Without Income percentage of the Real Population (census count population).
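The arithmetic, as a small sketch: both pinc figures are expressed as a percentage of the same census count, and the Without Income percentage is the Total Population percentage minus the With Income percentage. The function names and the sample numbers are illustrative only, not values from the tables.

```python
def percent_of_census(count, census_count):
    """Express a pinc count as a percentage of the census count for that age group."""
    return 100.0 * count / census_count


def without_income_percent(total_pop, with_income_pop, census_pop):
    """Without Income % of census = Total Population % - With Income Population %."""
    total_pct = percent_of_census(total_pop, census_pop)
    with_income_pct = percent_of_census(with_income_pop, census_pop)
    return total_pct - with_income_pct


# Illustrative counts (thousands) for one age group; NOT real pinc or census data.
pct = without_income_percent(total_pop=1050.0, with_income_pop=950.0, census_pop=1000.0)
print(pct)
```

Note that the pinc total may exceed 100% of the census count for a group; the difference between the two percentages is still well defined because both share the same denominator.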

I have never ever before, been so excited to type simple arithmetic (I knew it was going to work) :cool:
... and it did:
pinc_Total_Pop_and_Income_Pop_and_Without_Income_percent_of_Census.png
The % Without Income is pretty much as predicted.
Slightly lower than I had thought, but I know that it is correct to the data.
... and all the stupid, impossible problems with the ridiculous totals ... vanished!

It was then, and only then, that I realised that I had stated what should be done, in my previous post.

Why the original crazy totals?
It doesn't matter (my guess is that there is a good business case for it).
Either way; for those few of you who have read this thread ... when you fall upon stats with crazy totals, you will know to first re-total to what is known.

This one goes into my cache of 'great victories' ... even though it is a simple solution ... the key is in knowing the solution.

Next up:
I'm going to need help with State Median Income variables to adjust Income Group populations according to region.
I'll first produce the new tables, and then ask for advice.

Thanks to those who contributed to this discussion :)