Sandia Simpson’s Paradox: a data dust-up

[Some details on A Nation at Risk, as previously discussed here.]

In the widely covered A Nation at Risk, one of the featured datasets was national SAT/ACT scores.[1] And the data showed a downward slide in test scores, much to no one’s surprise.

In 1990, a random Navy Admiral commissioned some data nerds at Sandia Labs (national nuclear labs) to review the data in A Nation at Risk. (No one seems to know why this happened other than the Admiral suspected the analysis in A Nation at Risk was wrong.)

The Sandia nerds agreed: some of the conclusions in A Nation at Risk were wrong. What Sandia discovered was that while the national test score trend was declining, the subgroups (or cohorts) within that trend showed increasing scores. So:

national trend: down

subgroups: up

So, test scores hadn’t been declining. (Really, they’d pretty much been staying the same.)

Needless to say, the Department of Education firmly rejected this analysis. Diane Ravitch, then employed by the Department of Education, claimed that Sandia was arguing to “Just take out the scores of black and Hispanic and inner-city kids … and the picture was rosy.”[2]

I don’t know if Diane Ravitch is lying, stupid, or evil. It’s certainly evil to wantonly accuse people of racism. But Sandia was most certainly not making that argument.

Rather, Sandia observed a Simpson’s Paradox, which is actually quite common in certain datasets (education, healthcare, etc.). Whenever one has averages of subgroups and then aggregates those averages into an overall average, one is always making a weighting decision. One may simply average the averages (weighting each subgroup equally), or one may pool the underlying quantities and recompute the average (weighting each subgroup by its size, which is what usually happens). Either way, a weighting decision has been made.
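To make that weighting decision concrete, here is a minimal sketch in Python. The two cohorts and their sizes and mean scores are invented for illustration; the point is only that the two aggregation choices give different answers from the same data:

```python
# Two hypothetical cohorts: name -> (number of students, mean score).
# All numbers are made up for illustration.
cohorts = {"A": (1000, 520), "B": (250, 450)}

# Option 1: average the averages (each cohort weighted equally).
unweighted = sum(mean for _, mean in cohorts.values()) / len(cohorts)

# Option 2: pool the totals and recompute (each student weighted equally).
total_points = sum(n * mean for n, mean in cohorts.values())
total_students = sum(n for n, _ in cohorts.values())
weighted = total_points / total_students

print(unweighted)  # 485.0
print(weighted)    # 506.0
```

Same data, two defensible answers, 21 points apart. Whichever one you report, you have made a weighting decision.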

A Simpson’s Paradox is what happens when, with averages over time, both the subgroup averages and the weighting change, and the movement in the overall average is driven by the changing weighting rather than by any change within the subgroups (in education and healthcare, that often means shifting populations … in quantity, geography, etc.).

Primer: every average computation has three components: the total quantity (of whatever you’re averaging), the number of things you’re averaging, and the average itself (the average being the total divided by the number). The question becomes: if the average changed, is it because the total changed, or the number, or both? And by how much? When aggregating averages, “the number of things” acts as a weighting mechanism. Over time, if “the number of things” is changing faster than the average is, the weighting will either magnify the change in the average (when both move in the same direction) or counteract it (when they move in opposite directions).

So, in education assessment statistics, if the lowest cohort’s headcount is growing faster than that cohort’s scores are improving, it will drag down the overall average in proportion to that cohort’s share of the entire group.

It works the other way too — even if your highest cohort is improving, if their “number of things” is rapidly decreasing, that improvement may not appear in the overall average.

So you can get increases across all cohorts individually while the overall average (i.e. test scores) decreases.
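The whole effect fits in a few lines. In this sketch (all numbers invented), both cohorts improve by 10 points between two years, yet the pooled average falls anyway, because the weighting shifts toward the lower-scoring cohort:

```python
def overall_mean(cohorts):
    """Pooled mean across a list of (size, mean) pairs."""
    total = sum(n * m for n, m in cohorts)
    count = sum(n for n, m in cohorts)
    return total / count

# Hypothetical two-cohort population at two points in time: (n, mean) per cohort.
year1 = [(900, 500), (100, 400)]
year2 = [(600, 510), (400, 410)]  # every cohort's mean is up 10 points

print(overall_mean(year1))  # 490.0
print(overall_mean(year2))  # 470.0 -- the overall mean falls anyway
```

Swap in real cohort counts and means and the same function tells you how much of any year-over-year change is weighting (shifting population) rather than learning.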

What few in education want to admit (even though some know this) is that any (or all) increase or decrease in “progress” may only be a reflection of a change in demographics. And a change in data demographics may be caused by population shifts, or changes in definitions, or any number of other issues. Does ‘white’ include the Polish kid who just stepped off the plane and doesn’t really speak English? Does ‘Asian’ include Japanese and Laotian? That’s not a random question — just check out how many Laotians are enrolled in Houston public schools. And we’ve yet to touch on the patently racist absurdity of aggregating all blacks into one category; there’s no data that suggests that all ‘black’ students should be grouped. (Just compare rural Louisiana blacks to urban Milwaukee blacks, then ask yourself why those Milwaukee blacks consistently score so much lower.)

And, ultimately, such data funkiness occurs because we’ve created artificial subgroups. The math is behaving perfectly normally, but when one decides to create cohorts by an artificial standard such as skin pigmentation, then the math appears to go wonky. It’s not the math that’s wonky; it’s you.

So an apparent decline in scores in the urban areas in the 1960s and 1970s may just be population shifts (e.g. white flight); minorities in those communities may actually be doing better. Or, if a district has an influx of immigrants, test scores may be affected. If a bunch of Nigerians step off a plane, they’ll get lumped into the “black” category; if Ft. Lee, New Jersey gets flooded with Koreans, the district may experience sudden “progress” (a Ft. Lee school was once ranked top ten in the nation … was it a good school, or was the improvement just the result of the Korean flood?). Over the past 15 years, there’s been an influx of Puerto Ricans in central Florida, who are the same ‘Hispanic’ as the Cubans in south Florida. But if you think they’re the same, then I cordially invite you to go to Miami, find a Cuban, and tell him that. You may want to bring a friend to call 911 for you.

Overall averages — across districts, cities, states — conceal all that. And that’s what Sandia was pointing out.

QED: Navy 1–0 Department of Education.

Perhaps most amazing in the Sandia dust-up was that no one pointed out the obvious: all of this was a squabble over a few points on the SAT (16 overall, to be exact) and a similarly small number of points on the ACT. Invective raises the hackles of education people over what any reasonably informed ed-data person will tell you is statistically meaningless. The nation gets worked up over a few points of increase or decline on the SAT, and yet such a difference is often the result of just a single additional question (correct or incorrect); sometimes the exact same number of correct answers produces different scores (depending on how the test is curved). And the real-world effect of such a difference is absolutely nothing.

So, really, who cares?

[1] SAT/ACT scores were used because they were the only high-quality national ed datasets available.