When Delphi-Facebook and the U.S. Census Bureau provided estimates of COVID-19 vaccine uptake last spring, their weekly reports drew on responses from as many as 250,000 people.

The data sets boasted statistically tiny margins of error, raising confidence that the numbers were correct. But when the Centers for Disease Control and Prevention reported actual vaccination rates, the two polls were off — by a lot. By the end of May, the Delphi-Facebook study had overestimated vaccine uptake by 17 percentage points — 70 percent versus 53 percent, according to the CDC — and the Census Bureau’s Household Pulse Survey had done the same by 14 percentage points.
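To see why those margins of error looked so reassuring, here is a minimal sketch of the textbook calculation. It assumes a simple random sample and the normal approximation, which is not how either survey computed its published intervals. The function name `margin_of_error` is illustrative, and the inputs are the figures quoted above.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion.

    Assumes a simple random sample; this is the assumption that breaks
    down when the sample is systematically biased.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

n = 250_000        # upper end of the weekly response counts cited above
p_hat = 0.70       # Delphi-Facebook estimate of vaccine uptake
cdc_rate = 0.53    # CDC benchmark for the same period

moe = margin_of_error(p_hat, n)
actual_error = abs(p_hat - cdc_rate)

print(f"nominal margin of error: +/- {100 * moe:.2f} percentage points")
print(f"actual error vs. CDC:    {100 * actual_error:.0f} percentage points")
```

Under these assumptions the nominal margin comes out below 0.2 percentage points, while the observed gap is 17 points. The formula only measures sampling noise and says nothing about bias.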

A comparative analysis by statisticians and political scientists from Harvard, Oxford, and Stanford universities concludes that the surveys fell victim to the “big data paradox,” a mathematical tendency of big data sets to minimize one type of error — due to small sample size — but magnify another that tends to get less attention: flaws linked to systematic biases that make the sample a poor representation of the larger population.

The big data paradox was identified and named by one of the study’s authors, Harvard’s Xiao-Li Meng, the Whipple V.N. Jones Professor of Statistics, in his 2018 analysis of polling during the 2016 presidential election. Famous for predicting a Hillary Clinton victory, the polls were skewed by “nonresponse bias,” which in this case was the tendency of Trump voters to either not respond or define themselves as “undecided.”

A biased big data survey can be worse than no survey at all, says Meng, because with no survey, researchers at least understand that they don’t know the answer. When underlying bias is poorly understood — as in the 2016 election — it can be masked by confidence created by the large sample size, leading researchers and readers astray.

“The larger the data size, the surer we fool ourselves when we fail to account for bias in data collection,” the paper’s authors wrote in their analysis, published Wednesday in the journal Nature.

The misleading results can be particularly harmful when actions are based on them, the authors note. The governor of a state where a survey shows that 70 percent are vaccinated against COVID, for example, might relax public health measures. If actual vaccination rates are closer to 55 percent, the move could result in a spike in cases and a rise in COVID deaths.

“All around the world, policymakers and scientific advisers are trying to make sense of COVID data,” said Seth Flaxman, an associate professor at Oxford University, a 2008 alumnus of Harvard’s computer science and mathematics program, and corresponding author of the paper. “Reported cases are a fraction of true infections, COVID-19-attributed deaths are a severe undercount of the true toll of this pandemic, and electronic medical records do not give us the full picture of long COVID. When it comes to survey data, all sorts of data quality issues, such as vaccinated respondents being more likely to respond to surveys and marginalized groups being underrepresented, can lead to incorrect estimates.”

Though it is broadly known that survey accuracy comes from both data quantity and data quality, quantity has stolen the spotlight in recent years as technology has dramatically increased our ability to collect and process massive data sets. The imbalance should put investigators on guard, said Shiro Kuriwaki, a first author of the paper who received his Ph.D. in government from Harvard last spring and is now a postdoctoral fellow at Stanford.

“There’s this drive to get the biggest data sets possible and modern technology, big data, has made that possible,” Kuriwaki said. “What that allows is analysis at a more granular level than ever before, but we need to be mindful that biases in the data get worse with bigger sample size, and that can carry right to the subgroups.”

Meng began thinking about the problems posed by big data when he and other statisticians met with a visiting U.S. Census official at Harvard a decade ago. Using the hypothetical of tax data collected by the IRS, the official asked the statisticians whether they would prefer a sample covering 5 percent of the population that they knew was representative of the larger population, or IRS data that they weren’t sure was representative but covered 80 percent of the population. The statisticians chose the 5 percent. “What if it was 90 percent?” the official asked. The statisticians still chose the 5 percent, because a solid understanding of the data meant that their answer would likely be more accurate than one based on a larger set with unknown biases.
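The official's hypothetical can be simulated. The sketch below is an illustration with invented numbers, not the statisticians' own analysis: the population size, the true rate, and the response probabilities (0.85 for people with the trait, 0.75 for those without) are all assumptions, chosen so that the self-selected sample covers about 80 percent of the population.

```python
import random

random.seed(0)

N = 200_000        # hypothetical population size (assumption)
TRUE_RATE = 0.50   # hypothetical share of people with the trait (assumption)

# 1 = has the trait, 0 = does not
population = [1 if random.random() < TRUE_RATE else 0 for _ in range(N)]
true_mean = sum(population) / N

# Option A: a simple random sample covering 5% of the population.
small_sample = random.sample(population, N // 20)
small_estimate = sum(small_sample) / len(small_sample)

# Option B: a self-selected sample covering roughly 80% of the population.
# People with the trait are slightly more likely to end up in the data
# (0.85 vs. 0.75); that small difference is the systematic bias.
big_sample = [y for y in population
              if random.random() < (0.85 if y == 1 else 0.75)]
big_estimate = sum(big_sample) / len(big_sample)

print(f"true rate:                {true_mean:.4f}")
print(f"5% random sample (n={len(small_sample)}):   {small_estimate:.4f}")
print(f"~80% biased sample (n={len(big_sample)}): {big_estimate:.4f}")
```

With these made-up settings the biased sample is about 16 times larger, yet its estimate sits around three percentage points too high. The random sample's error is only sampling noise, which at this size is typically well under one point.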

“Every data set is going to have certain quirks, but the question is whether the quirk matters to whatever your problem is,” said Meng, whose work is partially funded by the National Science Foundation. “Social media has tons of data just sitting there. And they may think they have a public sample, but may not realize that their population is biased to start.”

Indeed, nonresponse bias remains pernicious even when researchers are mindful of it. For example, a 2020 article by Kuriwaki and another co-author of the current study, Harvard undergraduate Michael Isakov, correctly predicted overconfidence in 2020 presidential election polls even after new methods had been introduced in the aftermath of 2016.

“In the current paper, we found that while both the Delphi-Facebook and Census Bureau researchers attempted to account for potential issues, their corrections were simply not enough to alleviate all of the bias,” Isakov said.

The study — conducted with Oxford’s Dino Sejdinovic — identifies areas of potential bias in the vaccination polls. The Delphi-Facebook reports, which drew from daily users of the social media site, didn’t account for factors like education level and race and ethnicity. The Census Bureau study corrected for both education and race and ethnicity, but neither survey collected data on partisanship of respondents, which may influence vaccine uptake. Also, neither adjusted their sample to represent distribution of urban and rural areas, another potentially important factor.

“The U.S. government is spending billions of dollars this year doing targeted outreach to try to get people who are not vaccinated, vaccinated,” said Valerie Bradley ’14, an alumna of Harvard’s statistics program, Ph.D. student at Oxford University, and a first author of the paper. “And if you are guiding that based on the Census Household Pulse or Facebook survey, you might be pouring literally billions of dollars into the wrong communities.”

By comparison, researchers running a more traditional poll, conducted by Axios-Ipsos with just 1,000 respondents, took pains to ensure the sample was representative of the larger population. They accounted for education, race, ethnicity, and political partisanship, and even provided tablets with internet access to “offline” respondents to ensure their points of view were registered. Despite the smaller sample size, the Axios-Ipsos estimates of vaccine uptake were similar to CDC numbers.

The ultimate effect of the uncorrected bias in the large polls, the authors said, was a collapse in effective sample size. Adjusted for bias, the Delphi-Facebook poll, despite surveying 250,000 respondents, had an effective sample size of less than 10 in April 2021, a 99.99 percent reduction from its raw average weekly sample size. Similarly, the Census Household Pulse, which tallied 75,000 responses weekly, had an effective sample size 99 percent lower than its raw count in May 2021.
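One rough way to get a feel for an effective sample size is to ask how large a simple random sample would need to be for its typical squared error to match the squared error actually observed. The sketch below uses that simplification, ignoring the finite-population correction. It is not necessarily the calculation in the paper, and it borrows the late-May figures quoted earlier (a 17-point error against a CDC rate of 53 percent) purely for illustration. The function name `effective_sample_size` is hypothetical.

```python
def effective_sample_size(true_rate, observed_error):
    """Size of a simple random sample whose expected squared error,
    true_rate * (1 - true_rate) / n, equals observed_error ** 2.

    A simplification: ignores the finite-population correction and is
    not claimed to be the exact formula used in the paper.
    """
    return true_rate * (1.0 - true_rate) / observed_error ** 2

cdc_rate = 0.53      # CDC benchmark quoted in the article
delphi_error = 0.17  # Delphi-Facebook overestimate (70% vs. 53%)
raw_n = 250_000      # raw weekly responses quoted in the article

n_eff = effective_sample_size(cdc_rate, delphi_error)
reduction = 1.0 - n_eff / raw_n

print(f"effective sample size: about {n_eff:.1f}")
print(f"reduction from raw n:  {100 * reduction:.3f}%")
```

Even with this back-of-the-envelope version, the result lands below 10 respondents, a reduction of more than 99.99 percent from the raw count.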

“If you have the resources, invest in data quality far more than you invest in data quantity,” Meng said. “Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data.”

