Inference on Predicted Data (IPD)

The problem: a researcher is interested in studying an outcome, Y, which is difficult to measure due to practical constraints such as time or cost. But they do have access to relatively cheap predictions of Y. They hypothesize that Y is associated with X, a set of features that are easier to measure. Their goal is to estimate a parameter of scientific interest, θ, which describes the relationship between X and Y. How can the researcher obtain valid estimates of θ while relying mostly on predicted outcomes of Y?

- The regression we want is expensive but precise: Y = θ1X
- The regression we have is cheaper but noisier: Ŷ = θ2X

Importantly, θ1 is not the same as θ2!
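
To make this concrete, below is a minimal simulation sketch in Python (purely illustrative; it does not use the `ipd` package, and the shrinkage factor of 0.6 is an arbitrary stand-in for a predictive model that pulls predictions toward the mean):

```python
import numpy as np

# Simulate the gap between the regression we want and the one we have.
rng = np.random.default_rng(0)
n, theta1 = 10_000, 2.0

X = rng.normal(size=n)
Y = theta1 * X + rng.normal(size=n)              # expensive ground truth
Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)  # cheap, shrunken prediction

slope = lambda y: np.polyfit(X, y, 1)[0]         # OLS slope
print(f"theta1 (Y ~ X):     {slope(Y):.2f}")     # ~2.0
print(f"theta2 (Y_hat ~ X): {slope(Y_hat):.2f}")  # ~1.2, not 2.0
```

Swapping Ŷ in for Y silently changes the estimand: the naive fit targets θ2, not θ1.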

For a more extensive overview, please see Hoffman et al. (2024) and the code repository for the `ipd` package.

Explainer in progress. Please check back in a few weeks :)

What is "predicted data"?


In machine learning, "predicted data" are often thought of as the outputs of some kind of complicated algorithm. I opt for an even broader definition: any indirect measure of a conceptual variable for which a better, more direct measure exists. This definition of course includes predictions from black-box machine learning or artificial intelligence models like ChatGPT, but it also includes examples we encounter often as social scientists: survey responses, interviews, imputations, statistical estimates, derived measures, and a whole host of other proxies. Below are some examples I have come across in my own research.

Every conceptual variable comes with different measurement challenges, but in general, more precise measurements are also more expensive to collect. The stylized image below places "ground truth" direct measures in the light blue region, with predicted data everywhere else. Because the best measures tend to be the most expensive, in practice we often end up relying on cheaper, noisier predicted data. But not all predicted data are created equal: the best live in the green region (relatively precise and cheap), the worst in the red region (noisy and expensive).

[Figure: Predicted vs. Ground Truth Data - measurements vary in both cost and precision]

| Variable | Ground Truth | Predicted |
| --- | --- | --- |
| Cause of Death | Vital Registration | Verbal Autopsy |
| Obesity | Fat Percentage | BMI |
| Income | Admin Data | Self-Reported |
| Environmental Attitude | Questionnaire | NLP Sentiment |


What does it mean if statistical inference on predicted data is invalid?


In this context, valid statistical inference means both unbiased point estimates and well-calibrated uncertainty bounds. Relative to inference performed with "ground truth" outcomes, inference on predicted data may produce biased point estimates, due to systematic differences between the predictions and the ground truth, and the reported uncertainty will be deceptively narrow, because it does not account for any of the prediction error.

Why does this matter? Consider a very simple hypothesis test where the p-value tells us whether an observed relationship between X and Y is statistically significant. That conclusion is a function of both the point estimate and the uncertainty around it. The stylized diagram below demonstrates how bias and overly narrow (anti-conservative) uncertainty can lead to very different scientific conclusions.

[Figure: Inference can have bias and/or misleading uncertainty vs. valid inference]
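
Continuing the toy simulation from above, the sketch below repeats the naive analysis many times and checks how often the usual 95% confidence interval from the predicted-outcome regression covers the true θ1. Because the point estimate is biased and the interval ignores prediction error, coverage collapses far below the nominal 95% (all simulation settings are, again, arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, theta1 = 500, 2_000, 2.0

covered = 0
for _ in range(reps):
    X = rng.normal(size=n)
    Y = theta1 * X + rng.normal(size=n)
    Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)
    b = np.polyfit(X, Y_hat, 1)                      # naive fit on predictions
    resid = Y_hat - np.polyval(b, X)
    se = np.sqrt(resid.var(ddof=2) / (n * X.var()))  # textbook OLS slope SE
    covered += abs(b[0] - theta1) < 1.96 * se

print(f"95% CI coverage of theta1: {covered / reps:.1%}")  # near 0%
```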

What is the intuition for how we correct this? *cite IPD lit*
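
Pending the full write-up, here is a rough sketch of one core idea from the IPD literature (e.g., prediction-powered inference; Angelopoulos et al. 2023): fit the model on all of the cheap predictions, then use a small labeled subset, where both Y and Ŷ are observed, to estimate and subtract the prediction-induced bias. The sketch below illustrates only the point-estimate correction on the toy simulation from above; a real analysis should use the `ipd` package, which is designed to handle the uncertainty as well:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_lab, theta1 = 10_000, 300, 2.0

X = rng.normal(size=n)
Y = theta1 * X + rng.normal(size=n)              # ground truth (in practice,
                                                 # seen only on the labeled subset)
Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)  # predictions, seen everywhere
lab = rng.choice(n, size=n_lab, replace=False)   # small labeled subsample

slope = lambda x, y: np.polyfit(x, y, 1)[0]
naive = slope(X, Y_hat)                          # biased, ~1.2
# "Rectifier": the bias, estimated where we see both Y and Y_hat
rectifier = slope(X[lab], Y[lab]) - slope(X[lab], Y_hat[lab])
corrected = naive + rectifier                    # ~2.0 on average
print(f"naive: {naive:.2f}  corrected: {corrected:.2f}")
```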



IPD checklist: how do you identify whether you (1) have an IPD problem and (2) have what you need to perform a correction?



Some examples? Verbal autopsy, BMI, etc.