Inference on Predicted Data (IPD)

The problem: a researcher is interested in studying an outcome, Y, which is difficult to measure due to practical constraints such as time or cost. But they do have access to relatively cheap predictions of Y. They hypothesize that Y is associated with X, a set of features that are easier to measure. Their goal is to estimate a parameter of scientific interest, θ, which describes the relationship between X and Y. How can the researcher obtain valid estimates of θ while relying mostly on predicted outcomes of Y?

- The regression we want is expensive but precise: Y = θ1X
- The regression we have is cheaper but noisier: Ŷ = θ2X

Importantly, θ1 is not the same as θ2!
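
To make this concrete, below is a minimal simulation sketch in Python (purely illustrative; it does not use the `ipd` package, and the shrinkage factor of 0.6 is an arbitrary stand-in for a predictive model that pulls predictions toward the mean):

```python
import numpy as np

# Simulate the gap between the regression we want and the one we have.
rng = np.random.default_rng(0)
n, theta1 = 10_000, 2.0

X = rng.normal(size=n)
Y = theta1 * X + rng.normal(size=n)              # expensive ground truth
Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)  # cheap, shrunken prediction

slope = lambda y: np.polyfit(X, y, 1)[0]         # OLS slope
print(f"theta1 (Y ~ X):     {slope(Y):.2f}")     # ~2.0
print(f"theta2 (Y_hat ~ X): {slope(Y_hat):.2f}")  # ~1.2, not 2.0
```

Swapping Ŷ in for Y silently changes the estimand: the naive fit targets θ2, not θ1.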

For a more extensive overview, please see Hoffman et al. (2024) and the code repository for the `ipd` package.

Explainer in progress. Please check back in a few weeks :)

What is "predicted data"?


In machine learning, "predicted data" are often thought of as the outputs of some kind of complicated algorithm. I opt for an even broader definition: any indirect measure of a conceptual variable for which a better, more direct measure exists. This definition of course includes predictions from black-box machine learning or artificial intelligence models like ChatGPT, but it also includes examples we encounter often as social scientists: survey responses, interviews, imputations, statistical estimates, derived measures, and a whole host of other proxies. Below are some examples I have come across in my own research.

Every conceptual variable comes with different measurement challenges, but in general, more precise measurements are also more expensive to collect. The stylized image below places "ground truth" direct measures in the light blue region, with predicted data everywhere else. Because the best measures tend to be the most expensive, in practice we often end up relying on cheaper, noisier predicted data. But not all predicted data are created equal: the best live in the green region (relatively precise and cheap), the worst in the red region (noisy and expensive).

[Figure: Predicted vs. Ground Truth Data - measurements vary in both cost and precision]

| Variable | Ground Truth | Predicted |
| --- | --- | --- |
| Cause of Death | Vital Registration | Verbal Autopsy |
| Obesity | Fat Percentage | BMI |
| Income | Admin Data | Self-Reported |
| Environmental Attitude | Questionnaire | NLP Sentiment |


What does it mean if statistical inference on predicted data is invalid?


In this context, valid statistical inference means both unbiased point estimates and well-calibrated uncertainty bounds. Relative to inference performed with "ground truth" outcomes, inference on predicted data may produce biased point estimates, due to systematic differences between the predictions and the ground truth, and the reported uncertainty will be deceptively narrow, because it does not account for any of the prediction error.

Why does this matter? Consider a very simple hypothesis test where the p-value tells us whether an observed relationship between X and Y is statistically significant. That conclusion is a function of both the point estimate and the uncertainty around it. The stylized diagram below demonstrates how bias and overly narrow (anti-conservative) uncertainty can lead to very different scientific conclusions.

[Figure: Inference can have bias and/or misleading uncertainty vs. valid inference]
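
Continuing the toy simulation from above, the sketch below repeats the naive analysis many times and checks how often the usual 95% confidence interval from the predicted-outcome regression covers the true θ1. Because the point estimate is biased and the interval ignores prediction error, coverage collapses far below the nominal 95% (all simulation settings are, again, arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, theta1 = 500, 2_000, 2.0

covered = 0
for _ in range(reps):
    X = rng.normal(size=n)
    Y = theta1 * X + rng.normal(size=n)
    Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)
    b = np.polyfit(X, Y_hat, 1)                      # naive fit on predictions
    resid = Y_hat - np.polyval(b, X)
    se = np.sqrt(resid.var(ddof=2) / (n * X.var()))  # textbook OLS slope SE
    covered += abs(b[0] - theta1) < 1.96 * se

print(f"95% CI coverage of theta1: {covered / reps:.1%}")  # near 0%
```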

What is the intuition for how we correct this? *cite IPD lit*
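
Pending the full write-up, here is a rough sketch of one core idea from the IPD literature (e.g., prediction-powered inference; Angelopoulos et al. 2023): fit the model on all of the cheap predictions, then use a small labeled subset, where both Y and Ŷ are observed, to estimate and subtract the prediction-induced bias. The sketch below illustrates only the point-estimate correction on the toy simulation from above; a real analysis should use the `ipd` package, which is designed to handle the uncertainty as well:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_lab, theta1 = 10_000, 300, 2.0

X = rng.normal(size=n)
Y = theta1 * X + rng.normal(size=n)              # ground truth (in practice,
                                                 # seen only on the labeled subset)
Y_hat = 0.6 * Y + rng.normal(scale=0.5, size=n)  # predictions, seen everywhere
lab = rng.choice(n, size=n_lab, replace=False)   # small labeled subsample

slope = lambda x, y: np.polyfit(x, y, 1)[0]
naive = slope(X, Y_hat)                          # biased, ~1.2
# "Rectifier": the bias, estimated where we see both Y and Y_hat
rectifier = slope(X[lab], Y[lab]) - slope(X[lab], Y_hat[lab])
corrected = naive + rectifier                    # ~2.0 on average
print(f"naive: {naive:.2f}  corrected: {corrected:.2f}")
```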



IPD checklist: how do you identify whether you (1) have an IPD problem and (2) have what you need to perform a correction?



Some examples? Verbal autopsy, BMI, etc.