How Accurate and Reliable Are Synthetic Users?
Let’s address the elephant in the room.
You’re a trained researcher—skilled in study design, spotting bias, and separating signal from noise. You’ve seen too many insights derailed by bad data or flawed methodology.
So when someone claims an AI can simulate consumer responses and deliver reliable insights, your skepticism isn’t just fair—it’s professional.
The real question isn’t whether to be skeptical, but whether the evidence justifies moving from doubt to cautious confidence.
Let’s explore that evidence.
The Academic Validation: What the Research Shows
Between 2023 and 2025, a growing body of peer-reviewed research has tested synthetic users against real human participants. The findings are surprisingly consistent.
Study 1: The 90% Test-Retest Reliability Benchmark
In October 2024, researchers published a breakthrough study titled "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings".
The test: They compared LLM-generated synthetic user responses against real human responses across 57 consumer product surveys involving 9,300 human participants from a major personal care corporation.
The result: Synthetic users achieved 90% of human test-retest reliability.
Why does this matter? Test-retest reliability measures how consistently people answer the same question when asked twice. Humans don't perfectly replicate their own answers—life circumstances change, moods shift, memory fades. If humans score 100% reliability against themselves, and synthetic users score 90% compared to those same humans, that's remarkably close.
The implication: For directional insights and pattern detection, synthetic users perform nearly as reliably as real participants perform against themselves.
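To make the 90% figure concrete, here is a minimal sketch of the underlying idea, not the paper's actual procedure: treat humans' agreement with their own earlier answers as the ceiling, measure how well a synthetic panel agrees with the humans, and take the ratio. All data below are invented, and correlation stands in for whatever reliability statistic a given study uses.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical Likert ratings (1-5) for the same purchase-intent question.
# wave_1 / wave_2: the same human panel asked twice, some weeks apart.
# synthetic:       an LLM-based panel answering the wave-1 questionnaire.
rng = np.random.default_rng(0)
wave_1 = rng.integers(1, 6, size=200)
wave_2 = np.clip(wave_1 + rng.integers(-1, 2, size=200), 1, 5)     # humans drift a little
synthetic = np.clip(wave_1 + rng.integers(-1, 2, size=200), 1, 5)  # synthetic twin per respondent

human_retest_r, _ = pearsonr(wave_1, wave_2)   # humans vs. their own earlier answers
synthetic_r, _ = pearsonr(wave_1, synthetic)   # synthetic panel vs. humans

# "90% of human test-retest reliability" corresponds to a ratio near 0.9.
print(f"human test-retest r:       {human_retest_r:.2f}")
print(f"synthetic-vs-human r:      {synthetic_r:.2f}")
print(f"ratio (synthetic / human): {synthetic_r / human_retest_r:.2f}")
```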
Study 2: Harvard Business School on Willingness-to-Pay
Harvard Business School researchers (Brand, Israeli, Ngwe) tested whether GPT-derived willingness-to-pay estimates matched real consumer studies.
Key findings:
- WTP estimates from LLMs were realistic and comparable to human studies
- Fine-tuning with previous survey data improved alignment significantly
- The method worked well for testing new product features
- Limitations appeared when testing entirely novel product categories
 
The implication: Synthetic users are particularly effective when you have some baseline data to calibrate against. They're less effective in completely uncharted territory.
Study 3: Generative Agent Simulations at Scale
A November 2024 study titled "Generative Agent Simulations of 1,000 People" simulated 1,052 real individuals using qualitative interview data.
The result: Generative agents replicated participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later.
The implication: At scale, synthetic users achieve reliability levels approaching human consistency—a threshold that makes them genuinely useful for early-stage research.
Study 4: Real-World Validation at EY
Moving from academic studies to corporate validation: When EY tested Evidenza's synthetic consumer technology, they ran it against their annual brand survey of senior executives at major companies.
The result: 95% correlation between synthetic and real survey responses.
Toni Clayton-Hine, EY Americas CMO, described the matches as "astounding"—and notably, the synthetic study was completed in "a few days" versus "months" for the traditional survey.
The implication: In real-world business contexts (not just academic labs), synthetic users can produce results that closely mirror human studies.
[Diagram placeholder: Accuracy comparison chart showing 85-95% range across multiple studies]
How Accuracy Is Measured
When researchers talk about synthetic user "accuracy," they're usually measuring one of these metrics:
Response distribution matching: Do synthetic users produce response percentages similar to real participants? (E.g., if 62% of real consumers prefer Option A, do synthetic users land around 60-65%?)
Rank-order correlation: Do synthetic users rank product concepts in the same order as real consumers? (Most important → least important)
Statistical similarity (KS score): How closely do the full response distributions match, typically assessed with a Kolmogorov-Smirnov-style statistic? (Most studies report >0.85 similarity)
Predictive accuracy: When synthetic users predict outcomes (like purchase likelihood), how often are they directionally correct?
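As a rough illustration of the rank-order and KS metrics above, the sketch below (with invented scores) computes a Spearman rank correlation between how the real and synthetic panels order five concepts, and a Kolmogorov-Smirnov-based similarity between two rating distributions. It assumes "KS score" means one minus the KS statistic; individual studies may define it differently.

```python
import numpy as np
from scipy.stats import spearmanr, ks_2samp

# Invented mean purchase-intent scores for five product concepts.
human_scores     = {"A": 4.1, "B": 3.6, "C": 3.4, "D": 2.9, "E": 2.2}
synthetic_scores = {"A": 4.3, "B": 3.8, "C": 3.1, "D": 3.3, "E": 2.6}  # C and D swap places

concepts = list(human_scores)
rho, _ = spearmanr([human_scores[c] for c in concepts],
                   [synthetic_scores[c] for c in concepts])
print(f"rank-order correlation (Spearman rho): {rho:.2f}")  # 1.0 = identical ordering

# Invented individual-level Likert ratings for one concept.
rng = np.random.default_rng(1)
human_ratings = rng.integers(1, 6, size=500)
synthetic_ratings = np.clip(np.round(rng.normal(3.4, 0.9, size=500)), 1, 5)

ks_stat = ks_2samp(human_ratings, synthetic_ratings).statistic
print(f"KS-based similarity (1 - KS statistic): {1 - ks_stat:.2f}")
```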
It's important to note: Synthetic users don't perfectly replicate human variance. They tend to produce responses that are:
- Slightly more positive than humans (less negativity bias)
- Slightly less variable (narrower distribution)
- More consistent (less random noise)
 
This doesn't mean they're "wrong"—it means they're smoothed versions of human patterns. For directional insights ("Which concept performs better?"), this is fine. For capturing the full messiness of human opinion, real participants are still essential.
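A hypothetical illustration of why smoothed responses can still support directional calls: in the sketch below, the synthetic panel is both more positive and less variable than the human panel, yet it still ranks Concept A above Concept B, which is the decision that matters at this stage. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented 1-5 purchase-intent ratings for two concepts.
# Human panels: wider spread. Synthetic panels: shifted up and compressed.
human_a = np.clip(rng.normal(3.6, 1.1, 400), 1, 5)
human_b = np.clip(rng.normal(3.1, 1.1, 400), 1, 5)
synthetic_a = np.clip(rng.normal(3.9, 0.6, 400), 1, 5)
synthetic_b = np.clip(rng.normal(3.4, 0.6, 400), 1, 5)

for label, a, b in [("human", human_a, human_b), ("synthetic", synthetic_a, synthetic_b)]:
    print(f"{label:9s}  A: {a.mean():.2f} (sd {a.std():.2f})"
          f"  B: {b.mean():.2f} (sd {b.std():.2f})"
          f"  winner: {'A' if a.mean() > b.mean() else 'B'}")
```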