Synthetic Users Are Getting Real

TL;DR - A new method called Semantic Similarity Rating (SSR) makes synthetic respondents far more realistic, delivering human-like rankings, lifelike distributions, and explainable insights—so teams can iterate fast and validate smarter.

Introduction


For years, the idea of synthetic respondents (AI agents that mimic human survey participants) has hovered somewhere between hype and hope.

The promise was clear — faster, cheaper insights without the delays or costs of traditional panels. But the reality? Synthetic respondents never quite behaved like real people. Their answers were too uniform, too safe, too average.

A recent study presents a breakthrough method that brings synthetic respondents dramatically closer to the way real consumers think, respond, and express opinions in surveys.

The Problem with Traditional AI Surveys


Likert scales (1–5 ratings) are the bedrock of consumer research, used to measure purchase intent, brand favorability, and satisfaction.

However, researchers are well aware that Likert-scale studies have inherent limitations, as responses can be influenced by people taking the easiest option, agreeing with statements just to be polite, or giving overly positive answers.

As a result, even traditional consumer panels, despite requiring significant investment, often produce noisy and unreliable measurements of actual demand.

LLMs to Simulate Real Consumer Behavior


With LLMs becoming more capable, many teams have been trying to augment human survey panels with synthetic consumers.

By conditioning LLMs on demographic or attitudinal personas and exposing them to the same survey instruments, researchers have begun exploring whether such synthetic samples can recover human-like patterns of response.

But when LLMs are asked to “rate on a scale of 1 to 5,” something strange happens.

Instead of showing the natural spread of human responses (some people love a product, others don’t), AI models cluster around the middle. They almost always choose 3s and 4s, rarely 1s or 5s. In statistical terms, the variance collapses—the data becomes flat and lifeless.

This is the problem the study sets out to fix.

A New Way: Make AI Explain Itself


Instead of forcing a number, the researchers asked the model to describe its intent first.

They prompted the model with realistic product descriptions (such as a new snack concept, a skincare line, or a consumer device) and asked it to respond as if it were a real consumer:

“How likely would you be to purchase this product? Please explain your reasoning briefly.”

The model’s response wasn’t a number, but a free-text rationale—a few sentences describing its intent, e.g.,

“I would probably try this product since I often buy similar snacks and it sounds healthy.”

Then came the clever part: Semantic Similarity Rating (SSR)

Each rationale was compared against five anchor statements aligned to the Likert scale—from “Definitely would not purchase” (1) to “Definitely would purchase” (5).

Using semantic embeddings, the system measured how similar the rationale was to each anchor and mapped it to a probability distribution across 1–5. Instead of a flat score, each answer became a statistical fingerprint that reflected the strength and direction of sentiment.
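To make the mapping concrete, here is a minimal sketch of the SSR step in Python. It assumes a sentence-embedding model from the sentence-transformers library, illustrative anchor wordings, and a softmax with a small temperature; the paper's exact anchors, embedding model, and normalization scheme may differ.

```python
# Minimal SSR sketch: map a free-text rationale to a probability
# distribution over Likert points 1-5 via similarity to anchor statements.
# Anchor wordings, model choice, and temperature are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = [
    "I definitely would not purchase this product.",   # 1
    "I probably would not purchase this product.",     # 2
    "I might or might not purchase this product.",     # 3
    "I probably would purchase this product.",         # 4
    "I definitely would purchase this product.",       # 5
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def ssr_distribution(rationale: str, temperature: float = 0.05) -> np.ndarray:
    """Return a probability distribution over the five Likert points."""
    vecs = encoder.encode([rationale] + ANCHORS, normalize_embeddings=True)
    sims = vecs[0] @ vecs[1:].T          # cosine similarity to each anchor
    logits = sims / temperature          # sharpen before normalizing
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = ssr_distribution(
    "I would probably try this product since I often buy similar snacks "
    "and it sounds healthy."
)
expected_rating = float(np.dot(probs, np.arange(1, 6)))  # point estimate on 1-5
```

The expected value of that distribution can serve as a point rating, while the full distribution is what makes the distributional comparisons below possible.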

The Experiment


To test the approach, the team ran 57 separate studies across diverse consumer categories, covering over 9,300 real human respondents.

They then recreated those studies synthetically using LLMs, comparing three methods:

  1. Direct Likert Rating (DLR): rate 1–5 directly.
  2. Free Likert Rating (FLR): open-ended rating plus rationale.
  3. Semantic Similarity Rating (SSR): rationale mapped to Likert anchors via embeddings.
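For illustration, the three elicitation styles might correspond to prompt templates like the ones below; the exact wording used in the study is an assumption here.

```python
# Illustrative (assumed) prompt templates for the three elicitation styles.
DLR_PROMPT = (
    "On a scale of 1 to 5, how likely would you be to purchase this product? "
    "Answer with a single number.\n\nProduct: {concept}"
)
FLR_PROMPT = (
    "On a scale of 1 to 5, how likely would you be to purchase this product? "
    "Give your rating and briefly explain your reasoning.\n\nProduct: {concept}"
)
SSR_PROMPT = (
    "How likely would you be to purchase this product? "
    "Please explain your reasoning briefly.\n\nProduct: {concept}"
)
```

Only the SSR variant withholds the numeric scale entirely and relies on the anchor-similarity mapping described above.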

The benchmark was simple: could any AI method reproduce the patterns seen in human data—both in concept rankings and in distribution shape?

Results: A Leap in Realism

1. Human-like rankings


When concepts were ranked by mean purchase intent, SSR reached nearly 90% of the correlation that two independent human samples achieve with each other.

Put plainly, SSR-based synthetic respondents were nearly as consistent with human outcomes as humans are with one another.
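A minimal sketch of that kind of comparison, assuming toy concept-level means and a Spearman rank correlation (the study's exact correlation measure and human test-retest baseline are not reproduced here):

```python
# Sketch: how consistently do synthetic and human panels rank the same concepts?
# The numbers are made up; only the mechanics of the check are illustrated.
from scipy.stats import spearmanr

human_means = {"concept_a": 3.8, "concept_b": 2.9, "concept_c": 3.4}
synthetic_means = {"concept_a": 3.6, "concept_b": 3.0, "concept_c": 3.3}

concepts = sorted(human_means)
rho, _ = spearmanr(
    [human_means[c] for c in concepts],
    [synthetic_means[c] for c in concepts],
)
print(f"Rank correlation across concepts: {rho:.2f}")
```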

2. Realistic distributions


Beyond averages, SSR captured the spread of opinions. Using the Kolmogorov–Smirnov (KS) statistic to compare distributions, SSR performed three to four times better than direct 1–5 ratings.

Rather than bland bell curves, SSR reproduced the peaks, dips, and polarization you see in real panels.
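As a rough sketch, the KS distance between two 1–5 response distributions is simply the largest gap between their cumulative shares; the response counts below are made up for illustration.

```python
# Sketch: KS distance between a human and a synthetic 1-5 response distribution.
# Counts are invented for illustration; lower distance means closer distributions.
import numpy as np

human_counts = np.array([10, 18, 25, 30, 17])     # respondents answering 1..5
synthetic_counts = np.array([2, 8, 45, 40, 5])    # a middle-heavy synthetic panel

def ks_distance(a: np.ndarray, b: np.ndarray) -> float:
    cdf_a = np.cumsum(a) / a.sum()
    cdf_b = np.cumsum(b) / b.sum()
    return float(np.max(np.abs(cdf_a - cdf_b)))

print(f"KS distance: {ks_distance(human_counts, synthetic_counts):.2f}")
```

The smaller this distance, the closer the synthetic panel's shape is to the human one.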

3. The “Why” behind the score


Every SSR response includes a rationale. These qualitative snippets enable sentiment analysis, theme extraction, and message diagnostics at scale.

For example, in food concept tests, synthetic respondents didn’t just rate intent; they surfaced taste cues, perceived health benefits, price sensitivity, and usage occasions.
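As a toy illustration of what the rationale text enables, even a simple keyword tagger can surface such themes; the theme names and keyword lists below are assumptions, not the study's coding scheme, and a production pipeline would more likely use an LLM or a trained classifier.

```python
# Toy theme tagger over SSR rationales. Themes and keywords are illustrative.
from collections import Counter

THEMES = {
    "taste": ["tasty", "flavor", "delicious"],
    "health": ["healthy", "protein", "sugar"],
    "price": ["price", "expensive", "cheap"],
}

def tag_themes(rationales: list[str]) -> Counter:
    counts = Counter()
    for text in rationales:
        lowered = text.lower()
        for theme, keywords in THEMES.items():
            if any(word in lowered for word in keywords):
                counts[theme] += 1
    return counts

print(tag_themes([
    "Sounds healthy and I like the flavor options.",
    "A bit expensive for a snack, but the protein content is appealing.",
]))
```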

4. Matching human subgroup patterns


The researchers also explored whether AI could reproduce differences across demographic subgroups—such as age, gender, and income.

When they conditioned prompts with demographic cues (e.g., “Respond as a 45-year-old woman earning $80,000”), SSR captured consistent directional differences similar to those observed in human data.

It wasn’t perfect—finer nuances like regional variations were harder to mirror—but age and income effects translated reasonably well.
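A minimal sketch of that kind of persona conditioning, with an assumed prompt template and persona fields (the study's actual template is not reproduced here):

```python
# Illustrative persona-conditioned prompt; fields and wording are assumptions.
PERSONA_PROMPT = (
    "You are answering a consumer survey as a {age}-year-old {gender} "
    "with an annual household income of about ${income:,}.\n\n"
    "Product: {concept}\n\n"
    "How likely would you be to purchase this product? "
    "Please explain your reasoning briefly."
)

prompt = PERSONA_PROMPT.format(
    age=45,
    gender="woman",
    income=80_000,
    concept="A high-protein snack bar with no added sugar.",
)
```

Each persona's rationale is then scored with the same anchor-similarity mapping, so subgroup-level distributions can be compared against the corresponding human cuts.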

At-a-Glance Metrics

Metric                              | Human Baseline | Direct Likert | SSR Method
Rank Correlation with Human Data    | 100%           | ~60%          | ~90%
Distribution Similarity (KS Index)  | 1.0 (perfect)  | ~0.3          | ~0.85
Qualitative Insights                | None           | Minimal       | Rich rationales
Cost / Time per Study               | High           | Low           | Low

Why It Matters for Enterprise Teams


For CPG, Retail, BFSI, Healthcare, and Tech organizations, where multi-country concept tests can take a few months, SSR compresses timelines to hours at a fraction of the cost.

The real advantage is iteration. Teams can pre-test dozens of ideas overnight, shortlist winners, and invest human-panel budgets where they matter.

SSR isn’t a replacement for human research; it’s a scalable simulation layer that de-risks spend and accelerates learning.

Conclusion


Synthetic respondents have finally taken a meaningful step toward realism. By asking models to explain themselves and translating those explanations into probabilistic ratings, SSR delivers human-like rankings, lifelike distributions, and built-in qualitative insights.

Use it to explore widely, iterate quickly, and reserve your human panels for decisive validation.