Over the past year, I've been circling a question that feels important for where design and research are heading: can synthetic interviews give us something useful in the early stages of understanding a problem space? Not as a gimmick, and not as a replacement for real conversations, but as a practical part of shaping hypotheses.
Recently, I had the right conditions to try this properly. I was already running research with real participants, and I had enough background material to build credible personas. So I created a parallel synthetic track, running the same discussion guide through my own setup, transcript by transcript. Two studies side by side—one human, one simulated.
I built my own rig rather than using one of the platforms that have emerged in this space. I wanted to get my fingers into the soil—understand the pitfalls and benefits directly—especially since design is such a fluid, problem-fitting exercise. The value of a flexible setup that can adapt to different test types felt higher than convenience, so long as I could prove it generates valuable input.
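For the curious, here's a minimal sketch of what such a rig can look like, using the Anthropic Python SDK. The model and temperature match the technical details at the end of this report; the persona brief, discussion guide, and prompt wording are placeholders, not my exact setup.

```python
# Minimal sketch of a synthetic interview rig (Anthropic Python SDK).
# persona_brief and discussion_guide are placeholders for the real inputs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

persona_brief = "..."        # persona grounded in prior background research
discussion_guide = ["..."]   # the same questions asked of human participants

system_prompt = (
    "You are the research participant described below. Answer every question "
    "in the first person, in a natural spoken register.\n\n" + persona_brief
)

history = []
for question in discussion_guide:
    history.append({"role": "user", "content": question})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # model noted in the technical details
        temperature=0.3,                   # temperature noted in the technical details
        max_tokens=1024,
        system=system_prompt,
        messages=history,
    )
    answer = response.content[0].text
    history.append({"role": "assistant", "content": answer})  # keep conversational context
```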
This past weekend I cleaned the data and compared the two sets, building a Python-generated report looking at thematic coverage, emotional tone, specificity, and linguistic markers worth paying attention to. The point wasn't to prove anything definitive, but to see whether there's enough correlation to argue there's value here for designers who want to move faster without cutting corners.
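Structurally, the report reduces to one pattern: load both corpora, score every transcript on a metric, compare the two distributions. A sketch of that skeleton, with hypothetical paths and helper names:

```python
# Skeleton of the comparison report (paths and helper names are hypothetical).
from pathlib import Path

def load_corpus(folder: str, pattern: str) -> list[str]:
    """Read every transcript in a folder into a list of strings."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob(pattern))]

human = load_corpus("transcripts/human", "*.txt")         # Otter.ai exports
synthetic = load_corpus("transcripts/synthetic", "*.md")  # LLM-generated transcripts

def compare(metric, human: list[str], synthetic: list[str]) -> tuple[float, float]:
    """Score every transcript with `metric`, return the two corpus means."""
    h = [metric(t) for t in human]
    s = [metric(t) for t in synthetic]
    return sum(h) / len(h), sum(s) / len(s)
```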
I went in with a strong view: synthetic research isn't a replacement for talking to real people, but it might widen the aperture of what we already think we know. There's always that early stage where the team operates on half-formed assumptions about motivations, barriers, and emotional patterns. We chip away at those a few interviews at a time, refining questions as we go. What I wanted to test was whether synthetic interviews, grounded in good persona research, could accelerate that exploration—surfacing themes in advance, pressure-testing hypotheses so the real qualitative work becomes more focused.
What stands out is that synthetic interviews track the core thematic structure surprisingly closely, but with a different texture. They're confident, fluent, tidy. Real people are contradictory, uncertain, occasionally messy. So the role I see emerging isn't synthetic versus human, but layered. Start with a hypothesis, build personas from what you know, run a synthetic round to see where the cracks form, then sharpen your questions for real conversations. That feels closer to quantitative-before-qualitative thinking, except through narrative instead of numbers.
There's also the possibility of using synthetic interviews to simulate groups that are harder to recruit, creating a proxy view before spending the time and money finding them. If that works, we might treat insight as existing in different proportions—fully human, blended, or fully synthetic—each with its own strengths and blind spots.
I don't think synthetic interviews add new human insight on their own. What they could do is raise the quality of the early stage by making real interviews smarter. And perhaps the interesting question isn't "can synthetic replace human?" but "what mix gets us to better design decisions sooner?"
Key Outcomes
Excellent Fidelity
In this evaluation the synthetic research closely mirrors human patterns. The overall quality is high enough to be useful for certain research applications.
| Dimension | Human | Synthetic | Verdict | Interpretation |
|---|---|---|---|---|
| Thematic Coverage: theme correlation (r) | 0.80 | 16/18 themes shared | Strong | Synthetic reliably identifies the same core topics. Trust it for initial theme exploration. |
| Sentiment & Emotion: average sentiment (-1 to +1) | 0.15 | 0.14 | Match | Nearly identical emotional tone. Neither overly positive nor artificially neutral. |
| Specificity & Detail: specificity score (0-10) | 9.5 | 9.3 | Match | Both highly specific. Synthetic provides concrete answers, not vague generalisations. |
| Linguistic Authenticity: hesitation markers (avg) | 168 | 40 | Gap | Synthetic is roughly 4× more polished, lacking natural "ums" and conversational hesitations. |
Where the Synthetic Research Falls Short
| Gap Type | Human | Synthetic | Impact |
|---|---|---|---|
| Generational Perspectives (financial model shifts across generations) | 20% | 0% | Completely missing intergenerational comparisons |
| Cash Flow Urgency (timing and cash flow pressures) | 70% | 30% | Underrepresents real-world financial stress |
| Theme Intensity (coverage depth per theme) | 1× | 2-4× | Over-covers themes vs natural human depth |
Bottom Line: In this case, it appears that synthetic research is a strong complement for theme exploration and hypothesis generation, but human research remains essential for capturing generational perspectives, real-world urgency, and authentic conversational nuance.
Thematic Analysis
Thematic Coverage Comparison: This radar chart shows the prevalence of each theme across human and synthetic transcripts. Areas where the shapes overlap indicate strong thematic alignment; divergent areas highlight where synthetic research may over- or under-represent topics.
Theme Prevalence Correlation: Each point represents a theme—its position shows how often it appeared in human (x-axis) vs synthetic (y-axis) transcripts. Points near the diagonal line indicate strong correlation. Points far from the line show themes where synthetic research diverges from human patterns.
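The correlation behind that scatter is a plain Pearson r over per-theme prevalence. A sketch with illustrative numbers (not the study's real counts):

```python
# Pearson correlation over per-theme prevalence (numbers are illustrative).
from scipy.stats import pearsonr

human_prevalence = {
    "budgeting": 0.9, "savings goals": 0.8, "debt": 0.5,
    "financial stress": 0.7, "tools & apps": 0.6,
}
synthetic_prevalence = {
    "budgeting": 0.9, "savings goals": 0.7, "debt": 0.6,
    "financial stress": 0.3, "tools & apps": 0.6,
}

themes = sorted(human_prevalence)
x = [human_prevalence[t] for t in themes]      # human axis of the scatter
y = [synthetic_prevalence[t] for t in themes]  # synthetic axis of the scatter
r, p = pearsonr(x, y)  # the report's headline figure was r = 0.80 over 18 themes
print(f"theme correlation r = {r:.2f} (p = {p:.2f})")
```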
Sentiment Analysis
Sentiment Distribution: Box plots showing the range of emotional tones across all transcripts. The left chart compares overall sentiment scores (-1 = negative, +1 = positive). The right chart shows within-transcript variance—high variance indicates the respondent expressed a mix of positive and negative sentiments.
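In this study the sentiment scoring was done by the LLM itself; as a stand-in, an open-source scorer such as NLTK's VADER produces the same kind of -1 to +1 compound scores, which is enough to sketch both charts' inputs:

```python
# Sentiment stand-in using NLTK's VADER (the study itself scored with an LLM).
# Requires: pip install nltk, then nltk.download("vader_lexicon")
import statistics
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def sentence_scores(text: str) -> list[float]:
    """Compound score (-1 negative .. +1 positive) for each rough sentence."""
    sentences = [s for s in text.split(".") if s.strip()]
    return [sia.polarity_scores(s)["compound"] for s in sentences]

def transcript_sentiment(text: str) -> float:
    """Overall sentiment: mean compound score across the transcript."""
    return statistics.mean(sentence_scores(text))

def sentiment_variance(text: str) -> float:
    """Within-transcript variance: how much the tone swings inside one interview."""
    return statistics.pvariance(sentence_scores(text))
```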
Specificity & Concreteness
Specificity Comparison: Measures how concrete and detailed the responses are. The specificity score (0-10) assesses whether responses include specific examples vs vague generalisations. The named-entity count tracks mentions of specific brands, products, people, or places.
Linguistic Markers: Measures natural speech patterns. Hesitation markers ("um", "uh", "you know") indicate authentic conversation. Vocabulary diversity measures word variety. Sentence length shows conciseness. Large gaps suggest synthetic text lacks conversational naturalness.
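These markers are cheap to compute directly from the text. A sketch, with an illustrative (not exhaustive) hesitation list:

```python
# Direct counts for the linguistic markers (hesitation list is illustrative).
import re

HESITATIONS = re.compile(r"\b(um+|uh+|erm+|you know|i mean|sort of|kind of)\b", re.IGNORECASE)

def hesitation_count(text: str) -> int:
    """How many filler/hesitation markers appear in the transcript."""
    return len(HESITATIONS.findall(text))

def vocabulary_diversity(text: str) -> float:
    """Type-token ratio: distinct words over total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

def mean_sentence_length(text: str) -> float:
    """Average words per sentence, split on terminal punctuation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0.0
```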
Methodology & Limitations
Research Context
The study set out to understand which of our organising principles help people manage their money most effectively, and what that might mean for a digital experience designed around customers' mental models rather than FS verticals. The evaluation compares synthetic interview transcripts (generated from detailed personas using an LLM) against genuine human interview transcripts from the same research study.
Data Sources
| Corpus | Source | n | Format |
|---|---|---|---|
| Human | Real interviews conducted Nov 2025 | 10 | Otter.ai transcripts (.txt) |
| Synthetic | LLM-generated from detailed personas | 10 | Markdown transcripts (.md) |
Limitations & Caveats
Critical Limitation - Sample Size: With n=10 per group, statistical tests have low power (~0.3-0.4 for medium effects). P-values should be interpreted with extreme caution. This analysis is best viewed as exploratory/descriptive rather than confirmatory.
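One way to sanity-check that power caveat, assuming a two-sided two-sample t-test (the exact figure depends on the test and sidedness assumed); with n = 10 per group, power stays well below the conventional 0.8 target even for large effects:

```python
# Sanity check on statistical power, assuming a two-sided two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.5, 0.8):  # conventional "medium" and "large" effect sizes
    power = analysis.power(effect_size=d, nobs1=10, alpha=0.05, ratio=1.0)
    print(f"Cohen's d = {d}: power = {power:.2f}")
```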
| Limitation | Severity | Implications |
|---|---|---|
| LLM-on-LLM Evaluation | High | Claude is evaluating content generated by Claude. The model may systematically rate AI-generated text more favourably. |
| Small Sample Size | High | n=10 per group is insufficient for robust statistical inference. Confidence intervals are wide. |
| Domain Specificity | Medium | Findings are specific to personal finance/banking research. May differ in other domains. |
| Persona Design | Medium | Synthetic transcript quality depends heavily on persona design. Different approaches may yield different fidelity. |
Technical Details: Model: claude-sonnet-4-20250514 | Temperature: 0.3 | Total API tokens: ~223,000