Synthetic vs Human Research Evaluation

Comparative analysis of qualitative research transcripts

Over the past year, I've been circling a question that feels important for where design and research are heading: can synthetic interviews give us something useful in the early stages of understanding a problem space? Not as a gimmick, and not as a replacement for real conversations, but as a practical part of shaping hypotheses. Recently, I had the right conditions to try this properly. I was already running research with real participants, and I had enough background material to build credible personas. So I created a parallel synthetic track, running the same discussion guide through my own setup, transcript by transcript. Two studies side by side—one human, one simulated. I built my own rig rather than using one of the platforms that have emerged in this space. I wanted to get my fingers into the soil—understand the pitfalls and benefits directly—especially since design is such a fluid, problem-fitting exercise. The value of a flexible setup that can adapt to different test types felt higher than convenience, so long as I could prove it generates valuable input.

This past weekend I cleaned the data and compared the two sets, building a Python-generated report looking at thematic coverage, emotional tone, specificity, and linguistic markers worth paying attention to. The point wasn't to prove anything definitive, but to see whether there's enough correlation to argue there's value here for designers who want to move faster without cutting corners.

I went in with a strong view: synthetic research isn't a replacement for talking to real people, but it might widen the aperture of what we already think we know. There's always that early stage where the team operates on half-formed assumptions about motivations, barriers, and emotional patterns. We chip away at those a few interviews at a time, refining questions as we go. What I wanted to test was whether synthetic interviews, grounded in good persona research, could accelerate that exploration—surfacing themes in advance, pressure-testing hypotheses so the real qualitative work becomes more focused.

What stands out is that synthetic interviews track the core thematic structure surprisingly closely, but with a different texture. They're confident, fluent, tidy. Real people are contradictory, uncertain, occasionally messy. So the role I see emerging isn't synthetic versus human, but layered. Start with a hypothesis, build personas from what you know, run a synthetic round to see where the cracks form, then sharpen your questions for real conversations. That feels closer to quantitative-before-qualitative thinking, except through narrative instead of numbers. There's also the possibility of using synthetic interviews to simulate groups that are harder to recruit, creating a proxy view before spending the time and money finding them. If that works, we might treat insight as existing in different proportions—fully human, blended, or fully synthetic—each with its own strengths and blind spots. I don't think synthetic interviews add new human insight on their own. What they could do is increase the quality of the early stage by making real interviews smarter. And perhaps the interesting question isn't "can synthetic replace human?" but "what mix gets us to better design decisions sooner?"

How valuable are AI-generated interviews for qualitative research?

Below are the findings from an experiment to find out: 10 real user interviews from a qualitative research study on financial management, compared against 10 synthetic interviews created using AI personas (based on existing research). This report compares them head-to-head.

The question: If you're conducting early-stage research, can synthetic interviews give you useful insights—or will they lead you astray? We measured thematic coverage, emotional tone, specificity, and linguistic authenticity to find out.

The original synthetic testing framework was built as a DIY setup using Python and Claude (Anthropic's LLM). The process was carefully designed to mirror established qualitative research practices, and each transcript was created and analysed in complete isolation to minimise the risk of data contamination.
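
For context, the generation side of that setup is not much more than prompting the model with a persona brief and the discussion-guide questions, one call per turn. The sketch below illustrates that pattern using the Anthropic Python SDK and the model named later in the methodology; the persona brief, question, and temperature are hypothetical placeholders rather than the actual study materials.

```python
# Minimal sketch of one synthetic interview turn (assumes the Anthropic Python SDK).
# The persona brief, question, and temperature are illustrative placeholders only.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

PERSONA_BRIEF = (
    "You are 'Sam', a renter in your early thirties who keeps several bank "
    "accounts for different purposes. Answer interview questions in the first "
    "person, staying in character and drawing only on this background."
)

question = "Talk me through how you organise your money day to day."

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # model named in the methodology section
    max_tokens=800,
    temperature=0.7,                    # assumed generation setting; analysis runs used 0.3
    system=PERSONA_BRIEF,
    messages=[{"role": "user", "content": question}],
)

print(response.content[0].text)         # one PERSONA turn for the transcript
```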

About This Evaluation

Goal: This evaluation seeks to add insight to a fundamental question for modern research practice: Can synthetic (AI-generated) qualitative research reliably represent real human perspectives? We compared 10 synthetic interview transcripts, generated from detailed personas using an LLM, against 10 genuine human interview transcripts from the same research study on financial management behaviours.

What We Evaluated

Thematic Coverage

Do synthetic responses discuss the same topics and themes that emerge naturally in human interviews?

Sentiment & Emotion

Do synthetic responses exhibit realistic emotional tones and variance, or do they trend towards neutrality?

Specificity & Detail

Do synthetic responses include concrete examples, named entities, and specific details like humans do?

Linguistic Authenticity

Do synthetic responses sound natural, with hesitations, varied vocabulary, and conversational patterns?

Why This Matters

Qualitative research is resource-intensive—recruiting participants, conducting interviews, and analysing transcripts takes significant time and budget. If synthetic research can reliably capture human perspectives, it could accelerate early-stage research, enable rapid hypothesis testing, and supplement smaller human samples. However, if synthetic responses systematically differ from human ones, using them could lead to flawed insights and misguided product decisions. This evaluation provides an evidence-based assessment of where synthetic research excels and where it falls short.

Key Outcomes

Fidelity Score
85.1/100

Excellent Fidelity

In this evaluation, the synthetic research closely mirrors human patterns. The overall quality is high enough to be useful for certain research applications.

Dimension Human Synthetic Verdict Interpretation
Thematic Coverage (theme correlation, r) 0.80 between corpora; 16/18 themes shared Strong Synthetic reliably identifies the same core topics. Trust it for initial theme exploration.
Sentiment & Emotion (average sentiment, -1 to +1) 0.15 0.14 Match Nearly identical emotional tone. Neither overly positive nor artificially neutral.
Specificity & Detail (specificity score, 0-10) 9.5 9.3 Match Both highly specific. Synthetic provides concrete answers, not vague generalisations.
Linguistic Authenticity (hesitation markers, avg) 168 40 Gap Synthetic is 4× more polished. Lacks natural "ums" and conversational hesitations.

Where The Synthetic Research Falls Short

Gap Type Human Synthetic Impact
Generational Perspectives (financial model shifts across generations) 20% 0% Completely missing intergenerational comparisons
Cash Flow Urgency (timing and cash flow pressures) 70% 30% Underrepresents real-world financial stress
Theme Intensity (coverage depth per theme) 2-4× (synthetic vs human) Over-covers themes vs natural human depth

Bottom Line: In this case, it appears that synthetic research is a strong complement for theme exploration and hypothesis generation, but human research remains essential for capturing generational perspectives, real-world urgency, and authentic conversational nuance.

Key Metrics at a Glance

Side-by-side comparison of human and synthetic research metrics

Metric Human Synthetic Difference Verdict
Theme Correlation (Pearson r coefficient) 0.80 between corpora Strong
Average Sentiment (-1 negative to +1 positive) 0.15 0.14 −0.01 Match
Specificity Score (0-10 scale) 9.5 9.3 −0.2 Match
Named Entities (avg count per transcript) 8.8 7.1 −1.7 Minor gap
Hesitation Markers (um, uh, you know, like...) 168 40 −76% Large gap
Vocabulary Diversity (type-token ratio) 0.22 0.29 +32% Synthetic higher
Avg Sentence Length (words per sentence) 17.7 14.3 −19% Synthetic concise

Thematic Analysis

Thematic Coverage Comparison: This radar chart shows the prevalence of each theme across human and synthetic transcripts. Areas where the shapes overlap indicate strong thematic alignment; divergent areas highlight where synthetic research may over- or under-represent topics.

Theme Prevalence Correlation: Each point represents a theme—its position shows how often it appeared in human (x-axis) vs synthetic (y-axis) transcripts. Points near the diagonal line indicate strong correlation. Points far from the line show themes where synthetic research diverges from human patterns.

Interpretation: Strong positive correlation - synthetic themes closely match human distribution

Theme Breakdown

Theme Source Human % Synthetic % Difference
Purpose-Based Money Organisation both 80.0% 100.0% -20.0%
Interest Rate Chasing and Account Switching both 60.0% 40.0% 20.0%
Multi-Account Financial Organisation both 100.0% 100.0% 0.0%
Economic Uncertainty and Financial Anxiety both 30.0% 10.0% 20.0%
Timing and Cash Flow Management both 70.0% 30.0% 40.0%
Visual and Spatial Money Management Models both 60.0% 80.0% -20.0%
Joint and Separate Account Management both 10.0% 20.0% -10.0%
Mobile Banking App Dependency both 100.0% 80.0% 20.0%
Automated Spending Insights and Analytics both 50.0% 20.0% 30.0%
Preference for Detailed Control and Tracking both 70.0% 70.0% 0.0%
Financial Stress and Constant Mental Load both 50.0% 50.0% 0.0%
Generational Financial Model Shifts human 20.0% 0.0% 20.0%

Sentiment Analysis

Sentiment Distribution: Box plots showing the range of emotional tones across all transcripts. The left chart compares overall sentiment scores (-1 = negative, +1 = positive). The right chart shows within-transcript variance—high variance indicates the respondent expressed a mix of positive and negative sentiments.

Interpretation: Overall sentiment levels are similar. Synthetic shows more emotional variation than human

Specificity & Concreteness

Specificity Comparison: Measures how concrete and detailed the responses are. Specificity score (0-10) assesses whether responses include specific examples vs vague generalisations. Named entities count tracks mentions of specific brands, products, people, or places.

Linguistic Markers: Measures natural speech patterns. Hesitation markers ("um", "uh", "you know") indicate authentic conversation. Vocabulary diversity measures word variety. Sentence length shows conciseness. Large gaps suggest synthetic text lacks conversational naturalness.

Interpretation: Specificity levels are comparable between sources

Methodology & Limitations

1. Research Context

The underlying study set out to understand which organising principles help people manage their money most effectively, and what that might mean for a digital experience designed around customers' mental models rather than financial services (FS) verticals. The evaluation compares synthetic interview transcripts (generated from detailed personas using an LLM) against genuine human interview transcripts from the same research study.

2. Data Sources

Corpus Source n Format Preprocessing
Human Real interviews conducted Nov 2025 10 Otter.ai transcripts (.txt) Speaker parsing, timestamp removal, end-of-interview detection, turn segmentation
Synthetic LLM-generated from detailed personas 10 Markdown transcripts (.md) Speaker marker parsing (INTERVIEWER:/PERSONA:), turn segmentation
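
As a rough illustration of the preprocessing column above, the sketch below splits a synthetic markdown transcript into speaker turns on the INTERVIEWER:/PERSONA: markers. The exact file layout beyond those markers is an assumption, and the human Otter.ai transcripts (timestamp removal, speaker parsing, end-of-interview detection) would need their own handling.

```python
# Sketch of turn segmentation for the synthetic corpus; file layout is assumed.
import re
from pathlib import Path

SPEAKER_RE = re.compile(r"^(INTERVIEWER|PERSONA):\s*", re.MULTILINE)

def parse_synthetic_transcript(path: str) -> list[tuple[str, str]]:
    """Split a markdown transcript into (speaker, text) turns."""
    raw = Path(path).read_text(encoding="utf-8")
    # re.split with a capturing group yields [preamble, speaker, text, speaker, text, ...]
    parts = SPEAKER_RE.split(raw)
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        cleaned = " ".join(text.split())  # collapse whitespace within a turn
        if cleaned:
            turns.append((speaker, cleaned))
    return turns

# Hypothetical usage:
# turns = parse_synthetic_transcript("synthetic/persona_01.md")
# participant_text = " ".join(text for speaker, text in turns if speaker == "PERSONA")
```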

3. Analysis Pipeline

The evaluation followed a five-stage pipeline with strict isolation constraints:

1. Preprocessing: normalise transcripts
2. Codebook generation: extract and merge themes
3. Individual analysis: isolated per-transcript analysis
4. Aggregation: statistical comparison
5. Visualisation: charts and report

4. Codebook Generation

Themes were generated using a three-step isolated process to prevent corpus contamination:

  1. Human theme extraction: ~8,000 word sample from human transcripts → LLM identifies 12-15 themes (isolated API call)
  2. Synthetic theme extraction: ~8,000 word sample from synthetic transcripts → LLM identifies 12-15 themes (separate isolated API call)
  3. Codebook merge: Both theme sets merged by LLM, combining overlapping concepts and preserving unique themes with source attribution

Model: Claude claude-sonnet-4-20250514 | Temperature: 0.3 (analytical) | Final codebook: 18 themes
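
The isolation constraint is mostly a matter of how the calls are structured: one theme-extraction call per corpus, each with a fresh context, followed by a merge call that only ever sees the two theme lists. A minimal sketch of one such extraction call, assuming the Anthropic Python SDK, follows; the prompt wording and function name are illustrative, not the report's actual code.

```python
# Sketch of one isolated theme-extraction call; prompt wording is illustrative.
import anthropic

client = anthropic.Anthropic()

def extract_themes(sample_text: str, corpus_label: str) -> str:
    prompt = (
        f"You are a qualitative researcher. From the {corpus_label} interview "
        "sample below, identify 12-15 distinct themes. Give each a short name "
        "and a one-line definition.\n\n" + sample_text
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        temperature=0.3,  # analytical temperature stated above
        messages=[{"role": "user", "content": prompt}],  # fresh context per call
    )
    return response.content[0].text

# human_themes = extract_themes(human_sample, "human")               # isolated call
# synthetic_themes = extract_themes(synthetic_sample, "synthetic")   # separate isolated call
# A third call would then merge both lists into the 18-theme master codebook.
```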

5. Individual Transcript Analysis

Critical design principle: Each transcript was analysed in complete isolation to prevent context contamination. Only the master codebook was shared across analyses—no information from one transcript influenced analysis of another.

Metric Method Description
Thematic Coding LLM (Claude) For each theme in codebook, identify: which turns contain it, estimated word count, example quotes. Produces theme prevalence (% of transcripts) and coverage (% of transcript devoted to theme).
Sentiment Scoring LLM (Claude) Score each participant turn on scale of -1 (very negative) to +1 (very positive). Calculate overall mean and within-transcript variance (σ²) to measure emotional range.
Specificity Score LLM (Claude) Composite 0-10 score based on: named entities (max 3 pts), specific examples (max 4 pts), personal anecdotes (max 3 pts). Formula: min(entities/3, 3) + min(examples/2, 4) + min(anecdotes, 3)
Hesitation Markers Regex (Python) Count occurrences of patterns: \bum+\b, \buh+\b, \berm+\b, \byou know\b, \bkind of\b, \bsort of\b, \bi mean\b, \blike\b, \bbasically\b, \bactually\b
Vocabulary Diversity Python calculation Type-Token Ratio (TTR): unique_words / total_words. Higher values indicate more varied vocabulary.
Sentence Length Python calculation Average words per sentence, where sentences are split on [.!?]+ patterns.
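
The non-LLM metrics in the table above are simple enough to reproduce directly. The sketch below follows the same definitions (the listed regex patterns, type-token ratio, and sentence splitting on [.!?]+); it is a minimal reimplementation rather than the report's actual code.

```python
# Minimal reimplementation of the regex/ratio metrics described in the table above.
import re

HESITATION_PATTERNS = [
    r"\bum+\b", r"\buh+\b", r"\berm+\b", r"\byou know\b", r"\bkind of\b",
    r"\bsort of\b", r"\bi mean\b", r"\blike\b", r"\bbasically\b", r"\bactually\b",
]

def hesitation_count(text: str) -> int:
    """Count filler/hesitation markers using the patterns listed above."""
    lowered = text.lower()
    return sum(len(re.findall(pattern, lowered)) for pattern in HESITATION_PATTERNS)

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique words / total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

def avg_sentence_length(text: str) -> float:
    """Average words per sentence, splitting on runs of . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```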

6. Statistical Methods

Theme Correlation

Pearson's product-moment correlation coefficient (r) calculated between human and synthetic theme prevalence vectors:

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Where x = human prevalence %, y = synthetic prevalence % for each theme

Implementation: scipy.stats.pearsonr() | Interpretation: r ≥ 0.8 = strong, 0.6-0.8 = moderate, <0.6 = weak
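
Reproducing that statistic is a single scipy call. The prevalence vectors below are taken from the theme breakdown table earlier in this report, which lists 12 of the 18 codebook themes, so the resulting r will be close to, but not exactly, the headline 0.80.

```python
# Pearson correlation between human and synthetic theme prevalence.
from scipy.stats import pearsonr

# Prevalence (% of transcripts) in the same theme order as the breakdown table;
# only the 12 themes shown there, not the full 18-theme codebook.
human_prevalence     = [80, 60, 100, 30, 70, 60, 10, 100, 50, 70, 50, 20]
synthetic_prevalence = [100, 40, 100, 10, 30, 80, 20, 80, 20, 70, 50, 0]

r, p_value = pearsonr(human_prevalence, synthetic_prevalence)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r >= 0.8 strong, 0.6-0.8 moderate, < 0.6 weak
```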

Group Comparisons

Independent samples t-tests for sentiment and specificity scores:

t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)

H₀: μ_human = μ_synthetic | α = 0.05 (two-tailed)

Implementation: scipy.stats.ttest_ind() | Assumes unequal variances (Welch's t-test)
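
In scipy, the Welch variant is the same function called with equal_var=False. The per-transcript scores below are hypothetical (the report publishes group means of 9.5 and 9.3, not the underlying values), so the output is purely illustrative.

```python
# Welch's t-test on specificity scores; the per-transcript values are hypothetical.
from scipy.stats import ttest_ind

human_spec     = [9.8, 9.2, 9.6, 9.4, 9.7, 9.3, 9.5, 9.6, 9.4, 9.5]  # mean 9.5
synthetic_spec = [9.4, 9.1, 9.5, 9.2, 9.3, 9.4, 9.2, 9.3, 9.1, 9.5]  # mean 9.3

t_stat, p_value = ttest_ind(human_spec, synthetic_spec, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # two-tailed, compare against alpha = 0.05
```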

Fidelity Score Calculation

Composite score (0-100) calculated as weighted sum of four dimensions:

Component Weight Formula Max Points
Thematic 40% max(0, (r + 0.2) / 1.2) × 40 40
Sentiment 20% max(0, (1 - |mean_diff|)) × 20 20
Specificity 20% max(0, (1 - |score_diff| / 10)) × 20 20
Linguistic 20% min(synth_hesit / human_hesit, 1) × 10 + 10 20

Note: The weighting scheme prioritises thematic alignment (40%) as the primary measure of research validity, with sentiment, specificity, and linguistic markers each contributing 20%.
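
Written as a single function, the composite reproduces the headline figure when fed the values reported above (r = 0.80, sentiment difference 0.01, specificity difference 0.2, hesitation counts 40 vs 168). This is a re-derivation from the published formulas, not the report's own code.

```python
# Composite fidelity score (0-100) using the 40/20/20/20 weighting described above.
def fidelity_score(r: float, sentiment_diff: float, specificity_diff: float,
                   synth_hesitations: float, human_hesitations: float) -> float:
    thematic    = max(0.0, (r + 0.2) / 1.2) * 40                           # theme correlation
    sentiment   = max(0.0, 1 - abs(sentiment_diff)) * 20                   # mean sentiment gap
    specificity = max(0.0, 1 - abs(specificity_diff) / 10) * 20            # specificity gap
    linguistic  = min(synth_hesitations / human_hesitations, 1) * 10 + 10  # hesitation ratio
    return thematic + sentiment + specificity + linguistic

# Headline values from this report: r = 0.80, |sentiment diff| = 0.01,
# |specificity diff| = 0.2, hesitation markers 40 (synthetic) vs 168 (human).
print(round(fidelity_score(0.80, 0.01, 0.2, 40, 168), 1))  # -> 85.1
```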

7. Limitations & Caveats

Critical Limitation - Sample Size: With n=10 per group, statistical tests have low power (~0.3-0.4 for medium effects). P-values should be interpreted with extreme caution. This analysis is best viewed as exploratory/descriptive rather than confirmatory.

Limitation Severity Implications
LLM-on-LLM Evaluation High Claude is evaluating content generated by Claude (or similar models). The model may systematically rate AI-generated text more favourably, recognise patterns it would generate, or have blind spots for AI-specific artefacts.
Small Sample Size High n=10 per group is insufficient for robust statistical inference. Confidence intervals are wide; results may not replicate. Individual outliers have outsized influence.
Domain Specificity Medium Findings are specific to personal finance/banking research. Synthetic research may perform differently in other domains (healthcare, technology, sensitive topics).
Persona Design Medium Synthetic transcript quality depends heavily on persona design. Results reflect this specific set of personas; different persona approaches may yield different fidelity.
Fidelity Score Weights Low The 40/20/20/20 weighting scheme is a design choice. Different weightings would produce different overall scores. Individual component scores may be more useful.
Linguistic Markers Low Hesitation markers are counted via simple regex patterns. Human transcripts from Otter.ai may capture these more faithfully than intentional text generation would.
Sentiment Scale Low Sentiment is scored by LLM on a -1 to +1 scale without calibration against established sentiment analysis tools. Inter-rater reliability not assessed.

8. Reproducibility

Technical Details: