Comparative analysis of qualitative research transcripts
Over the past year, I've been circling a question that feels important for where design and research are heading: can synthetic interviews give us something useful in the early stages of understanding a problem space? Not as a gimmick, and not as a replacement for real conversations, but as a practical part of shaping hypotheses. Recently, I had the right conditions to try this properly. I was already running research with real participants, and I had enough background material to build credible personas. So I created a parallel synthetic track, running the same discussion guide through my own setup, transcript by transcript. Two studies side by side—one human, one simulated. I built my own rig rather than using one of the platforms that have emerged in this space. I wanted to get my fingers into the soil—understand the pitfalls and benefits directly—especially since design is such a fluid, problem-fitting exercise. The value of a flexible setup that can adapt to different test types felt higher than convenience, so long as I could prove it generates valuable input.
This past weekend I cleaned the data and compared the two sets, building a Python-generated report looking at thematic coverage, emotional tone, specificity, and linguistic markers worth paying attention to. The point wasn't to prove anything definitive, but to see whether there's enough correlation to argue there's value here for designers who want to move faster without cutting corners.
I went in with a strong view: synthetic research isn't a replacement for talking to real people, but it might widen the aperture of what we already think we know. There's always that early stage where the team operates on half-formed assumptions about motivations, barriers, and emotional patterns. We chip away at those a few interviews at a time, refining questions as we go. What I wanted to test was whether synthetic interviews, grounded in good persona research, could accelerate that exploration—surfacing themes in advance, pressure-testing hypotheses so the real qualitative work becomes more focused.
What stands out is that synthetic interviews track the core thematic structure surprisingly closely, but with a different texture. They're confident, fluent, tidy. Real people are contradictory, uncertain, occasionally messy. So the role I see emerging isn't synthetic versus human, but layered. Start with a hypothesis, build personas from what you know, run a synthetic round to see where the cracks form, then sharpen your questions for real conversations. That feels closer to quantitative-before-qualitative thinking, except through narrative instead of numbers. There's also the possibility of using synthetic interviews to simulate groups that are harder to recruit, creating a proxy view before spending the time and money finding them. If that works, we might treat insight as existing in different proportions, fully human, blended, or fully synthetic, each with its own strengths and blind spots. I don't think synthetic interviews add new human insight on their own. What they could do is increase the quality of the early stage by making real interviews smarter. And perhaps the interesting question isn't "can synthetic replace human?" but "what mix gets us to better design decisions sooner?"
Below are the findings from an experiment designed to find out: 10 real user interviews from a qualitative research study on financial management, compared head-to-head against 10 synthetic interviews created using AI personas grounded in existing research.
The question: If you're conducting early-stage research, can synthetic interviews give you useful insights—or will they lead you astray? We measured thematic coverage, emotional tone, specificity, and linguistic authenticity to find out.
The original synthetic testing framework was built as a DIY setup using Python and Claude (Anthropic's LLM). The process was designed to mirror established qualitative research practices, and each transcript was created and analysed in complete isolation to minimise the risk of data contamination.
Goal: This evaluation seeks to add insight to a fundamental question for modern research practice: Can synthetic (AI-generated) qualitative research reliably represent real human perspectives? We compared 10 synthetic interview transcripts, generated from detailed personas using an LLM, against 10 genuine human interview transcripts from the same research study on financial management behaviours.
Do synthetic responses discuss the same topics and themes that emerge naturally in human interviews?
Do synthetic responses exhibit realistic emotional tones and variance, or do they trend towards neutrality?
Do synthetic responses include concrete examples, named entities, and specific details like humans do?
Do synthetic responses sound natural, with hesitations, varied vocabulary, and conversational patterns?
Qualitative research is resource-intensive—recruiting participants, conducting interviews, and analysing transcripts takes significant time and budget. If synthetic research can reliably capture human perspectives, it could accelerate early-stage research, enable rapid hypothesis testing, and supplement smaller human samples. However, if synthetic responses systematically differ from human ones, using them could lead to flawed insights and misguided product decisions. This evaluation provides an evidence-based assessment of where synthetic research excels and where it falls short.
Excellent Fidelity
In this evaluation, the synthetic research closely mirrors human patterns. The overall quality is high enough to be useful for certain research applications.
| Dimension | Human | Synthetic | Verdict | Interpretation |
|---|---|---|---|---|
| Thematic Coverage (theme correlation, r) | 0.80 (16/18 themes shared) | — | Strong | Synthetic reliably identifies the same core topics. Trust it for initial theme exploration. |
| Sentiment & Emotion (average sentiment, −1 to +1) | 0.15 | 0.14 | Match | Nearly identical emotional tone. Neither overly positive nor artificially neutral. |
| Specificity & Detail (specificity score, 0–10) | 9.5 | 9.3 | Match | Both highly specific. Synthetic provides concrete answers, not vague generalisations. |
| Linguistic Authenticity (hesitation markers, avg) | 168 | 40 | Gap | Synthetic is ~4× more polished. Lacks natural "ums" and conversational hesitations. |
| Gap Type | Human | Synthetic | Impact |
|---|---|---|---|
| Generational Perspectives (financial model shifts across generations) | 20% | 0% | Completely missing intergenerational comparisons |
| Cash Flow Urgency (timing and cash flow pressures) | 70% | 30% | Underrepresents real-world financial stress |
| Theme Intensity (coverage depth per theme) | 1× | 2–4× | Over-covers themes relative to natural human depth |
Bottom Line: In this case, it appears that synthetic research is a strong complement for theme exploration and hypothesis generation, but human research remains essential for capturing generational perspectives, real-world urgency, and authentic conversational nuance.
Side-by-side comparison of human and synthetic research metrics
| Metric | Human | Synthetic | Difference | Verdict |
|---|---|---|---|---|
| Theme Correlation (Pearson r) | 0.80 | — | — | Strong |
| Average Sentiment (−1 to +1) | 0.15 | 0.14 | −0.01 | Match |
| Specificity Score (0–10) | 9.5 | 9.3 | −0.2 | Match |
| Named Entities (avg count per transcript) | 8.8 | 7.1 | −1.7 | Minor gap |
| Hesitation Markers ("um", "uh", "you know", "like"...) | 168 | 40 | −76% | Large gap |
| Vocabulary Diversity (type-token ratio) | 0.22 | 0.29 | +32% | Synthetic higher |
| Avg Sentence Length (words per sentence) | 17.7 | 14.3 | −19% | Synthetic more concise |
Thematic Coverage Comparison: This radar chart shows the prevalence of each theme across human and synthetic transcripts. Areas where the shapes overlap indicate strong thematic alignment; divergent areas highlight where synthetic research may over- or under-represent topics.
Theme Prevalence Correlation: Each point represents a theme—its position shows how often it appeared in human (x-axis) vs synthetic (y-axis) transcripts. Points near the diagonal line indicate strong correlation. Points far from the line show themes where synthetic research diverges from human patterns.
Interpretation: Strong positive correlation; synthetic themes closely match the human distribution.
| Theme | Source | Human % | Synthetic % | Difference |
|---|---|---|---|---|
| Purpose-Based Money Organisation | both | 80.0% | 100.0% | -20.0% |
| Interest Rate Chasing and Account Switching | both | 60.0% | 40.0% | 20.0% |
| Multi-Account Financial Organisation | both | 100.0% | 100.0% | 0.0% |
| Economic Uncertainty and Financial Anxiety | both | 30.0% | 10.0% | 20.0% |
| Timing and Cash Flow Management | both | 70.0% | 30.0% | 40.0% |
| Visual and Spatial Money Management Models | both | 60.0% | 80.0% | -20.0% |
| Joint and Separate Account Management | both | 10.0% | 20.0% | -10.0% |
| Mobile Banking App Dependency | both | 100.0% | 80.0% | 20.0% |
| Automated Spending Insights and Analytics | both | 50.0% | 20.0% | 30.0% |
| Preference for Detailed Control and Tracking | both | 70.0% | 70.0% | 0.0% |
| Financial Stress and Constant Mental Load | both | 50.0% | 50.0% | 0.0% |
| Generational Financial Model Shifts | human | 20.0% | 0.0% | 20.0% |
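The reported theme correlation can be reproduced directly from a prevalence table like the one above. A minimal sketch, using only the 12 themes listed here (the full 18-theme codebook is not reproduced in this report, so the value differs slightly from the headline 0.80):

```python
# Reproduce the theme-prevalence correlation from the table above.
# Only the 12 listed themes are used, so r differs slightly from the
# reported 0.80, which was computed over the full 18-theme codebook.
from scipy.stats import pearsonr

human_pct =     [80, 60, 100, 30, 70, 60, 10, 100, 50, 70, 50, 20]
synthetic_pct = [100, 40, 100, 10, 30, 80, 20, 80, 20, 70, 50, 0]

r, p_value = pearsonr(human_pct, synthetic_pct)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r ≈ 0.83 on this 12-theme subset
```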
Sentiment Distribution: Box plots showing the range of emotional tones across all transcripts. The left chart compares overall sentiment scores (-1 = negative, +1 = positive). The right chart shows within-transcript variance—high variance indicates the respondent expressed a mix of positive and negative sentiments.
Interpretation: Overall sentiment levels are similar; synthetic shows more within-transcript emotional variation than human.
Specificity Comparison: Measures how concrete and detailed the responses are. Specificity score (0-10) assesses whether responses include specific examples vs vague generalisations. Named entities count tracks mentions of specific brands, products, people, or places.
Linguistic Markers: Measures natural speech patterns. Hesitation markers ("um", "uh", "you know") indicate authentic conversation. Vocabulary diversity measures word variety. Sentence length shows conciseness. Large gaps suggest synthetic text lacks conversational naturalness.
Interpretation: Specificity levels are comparable between sources
The goal was to understand which organising principles help people manage their money most effectively, and what that might mean for a digital experience designed around customers' mental models rather than FS verticals. The evaluation compares synthetic interview transcripts (generated from detailed personas using an LLM) against genuine human interview transcripts from the same research study.
| Corpus | Source | n | Format | Preprocessing |
|---|---|---|---|---|
| Human | Real interviews conducted Nov 2025 | 10 | Otter.ai transcripts (.txt) | Speaker parsing, timestamp removal, end-of-interview detection, turn segmentation |
| Synthetic | LLM-generated from detailed personas | 10 | Markdown transcripts (.md) | Speaker marker parsing (INTERVIEWER:/PERSONA:), turn segmentation |
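The preprocessing steps in the corpus table can be sketched as below. This is a simplified illustration: the `INTERVIEWER:`/`PERSONA:` markers come from the table above, but the exact Otter.ai timestamp format is an assumption, and real transcripts may need more robust patterns.

```python
import re

def parse_synthetic_turns(text: str) -> list[tuple[str, str]]:
    """Split a synthetic transcript on INTERVIEWER:/PERSONA: speaker markers.

    Returns (speaker, utterance) pairs. The marker format follows the
    corpus table; other transcript styles would need a different pattern.
    """
    parts = re.split(r"^(INTERVIEWER|PERSONA):\s*", text, flags=re.M)
    # re.split with a capture group yields [preamble, speaker, turn, speaker, turn, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

def strip_timestamps(line: str) -> str:
    """Remove Otter.ai-style timestamps such as '0:42' or '1:02:33' (assumed format)."""
    return re.sub(r"\b\d{1,2}:\d{2}(:\d{2})?\b", "", line).strip()

sample = "INTERVIEWER: How do you organise your accounts?\nPERSONA: I keep three pots."
turns = parse_synthetic_turns(sample)
print(turns)
```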
The evaluation followed a five-stage pipeline with strict isolation constraints.
Themes were generated using a three-step isolated process to prevent corpus contamination:
Model: claude-sonnet-4-20250514 (Claude Sonnet 4) | Temperature: 0.3 (analytical) | Final codebook: 18 themes
Critical design principle: Each transcript was analysed in complete isolation to prevent context contamination. Only the master codebook was shared across analyses—no information from one transcript influenced analysis of another.
| Metric | Method | Description |
|---|---|---|
| Thematic Coding | LLM (Claude) | For each theme in codebook, identify: which turns contain it, estimated word count, example quotes. Produces theme prevalence (% of transcripts) and coverage (% of transcript devoted to theme). |
| Sentiment Scoring | LLM (Claude) | Score each participant turn on scale of -1 (very negative) to +1 (very positive). Calculate overall mean and within-transcript variance (σ²) to measure emotional range. |
| Specificity Score | LLM (Claude) | Composite 0-10 score based on: named entities (max 3 pts), specific examples (max 4 pts), personal anecdotes (max 3 pts). Formula: min(entities/3, 3) + min(examples/2, 4) + min(anecdotes, 3) |
| Hesitation Markers | Regex (Python) | Count occurrences of patterns: \bum+\b, \buh+\b, \berm+\b, \byou know\b, \bkind of\b, \bsort of\b, \bi mean\b, \blike\b, \bbasically\b, \bactually\b |
| Vocabulary Diversity | Python calculation | Type-Token Ratio (TTR): unique_words / total_words. Higher values indicate more varied vocabulary. |
| Sentence Length | Python calculation | Average words per sentence, where sentences are split on [.!?]+ patterns. |
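The three Python-computed metrics from the table above (hesitation markers, type-token ratio, sentence length) can be sketched as follows. The regex list mirrors the patterns in the table; the tokenisation details are my own simplification.

```python
import re

# Hesitation patterns from the methods table above
HESITATION_PATTERNS = [
    r"\bum+\b", r"\buh+\b", r"\berm+\b", r"\byou know\b", r"\bkind of\b",
    r"\bsort of\b", r"\bi mean\b", r"\blike\b", r"\bbasically\b", r"\bactually\b",
]

def hesitation_count(text: str) -> int:
    """Count hesitation markers using the regex list above."""
    lowered = text.lower()
    return sum(len(re.findall(p, lowered)) for p in HESITATION_PATTERNS)

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique words / total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

def avg_sentence_length(text: str) -> float:
    """Average words per sentence, splitting on [.!?]+ as in the methods table."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "Um, I sort of track everything. It works. Basically."
print(hesitation_count(sample))  # "um", "sort of", "basically" → 3
```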
Pearson's product-moment correlation coefficient (r) calculated between human and synthetic theme prevalence vectors:
Implementation: scipy.stats.pearsonr() | Interpretation: r ≥ 0.8 = strong, 0.6-0.8 = moderate, <0.6 = weak
Independent samples t-tests for sentiment and specificity scores:
Implementation: scipy.stats.ttest_ind() | Assumes unequal variances (Welch's t-test)
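An illustrative Welch's t-test, as used for the sentiment and specificity comparisons. The per-transcript score arrays below are invented for demonstration (only group means were reported, not raw values); they are chosen to average 0.15 and 0.14 to match the headline sentiment figures.

```python
# Illustrative Welch's t-test for group sentiment comparison.
# NOTE: these arrays are hypothetical stand-ins, not the study's raw data.
from scipy.stats import ttest_ind

human_sentiment = [0.10, 0.22, 0.05, 0.18, 0.12, 0.20, 0.08, 0.25, 0.15, 0.15]
synth_sentiment = [0.12, 0.18, 0.10, 0.16, 0.11, 0.19, 0.09, 0.20, 0.14, 0.11]

# equal_var=False selects Welch's t-test (unequal variances assumed)
t_stat, p_value = ttest_ind(human_sentiment, synth_sentiment, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # small mean gap → non-significant
```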
Composite score (0-100) calculated as weighted sum of four dimensions:
| Component | Weight | Formula | Max Points |
|---|---|---|---|
| Thematic | 40% | max(0, (r + 0.2) / 1.2) × 40 |
40 |
| Sentiment | 20% | max(0, (1 - |mean_diff|)) × 20 |
20 |
| Specificity | 20% | max(0, (1 - |score_diff| / 10)) × 20 |
20 |
| Linguistic | 20% | min(synth_hesit / human_hesit, 1) × 10 + 10 |
20 |
Note: The weighting scheme prioritises thematic alignment (40%) as the primary measure of research validity, with sentiment, specificity, and linguistic markers each contributing 20%.
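Plugging the headline metrics reported above into these component formulas gives a rough composite score. Note this uses the rounded published numbers, so the pipeline's internal score (computed from unrounded values) may differ slightly.

```python
# Recompute the composite fidelity score from the component formulas,
# using the rounded headline metrics reported in this evaluation.
r = 0.80                        # theme correlation
mean_diff = abs(0.15 - 0.14)    # sentiment gap
score_diff = abs(9.5 - 9.3)     # specificity gap
human_hesit, synth_hesit = 168, 40

thematic    = max(0.0, (r + 0.2) / 1.2) * 40               # ≈ 33.3
sentiment   = max(0.0, 1 - mean_diff) * 20                 # ≈ 19.8
specificity = max(0.0, 1 - score_diff / 10) * 20           # ≈ 19.6
linguistic  = min(synth_hesit / human_hesit, 1) * 10 + 10  # ≈ 12.4

fidelity = thematic + sentiment + specificity + linguistic
print(f"Fidelity score: {fidelity:.1f} / 100")  # ≈ 85.1 from rounded inputs
```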
Critical Limitation - Sample Size: With n=10 per group, statistical tests have low power (~0.3-0.4 for medium effects). P-values should be interpreted with extreme caution. This analysis is best viewed as exploratory/descriptive rather than confirmatory.
| Limitation | Severity | Implications |
|---|---|---|
| LLM-on-LLM Evaluation | High | Claude is evaluating content generated by Claude (or similar models). The model may systematically rate AI-generated text more favourably, recognise patterns it would generate, or have blind spots for AI-specific artefacts. |
| Small Sample Size | High | n=10 per group is insufficient for robust statistical inference. Confidence intervals are wide; results may not replicate. Individual outliers have outsized influence. |
| Domain Specificity | Medium | Findings are specific to personal finance/banking research. Synthetic research may perform differently in other domains (healthcare, technology, sensitive topics). |
| Persona Design | Medium | Synthetic transcript quality depends heavily on persona design. Results reflect this specific set of personas; different persona approaches may yield different fidelity. |
| Fidelity Score Weights | Low | The 40/20/20/20 weighting scheme is a design choice. Different weightings would produce different overall scores. Individual component scores may be more useful. |
| Linguistic Markers | Low | Hesitation markers are counted via simple regex patterns. Human transcripts from Otter.ai may capture these more faithfully than intentional text generation would. |
| Sentiment Scale | Low | Sentiment is scored by LLM on a -1 to +1 scale without calibration against established sentiment analysis tools. Inter-rater reliability not assessed. |
Technical Details: `results/evaluation_*/individual_analyses/`