£51 million and no receipts: is AI-generated research the cost-effective, speedy option it seems?

Venture capital is pouring into synthetic research tools. I ran a parallel study—human vs synthetic—and the findings complicate the simple narratives on both sides.

Over the past eighteen months, venture capital has poured into tools promising to transform how we understand users. Outset AI alone has raised £51 million across two funding rounds. Synthetic Users claims Gartner recognition and enterprise clients. Microsoft, Nestlé, Uber, and HubSpot are buying in. The pitch is seductive: "Get the depth of qualitative interviews at the speed and scale of a survey." One person doing the work of an entire research team. Eight times faster. Eighty-one per cent cheaper.

And yet, for all this investment and adoption, something is conspicuously absent from the conversation. There are no documented case studies of product decisions that went wrong because of synthetic research. No post-mortems. No longitudinal validation tracking whether AI-generated insights actually led to better outcomes. The industry is adopting at pace while flying blind.

Can synthetic interviews give us something useful in the early stages of understanding a problem space? Not as a gimmick or a replacement for real conversations, but as a practical part of shaping hypotheses. Recently, I had the right conditions to test this, and the findings complicate the simple narratives on both sides. The full research report—methodology, codebook, and detailed analysis—is available here if you want to go deeper.

The discourse is stuck

The public conversation around synthetic research has calcified into predictable positions. On one side, vendors promise "democratisation" and "depth at scale." Synthetic Users markets itself with the line "user research without the headaches" and claims their AI participants are "human, with cognitive quirks and kinks." Greylock, the VC firm, declares that "research is no longer bottlenecked by calendars, bandwidth, or headcount."

On the other side, respected practitioners have drawn hard lines. Erika Hall, co-founder of Mule Design and author of Just Enough Research, calls it "unethical, indefensible, and also unnecessary, to create a product or service that affects other people, without having conversations with representatives of those populations." Jared Spool describes synthetic user technology as "absolutely the wrong direction for UX professionals to go." Greg Nudelman, writing in his UX for AI newsletter, suggests the whole category "should be abandoned on the trash heap of history."

Nielsen Norman Group has produced the closest thing to an independent evaluation, testing Synthetic Users and ChatGPT against real user studies. Their conclusion lands somewhere cautious: synthetic users might be acceptable for "desk research and generating hypotheses" but are "incredibly risky" for validation. They found that synthetic responses "felt one-dimensional" and noted a troubling pattern: "Real people care about some things more than others. Synthetic users seem to care about everything. This is not helpful for feature prioritisation."

What's missing from both camps is actual evidence: practitioners testing in real conditions, with real comparisons and honest limitations. The vendors cite impressive-sounding statistics without publishing methodology. The critics argue from principle, which is valid, but principle doesn't tell you what to do when the budget won't stretch to twenty interviews and the roadmap decision is due next week.

Why I built my own rig

To explore this, rather than using one of the existing platforms, I built my own setup using code, folders, and an LLM. I wanted to get my fingers into the soil and understand what's actually happening behind the curtain. The SaaS tools are effectively black boxes. You feed in a brief, you get back transcripts or insights, and somewhere in the middle... what, exactly? How much rigour is in there? How much complexity? Or is it just a language model with a nice interface and another monthly subscription?

Going DIY meant I could plan and observe every step. I could iterate on the prompts, adjust the persona construction, and understand where the process was robust and where it was fragile. Design is a problem-fitting exercise, so the value of a flexible setup that adapts to different research questions felt higher than convenience, as long as I could prove it generated something useful.

As luck would have it, I had recently delivered a round of qualitative design research with real human participants: ten interviews exploring how people manage their money. That work provided enough background material to build credible synthetic personas. So I created a parallel track: the same discussion guide, run through synthetic participant interviews. Two studies side by side, one human, one simulated.
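For the curious, here's roughly the shape of the rig. This is a minimal sketch rather than my exact pipeline: the persona fields, the discussion-guide questions, and the choice of the OpenAI chat API are all illustrative stand-ins.

```python
# Minimal sketch of a DIY synthetic-interview rig.
# Uses the OpenAI Python SDK purely as a stand-in; any chat-capable LLM works the same way.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A persona distilled from real background material (fields are illustrative).
persona = {
    "age": 34,
    "situation": "renting, irregular freelance income",
    "money_habits": "checks banking app daily, moves spare cash into savings pots",
}

# Two example questions standing in for the real discussion guide.
DISCUSSION_GUIDE = [
    "Tell me about how you keep track of your money month to month.",
    "Walk me through the last time an unexpected bill arrived.",
]

SYSTEM_PROMPT = (
    "You are roleplaying a research participant in a qualitative interview "
    f"about personal finance. Stay in character: {persona}. "
    "Answer in the first person, conversationally, with concrete examples."
)

def run_interview(questions: list[str]) -> list[dict]:
    """Run the discussion guide against one synthetic persona, keeping conversational context."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    transcript = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        transcript.append({"question": question, "answer": answer})
    return transcript

if __name__ == "__main__":
    for turn in run_interview(DISCUSSION_GUIDE):
        print(turn["question"], "\n>", turn["answer"], "\n")
```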

The 85% finding

At the end of the process, I had two sets of transcripts — ten human, ten synthetic — analysed against the same topic codebook. How closely do they align? Where do they overlap, and where do they diverge? To explore that, I built a composite fidelity score measuring four dimensions: thematic coverage (do they surface the same topics?), sentiment (do they express similar emotional tones?), specificity (do they include concrete details rather than vague generalisations?), and linguistic authenticity (do they sound like real conversation?).

The overall fidelity score was 85.1 out of 100. That number sits in uncomfortable territory: too high to dismiss, too imperfect to trust blindly.
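The exact weighting and per-dimension scoring are in the full report. As a rough sketch, though, a composite like this is simply a weighted combination of the four dimension scores. The numbers and equal weights below are illustrative, not the report's actual values:

```python
# Sketch of assembling a composite fidelity score from four dimension scores.
# All numbers and the equal weighting are illustrative, not the values from the report.

def composite_fidelity(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores, each already normalised to 0-100."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[dim] * weight for dim, weight in weights.items())

dimension_scores = {
    "thematic_coverage": 89.0,        # do the same topics surface?
    "sentiment_alignment": 95.0,      # similar emotional tone?
    "specificity": 92.0,              # concrete detail rather than vague generalisation?
    "linguistic_authenticity": 64.0,  # does it sound like real conversation?
}
equal_weights = {dimension: 0.25 for dimension in dimension_scores}

print(composite_fidelity(dimension_scores, equal_weights))  # 85.0 with these illustrative numbers
```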

Breaking it down reveals where synthetic research tracks closely and where it falls apart.

Chart: prevalence of each theme across human vs synthetic transcripts. Where the shapes overlap, alignment is strong; where they diverge (e.g. Cash Flow, Rate Chasing), synthetic under- or over-represents. Theme correlation: 0.80. Data from the same evaluation as the full report.

Theme correlation: 0.80, a strong positive relationship. Synthetic interviews identified the same core topics that emerged naturally in human conversations. Sixteen of eighteen themes appeared in both corpora. For initial exploration of a problem space, this suggests synthetic research could reliably surface the territory you'll want to investigate further.
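If you want to run this kind of check yourself, theme correlation can be computed as Pearson's r over per-theme prevalence: the share of transcripts in each corpus that surfaces a given theme. The themes and figures here are illustrative, not my actual codebook:

```python
# Sketch: theme correlation as Pearson's r over per-theme prevalence (Python 3.10+).
# Theme names and prevalence figures are illustrative, not the study's codebook.
from statistics import correlation

themes = ["budgeting", "savings_pots", "cash_flow", "rate_chasing", "app_trust", "fees"]

# Share of transcripts in each corpus that surfaced the theme (0..1).
human_prevalence = [0.9, 0.8, 0.7, 0.2, 0.6, 0.5]
synthetic_prevalence = [0.9, 0.8, 0.3, 0.4, 0.6, 0.5]

r = correlation(human_prevalence, synthetic_prevalence)
print(f"theme prevalence correlation r = {r:.2f}")
```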

Sentiment: nearly identical; human transcripts averaged 0.15 on a scale from -1 (negative) to +1 (positive); synthetic transcripts averaged 0.14. Neither source was artificially positive or skewed neutral. This challenges the common criticism that AI-generated responses exhibit sycophancy, a tendency to please that produces unrealistically favourable feedback. In this study, at least, that pattern didn't emerge.

Specificity: comparable; both sources scored above 9 out of 10 for concrete detail. Synthetic responses included named entities, specific examples, and personal anecdotes at rates comparable to those of human participants. They weren't vague generalisations dressed up as insight.
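How do you score specificity? One crude proxy, and it is only a proxy, is named-entity density per response. A minimal sketch, assuming spaCy and its small English model:

```python
# Sketch: named-entity density as a rough proxy for concrete detail.
# Assumes spaCy with its small English model installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density(response: str) -> float:
    """Named entities per 100 tokens; higher roughly means more concrete detail."""
    doc = nlp(response)
    return 100 * len(doc.ents) / max(len(doc), 1)

print(entity_density("I moved £200 into my savings pot the Friday after payday."))
```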

But then the gaps.

Hesitation markers: a 4× deficit; human transcripts contained an average of 168 hesitation markers per interview (um, uh, you know, sort of, I mean). Synthetic transcripts averaged 40. This isn't trivial. The hesitations, contradictions, and false starts in human speech aren't noise to be cleaned up; they're often where the insight lives. The pause before answering a question about money. The "well, actually..." that reveals someone is about to contradict what they just said. The tangent that surfaces an emotional undercurrent the discussion guide never anticipated.
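Counting these markers is mechanical, which makes the deficit easy to verify. A minimal version of the counter, using the marker list above:

```python
# Sketch: counting hesitation markers per transcript with a single regex pass.
import re

HESITATION_MARKERS = [r"\bum\b", r"\buh\b", r"\byou know\b", r"\bsort of\b", r"\bi mean\b"]
PATTERN = re.compile("|".join(HESITATION_MARKERS), flags=re.IGNORECASE)

def hesitation_count(transcript: str) -> int:
    """Count occurrences of common hesitation markers in one interview transcript."""
    return len(PATTERN.findall(transcript))

print(hesitation_count("Um, I mean, it's not like I'm bad with money, but... you know."))  # 3
```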

The synthetic interviews are confident, fluent, and tidy. Real people are contradictory, uncertain, and occasionally messy. That messiness is a signal.

Chart: hesitation markers per interview (um, uh, you know, sort of, I mean). Human transcripts averaged 168; synthetic averaged 40, a 4× deficit where much of the emotional signal lives.

Generational perspectives: completely absent; 20% of human transcripts surfaced shifts in how families handle money across generations. How parents and grandparents handled money. How expectations have changed. This theme appeared in zero synthetic transcripts. The personas were detailed, but they existed in a kind of eternal present, without the lived texture of family history and generational comparison.

Cash flow urgency: significantly underrepresented; 70% of human participants discussed timing and cash flow pressures, while only 30% of synthetic responses addressed this theme. The real-world stress of when money arrives versus when bills are due, the juggling, the anxiety, the coping strategies: none of it came through with anything like the same intensity.

Chart: where synthetic research falls short; percentage of transcripts that surfaced each theme. Generational perspectives and cash flow urgency are under-represented in synthetic responses.

What the texture difference tells us

These gaps point to something important about what synthetic research can and cannot do. It's not that synthetic interviews are wrong, exactly. The themes are real. The structure is sound. But the texture, the quality of lived experience that makes qualitative research valuable, is flattened.

Consider what the hesitation markers represent. When someone says, "I mean, it's not like I'm bad with money, but..." and trails off, that pause contains information. Maybe shame. Maybe a story they're deciding whether to tell. Maybe a contradiction between how they see themselves and how they actually behave. A synthetic response will give you the content, the theme of financial self-perception, but not the weight of it.

Or consider the generational absence. When a human participant mentions that their dad kept cash in envelopes for different purposes, and they've tried to recreate that system digitally, you're getting a window into how financial behaviour is transmitted and adapted across generations. Synthetic personas, however detailed, don't have parents. They don't have memories of watching someone else manage money. They exist fully-formed, without the sedimentary layers that shape how real people think about these things.

This isn't a failure of the technology, exactly. It's a category difference. Synthetic research gives you the structure of a problem space. Human research gives you the texture. The question isn't which is "right" but whether your research question needs structure, texture, or both.

So what are you paying for?

When you subscribe to a synthetic research tool, you're buying a black box. The marketing promises "most human-like AI participants" and "depth at scale." The testimonials cite impressive alignment percentages. But you can't see the prompts. You can't examine how the personas are constructed. You can't verify whether the analysis pipeline introduces its own biases. You get outputs, and you're asked to trust them.

My experiment produced an 85% fidelity score. That's genuinely useful for certain purposes. But I also know where that number comes from, what it measures, and, crucially, where the gaps are. I know that generational context is missing. I know the urgency of cash flow is underrepresented. I know the hesitation markers that signal emotional weight are smoothed away.

If I were using a commercial platform, would I know any of that? Would the tool tell me that 20% of human transcripts surface intergenerational themes and 0% of synthetic ones do? Or would it just give me confident, fluent insights and leave me to assume they're complete?

This is the position designers and researchers increasingly find themselves in. The budget is tight. The timeline is compressed. The stakeholders want answers. And here's a tool that promises qual at quant speed for a fraction of the cost. The pressure to reach for it is real.

But without transparency about what's happening under the hood, without knowing what the 85% includes and what the missing 15% might contain, how do you make an informed decision about where synthetic research fits in your organisation? How do you know when you're accelerating insight and when you're just generating plausible-sounding text?

The accountability question

Here's what concerns me. Fifty-one million pounds has been invested in Outset alone. Enterprise adoption is accelerating, and Microsoft's Head of Research for AI publicly endorses "AI-augmented research" as "here." The vendor claims are confident: "8x faster," "81% less expensive," "comparable insights."

But no one is tracking outcomes.

There's a conspicuous absence in the discourse around synthetic research: no one is systematically measuring whether decisions made on the basis of AI-generated insights turn out to be good decisions. No longitudinal validation studies. No failure post-mortems. No accountability frameworks for when synthetic research leads product teams astray.

The NN/g evaluation is widely cited, but it's a methodological assessment, not an outcome study. It tells you how synthetic responses compare to human responses in structure and content. It doesn't tell you whether teams that used synthetic research shipped better products, avoided more mistakes, or understood their users more accurately over time.

This matters because the "good enough" argument only holds if we know what "good enough" means in practice. The claim that synthetic research can replace some percentage of human research assumes we can identify which percentage, for which questions, in which contexts. Without outcome data, we're guessing.

My experiment is one data point. Small sample, ten transcripts per group. Specific domain, personal finance and banking. Acknowledged limitations. But it's more than most vendors are offering. And it suggests that the answer to "can synthetic research work?" is more nuanced than either the enthusiasts or the critics allow.

What the industry needs

If synthetic research is going to mature as a methodology, and the investment suggests it will, regardless of practitioner scepticism, the field needs several things it currently lacks.

Published failure cases. Every methodology has failure modes. Surveys can mislead. Ethnography can miss the forest for the trees. Usability testing can optimise for the wrong thing. We learn from documented mistakes. Where are the post-mortems from teams that relied on synthetic research and got burned? Either they don't exist, which would be remarkable given the adoption rate, or no one is publishing them, which suggests a transparency problem.

Longitudinal validation. Do synthetic-informed decisions hold up? If you use AI-generated insights to prioritise a roadmap, does that roadmap perform better than one prioritised without them? This is the basic question, and no one seems to be asking it systematically.

Reporting standards. Qualitative research has established frameworks for methodological transparency, the Standards for Reporting Qualitative Research (SRQR), for instance. Synthetic research has nothing comparable. What constitutes adequate persona construction? How should synthetic findings be labelled when presented to stakeholders? What confidence adjustments should apply? The field operates without guardrails.

Honest conversation about power. "Democratisation" is framed as unambiguously positive by every vendor in this space. But democratisation of what, exactly, and for whom? If synthetic research makes it easier for product managers to bypass research teams, is that democratisation or deskilling? If junior researchers never learn to sit with awkward silences in real interviews because AI handles "the busywork," what happens to design intuition over time? These questions aren't being asked.

The interesting question

The debate has been stuck on "can synthetic replace human?" That's the wrong question. It assumes a binary that doesn't match how research actually works in practice.

The more useful question is: what mix gets us to better design decisions sooner?

My data suggest that synthetic research closely tracks human research in terms of structure, themes, topics, and the broad territory of a problem space. It falls short on texture, hesitations, generational context, and the emotional weight that distinguishes urgent problems from merely interesting ones.

A layered approach might look like this: synthetic research to map the territory, sharpen your hypotheses, and identify the questions worth asking. Human research to understand what it actually feels like to live with the problem you're trying to solve. Neither replaces the other. The proportions shift depending on what you're trying to learn.

That's a more modest claim than the vendors are making. It's also a more optimistic one than the critics allow. But until someone starts tracking outcomes, until we have evidence about what actually works, not just arguments about what should work, we're all still guessing.

The £51 million deserves some receipts.

References