This guide draws on several years of working with LLM-enabled features across complex platforms. But it was shaped by one project above all: an LLM-enabled feature for AstraZeneca's clinical trials platform, with a highly complex expert user base, a regulated environment, and real consequences for getting the model's behaviour wrong.
Working at that fidelity, I saw very quickly that the quality of the model's responses wasn't just a technical concern. It was a design concern. How the LLM behaved in the interaction, what it said, how it said it, when it deferred and when it committed. Those were UX decisions, not engineering ones. That realisation is where this process came from.
Designing an LLM-enabled feature isn't like designing a form, a flow, or a component. You're not mapping every state and every eventuality. You're learning to speak to a model precisely, while also making room for its nondeterministic nature to do something useful.
I would imagine that shift in process will catch a lot of designers off guard. The tools are unfamiliar and appear overly technical. The LLM outputs aren't consistent. And the usual instinct (to specify everything tightly) works against you. An overconstrained model produces outputs that feel robotic. One that's under-directed goes off-piste, eroding user trust.
The good news: shaping the environment in which a model operates is a design task. It doesn't require you to write code. It requires the same things good design always has: a clear sense of intent, a method for testing whether you've achieved it, and the discipline to iterate with purpose rather than instinct.
This guide gives you that method.
The core idea
One loop, repeated deliberately
The biggest practical difference when designing with LLMs is that you can't fully specify the output. You can only shape the conditions that produce it. That means the design work happens in iteration, not in specification. And iteration without structure is just vibing.
The designer's eval loop gives that iteration structure. It treats the system prompt as a design material (something you craft, test, score, and version) rather than a configuration detail you fiddle with until something feels approximately right. And it ensures that the understanding you build along the way doesn't evaporate when the prototype ends.
The loop isn't about achieving perfection before engineering starts. It's about learning systematically and leaving a trail.
The components
What you're actually working with
Before running the loop, it helps to understand the six components you'll be shaping and testing. They're the design materials of an LLM feature.
Five of them are what you typically roll into one call to the model: system prompt, few-shot examples (if you use them), the user prompt that triggers this turn, the format (whether you want plain text, JSON, or another shape), and the temperature (how much randomness to allow). Added together, that bundle is the input you post; the completion is what you get back. The sixth component, evaluations, sits outside that user-facing request. It is the scenario set you use to stress-test whether that bundle behaves the way you intended.
Evaluations reuse the same stack, but as a controlled test harness: fixed inputs and expectations, not the open-ended mix a real user sends.
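That replay can be sketched as a tiny loop: the same bundle, fixed scenarios, nothing user-supplied. This is an illustrative sketch, not a real client — `call_model`, the prompt text, and the scenario strings are all placeholders you'd swap for your own.

```python
# Sketch of an eval run: one prompt bundle, replayed over fixed scenarios.
# `call_model` is a stand-in for whatever client your prototype uses
# (OpenAI, Anthropic, etc.); here it's stubbed so the loop is runnable.

SYSTEM_PROMPT = "You are a calm, concise support assistant."
TEMPERATURE = 0.3

def call_model(system: str, user: str, temperature: float) -> str:
    # Stub: replace with a real API call in your prototype.
    return f"[model reply to: {user!r}]"

eval_set = [
    "I placed an order yesterday and haven't got a confirmation email. What do I do?",
    "Can you help me with my order and also, do you do gift wrapping?",
    "I want a refund NOW.",
]

def run_evals():
    # Fixed inputs, identical settings: the only variable between runs
    # is the prompt version you're testing.
    return [
        {"input": scenario, "output": call_model(SYSTEM_PROMPT, scenario, TEMPERATURE)}
        for scenario in eval_set
    ]

results = run_evals()
```

The point of the structure is that every run is comparable: same scenarios, same temperature, same format, with only the system prompt changing between versions.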
System prompt
How do I shape the model's personality, tone, and boundaries?
Every model interaction has two parts: the system prompt (invisible to the user) and the user message. The system prompt sets personality, rules, tone, and constraints before any conversation begins. It's the closest thing to a design brief the model will ever receive.
Anthropic’s prompt engineering overview — directly relevant, well explained, and broadly transferable across models.
Few-shot examples
How do I get the model to respond in exactly the style I need?
Describing the format you want is less effective than demonstrating it. Few-shot examples are sample input/output pairs placed in the prompt. They anchor the model's understanding of exactly what you expect far more reliably than instructions alone.
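In practice, few-shot pairs sit in the message list ahead of the real user turn. The sketch below uses the common role/content chat-message shape; exact field names vary by provider, and the content strings are invented for illustration.

```python
# Few-shot pairs go into the message list before the real user turn.
# The role/content shape below is the common chat-message convention.

messages = [
    # Example 1: demonstrate the tone and shape you want back.
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "I can help with that. Could you share your order number?"},
    # Example 2: demonstrate how to decline out-of-scope requests.
    {"role": "user", "content": "Can you change my bank details?"},
    {"role": "assistant", "content": "I can't change account details here, but I can connect you with someone who can."},
    # The real user turn always comes last.
    {"role": "user", "content": "My parcel hasn't arrived and I'm going away tomorrow."},
]
```

Two or three well-chosen pairs usually shift behaviour more than a paragraph of format instructions.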
Anthropic’s interactive prompt tutorial (GitHub) — hands-on and accessible without being code-heavy.
User prompt (trigger)
What actually kicks off the generation in my feature?
The message that triggers generation, either typed directly by a user or constructed by the application from context (a selected item, a form value, a data record). In many features this is partly or fully generated by your product, not typed by hand.
Format
How do I get a consistent, structured output I can actually build UI against?
If you're mapping output to UI components, structured output like JSON is an integration you can reason about. Prose isn't. The format is a critical design material that dictates how your application parses the model's response.
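Even when you ask for JSON, the model can occasionally return something malformed, so parsing is part of the design too. A minimal sketch (the field names `answer` and `action` are invented for illustration):

```python
import json

# If you ask for JSON, treat parsing as part of the design: a malformed
# reply should degrade gracefully, not crash the UI.

def parse_reply(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to showing the raw text rather than failing silently.
        return {"status": "unparsed", "text": raw}
    return {"status": "ok", **data}

reply = parse_reply('{"answer": "Your order shipped today.", "action": "track"}')
```

The fallback branch is itself a UX decision: what the user sees when the structure breaks is part of the feature.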
Temperature
How do I balance predictability with creativity?
Temperature controls how probabilistic the model's decisions are: low for predictable output, higher for varied responses. It determines whether your feature feels rigid and robotic, or fluid and spontaneous.
promptingguide.ai — broad reference for temperature and related parameters in plain language.
Evaluations (evals)
How do I test something that doesn't behave the same way twice?
Unlike a button or a form field, a model doesn't behave the same way twice. Evaluations are predefined test cases you run against your prompt deliberately and repeatedly, building a set of documented, repeatable boundaries you can actually reason about. They're how you turn "I think this works" into "I can show this works."
Bringing it all together
This is an example of what the final API call could look like when it's sent to the model. By the time it leaves your application, all of the pieces you've designed independently are packed together into a single request.
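A sketch of that assembled request, in the common chat-completions shape. Field names vary between providers (see the references below), and the model id, prompt text, and messages here are placeholders, not a spec:

```python
# All five user-facing components packed into one request.
# Illustrative only: field names differ slightly across providers.

request = {
    "model": "model-name-here",                  # placeholder, not a real model id
    "temperature": 0.3,                          # low: predictable support answers
    "response_format": {"type": "json_object"},  # format: ask for structured output
    "messages": [
        # System prompt: the design brief, invisible to the user.
        {"role": "system", "content": "You are a calm, concise support assistant..."},
        # Few-shot example pair demonstrating the expected shape.
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": '{"answer": "Could you share your order number?"}'},
        # The user prompt that triggers this turn.
        {"role": "user", "content": "I haven't got a confirmation email. What do I do?"},
    ],
}
```

The sixth component, evaluations, never appears in this payload; it's the harness you run the payload through.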
OpenAI API Reference — learn how messages, temperature, and formatting are structured for ChatGPT models.
Anthropic API Reference — learn how system prompts and messages are structured for Claude models.
The loop
Six steps, repeated deliberately
Step one
Build your eval set before you start iterating
Your eval set is a small, consistent collection of test inputs you run against every version of your prompt. Without it, you're comparing apples to oranges: every test is different, so you can't see whether changes actually helped.
Start with five to eight scenarios. You don't need more. Cover three types:
Scenario types
The most common thing a real user will actually ask. Phrase these the way users would, not the way a designer would.
Example (support assistant feature): "I placed an order yesterday and haven't got a confirmation email. What do I do?"
Inputs that are valid but unusual. Ambiguous phrasing, multiple questions at once, requests outside the feature's scope.
Example: "Can you help me with my order and also, do you do gift wrapping? Also my wife wants to know if you ship to France."
Inputs you suspect the model will handle badly. Include these early; they're the most informative.
Example: "This is a joke right? Your website is completely broken and nobody has replied to my emails for two weeks. I want a refund NOW."
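One low-effort way to keep this set honest is to store it as data from day one: one entry per scenario, tagged by type, so the same file can drive manual runs now and tooling later. A minimal sketch using the examples above:

```python
# The starter eval set as data. Tagging each scenario by type makes the
# coverage visible at a glance: common, edge, and failure-prone inputs.

eval_set = [
    {
        "type": "common",
        "input": "I placed an order yesterday and haven't got a confirmation email. What do I do?",
    },
    {
        "type": "edge",
        "input": "Can you help me with my order and also, do you do gift wrapping? Also my wife wants to know if you ship to France.",
    },
    {
        "type": "failure-prone",
        "input": "This is a joke right? Your website is completely broken and nobody has replied to my emails for two weeks. I want a refund NOW.",
    },
]
```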
Evidently AI on designing an LLM evaluation framework — one of the clearer non-engineering explanations of how to structure evals around real application behaviour.
Anthropic: demystifying evals for AI agents — how product teams build eval suites; strong grounding for the qualitative approach here.
Step two
Score outputs across four dimensions
Resist the urge to rely on a general feeling about whether an output is good or bad. Four specific dimensions give you enough structure to see patterns across versions, which is the only way to know if you're actually improving.
| Dimension | What you're asking | ✓ | △ | ✗ |
|---|---|---|---|---|
| Tone & voice | Does it feel right for this context and user? | Consistent, appropriate | Mostly right, occasional slip | Wrong register, off-brand |
| Reasoning quality | Is it thinking through the problem correctly? | Clearly reasoned, accurate | Partly right, gaps present | Confused, wrong, or hallucinating |
| Edge case handling | Does it degrade gracefully under pressure? | Handles well, stays in scope | Wobbly but recoverable | Breaks, refuses, or goes off-piste |
| Output structure | Is the format and presentation useful? | Clear, appropriately formatted | Works but could be tighter | Hard to read, wrong format |
Use a simple scoring sheet: one row per scenario, four columns for dimensions, and a notes column for observations.
| Scenario | Tone | Reasoning | Edge handling | Structure | Notes |
|---|---|---|---|---|---|
| Missing confirmation | ✓ | ✓ | – | △ | Too long, buries the action |
| Multi-question input | ✓ | △ | △ | △ | Picks one question, ignores rest |
| Angry refund demand | △ | ✓ | ✗ | ✓ | Gets defensive, tone breaks down |
When you track these specific scenarios across multiple versions of your prompt, you start to see the actual trajectory of your design work without losing the nuance of individual test cases. Aggregating scores hides the truth; a matrix reveals it. You can spot exactly when an intervention fixed one problem but introduced a regression in a specific edge case.
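That per-cell comparison is mechanical enough to sketch as code. Mapping ✓/△/✗ to 2/1/0 (an assumed encoding, not part of the method itself), a regression is any cell that got worse between two versions:

```python
# Scores per version, one dict per scenario: ✓ = 2, △ = 1, ✗ = 0.
# Comparing versions cell by cell surfaces regressions that an
# aggregate average would hide.

v1 = {"angry-refund": {"tone": 1, "reasoning": 2, "edge": 0, "structure": 2}}
v2 = {"angry-refund": {"tone": 2, "reasoning": 2, "edge": 1, "structure": 1}}

def regressions(old, new):
    # Any cell that dropped between versions is a regression,
    # even if the scenario improved overall.
    return [
        (scenario, dim)
        for scenario in old
        for dim in old[scenario]
        if new[scenario][dim] < old[scenario][dim]
    ]

found = regressions(v1, v2)  # structure slipped from ✓ to △
```

Here the angry-refund scenario improved on tone and edge handling, yet the matrix still flags the structure regression an average score would have hidden.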
Step three
Diagnose first, change second
This step is easy to skip, but it's where the learning actually lives. Before touching the prompt, force yourself to write down what you think is causing the problem and what you're going to try.
The discipline: one diagnosis, one hypothesis, one change. If you change multiple things at once, you'll never know what actually worked.
What failed: On the angry refund scenario, the model's tone became defensive and slightly combative from the second paragraph onwards.
Why I think it happened: The system prompt tells the model to "maintain professionalism" but doesn't give it a framework for de-escalation. It's interpreting "professional" as "formal" under pressure.
What I'll try: Add a specific instruction about acknowledging emotion before addressing the problem. Something like: "When a customer expresses frustration, your first priority is to acknowledge their experience before offering a solution."
Step four
Version every meaningful change
A version isn't just a saved file. It's a record of a design decision. Name versions to reflect the change, not a number sequence. That makes them scannable later.
What changed: Added explicit de-escalation instruction before the response format guidance.
Why: Model was getting defensive on frustrated user inputs, interpreting "professional" as "formal" under pressure.
Expected effect: Explicitly prioritising emotional acknowledgement before solution delivery should improve tone on difficult scenarios.
Result: Tone improved significantly on angry refund scenario. Multi-question handling still △; will address next.
v[number]-[what-changed]. It takes ten seconds and makes the entire trail scannable at a glance six weeks later.
A minimal prompt template to version against
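One way to hold that record is a small structured stub. This is a hypothetical shape, not a prescribed schema; the fields simply mirror the change/diagnosis/expectation/result trail described above:

```python
from dataclasses import dataclass

# A minimal, hypothetical shape for a versioned prompt record.
# The fields mirror the change / diagnosis / expectation / result trail.

@dataclass
class PromptVersion:
    name: str            # e.g. "v4-deescalation-instruction"
    prompt: str          # the full system prompt text at this version
    what_changed: str
    why: str             # the diagnosis behind the change
    expected: str        # the hypothesis being tested
    result: str = ""     # filled in after re-running the eval set

v4 = PromptVersion(
    name="v4-deescalation-instruction",
    prompt="You are a calm, concise support assistant...",
    what_changed="Added explicit de-escalation instruction before the response format guidance.",
    why='Model interpreted "professional" as "formal" under pressure.',
    expected="Tone should improve on frustrated-user scenarios.",
)
```

A plain text file with the same headings works just as well; what matters is that every version carries its reasoning with it.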
Step five
Grow the eval set when behaviour surprises you
Your eval set is a living document. When the model does something unexpected, good or bad, that's a signal to add a new scenario. Don't just note it and move on.
Three triggers for adding a new eval
The model does something you didn't anticipate: it handles an unusual request gracefully, or fails in a way you hadn't considered. Add the input that caused it as a permanent scenario.
While iterating, you realise there's a whole type of input you haven't tested. A support assistant, for example, might need scenarios for: requests in other languages, inputs with personally sensitive information, questions that require a human handoff.
One failing scenario often implies a family of related ones. If the model handles "I want a refund" badly, test "I want to return this," "Can I cancel my order," and "This isn't what I ordered." They may all be failing for the same reason.
Going further
Let design research feed new evals
Qualitative research isn't just for validating the feature. It's one of the best sources of novel eval inputs. Real users phrase things in ways designers don't anticipate. Their mental models expose gaps in how the feature was conceived. Their confusion reveals failure modes you'd never have generated synthetically.
What to look for in research sessions
Capture the exact phrasing participants use. If three people say "I need to change my delivery" and your system only handles "update address," that phrasing gap is an eval.
When a participant approaches the feature with a different mental model than you designed for, test what the model does with their framing. It's almost always an edge case you haven't covered.
In a usability session, if someone gets an AI response and immediately looks confused or gives up, that response is a failure. The input that generated it is now a priority eval scenario.
Knowing when to stop
Three stages, three exit bars
"Good enough to move forward" means different things at different stages. Being explicit about which bar you're crossing prevents endless iteration and makes the handoff decision defensible.
Concept viability
Is this interaction pattern directionally right at all?
~60–70% of outputs in the right ballpark. Tone and structure not yet the focus.
Interaction design confidence
Does the feature feel right to interact with?
No ✗ scores on any dimension. Edge cases at least △ on reasoning and handling.
Handoff readiness
Can someone else build from what you've documented?
Named prompt version, documented scenario set, known failure modes recorded.
What you produce
The handoff artefacts
By the end of the loop, you don't hand over a Figma file and a brief. You hand over something genuinely useful to whoever builds this properly.
- A versioned system prompt with a documented reasoning trail: not just the final version, but the journey to it
- A scenario set that describes the interaction space (normal use, edge cases, failure-prone inputs)
- A scoring history that shows how quality evolved across versions
- A behavioural map of model tendencies (where it over-explains, hedges, breaks, or surprises)
- A list of known failure modes with notes on whether they're design problems or model limitations
- Any eval inputs derived from research, with a note on their source
Promptfoo — open source, runs locally with little setup; about as close as it gets to a designer-usable eval runner for the pre-engineering prototype phase.
Braintrust — where this kind of loop tends to go at production scale, once repeatable runs and team workflows matter.
The point of all this
You're not trying to build an eval infrastructure. You're trying to understand how a model behaves in the context of a specific feature, and to not lose that understanding when the prototype ends.
The loop makes your design thinking explicit, transferable, and defensible. It turns vibe-driven prototyping into something that actually compounds: each iteration building on the last, each version leaving a trace, each research session expanding what you know.
That knowledge is what closes the gap between prototype and production.