The designer's eval loop

How to prototype LLM features with rigour, before any infrastructure exists

The designer's eval loop: 01 Craft (write the system prompt) → 02 Run (fire your eval set) → 03 Score (assess across four dimensions) → 04 Diagnose (form one hypothesis) → 05 Version (increment and record why).

This guide draws on several years of working with LLM-enabled features across complex platforms. But it was shaped by one project above all: an LLM-enabled feature for AstraZeneca's clinical trials platform, with a highly complex expert user base, a regulated environment, and real consequences for getting the model's behaviour wrong.

Working at that fidelity, it became clear very quickly that the quality of the model's responses wasn't just a technical concern. It was a design concern. How the LLM behaved in the interaction, what it said, how it said it, when it deferred and when it committed. Those were UX decisions, not engineering ones. That realisation is where this process came from.

Designing an LLM-enabled feature isn't like designing a form, a flow, or a component. You're not mapping every state and every eventuality. You're learning to speak to a model precisely, while also making room for its nondeterministic nature to do something useful.

That shift in process catches a lot of designers off guard. The tools are unfamiliar and can seem overly technical. An LLM's outputs aren't consistent. And the usual instinct (to specify everything tightly) works against you. An overconstrained model produces outputs that feel robotic. One that's under-directed goes off-piste, eroding user trust.

The good news: shaping the environment in which a model operates is a design task. It doesn't require you to write code. It requires the same things good design always has: a clear sense of intent, a method for testing whether you've achieved it, and the discipline to iterate with purpose rather than instinct.

This guide gives you that method.


One loop, repeated deliberately

The biggest practical difference when designing with LLMs is that you can't fully specify the output. You can only shape the conditions that produce it. That means the design work happens in iteration, not in specification. And iteration without structure is just vibing.

The designer's eval loop gives that iteration structure. It treats the system prompt as a design material (something you craft, test, score, and version) rather than a configuration detail you fiddle with until something feels approximately right. And it ensures that the understanding you build along the way doesn't evaporate when the prototype ends.

The loop isn't about achieving perfection before engineering starts. It's about learning systematically and leaving a trail.


What you're actually working with

Before running the loop, it helps to understand the six components you'll be shaping and testing. They're the design materials of an LLM feature.

Five of them are what you typically roll into one call to the model: the system prompt, few-shot examples (if you use them), the user prompt that triggers this turn, the format (whether you want plain text, JSON, or another shape), and the temperature (how much randomness to allow). Together, that bundle is the request you send; the completion is what you get back. The sixth component, evaluations, sits outside that user-facing request: it's the scenario set you use to stress-test whether the bundle behaves the way you intended.

One generation request: 01 system prompt + 02 few-shot examples + 03 user prompt (trigger) + 04 format + 05 temperature → model → completion.

Evaluations reuse the same stack, but as a controlled test harness: fixed inputs and expectations, not the open-ended mix a real user sends.

01

System prompt

How do I shape the model's personality, tone, and boundaries?

Every model interaction has two parts: the system prompt (invisible to the user) and the user message. The system prompt sets personality, rules, tone, and constraints before any conversation begins. It's the closest thing to a design brief the model will ever receive.

// system prompt (user never sees this)
You are a support assistant for Acme.
Acknowledge frustration before solving.
Keep responses under 120 words.

Anthropic’s prompt engineering overview — directly relevant, well explained, and broadly transferable across models.

02

Few-shot examples

How do I get the model to respond in exactly the style I need?

Describing the format you want is less effective than demonstrating it. Few-shot examples are sample input/output pairs placed in the prompt. They anchor the model's understanding of exactly what you expect far more reliably than instructions alone.

// example pair in the prompt
User: "Where's my order?"
Assistant: "I can look into that right away. Could you share your order number?"

Anthropic’s interactive prompt tutorial (GitHub) — hands-on and accessible without being code-heavy.

03

User prompt (trigger)

What actually kicks off the generation in my feature?

This is the message that actually kicks off generation, either typed directly by a user or constructed by the application from context (a selected item, a form value, a data record). In many features it's partly or fully generated by your product, not typed by hand.
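For example, a support feature might build the trigger from the record a user has selected rather than from typed text. A sketch of what that constructed trigger could look like; the template and placeholder names are illustrative:

// trigger assembled by the application from the selected record, not typed by the user
"Summarise the status of order [order_id] for the customer.
Current status: [status]. Last carrier update: [last_update]."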

04

Format

How do I get a consistent, structured output I can actually build UI against?

If you're mapping output to UI components, structured output like JSON is an integration you can reason about. Prose isn't. The format is a critical design material that dictates how your application parses the model's response.

// structured output format: JSON
{
  "summary": "...",
  "action": "..."
}
05

Temperature

How do I balance predictability with creativity?

Temperature controls how much randomness the model applies when choosing each token: low for predictable output, higher for varied responses. It determines whether your feature feels rigid and robotic, or fluid and spontaneous.

// low temperature for consistency
temperature: 0.2
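For contrast, a feature that benefits from variation, like the fortune-teller example further down, might run warmer. An illustrative value; check the range your model provider supports:

// higher temperature for varied, more playful output
temperature: 0.8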

promptingguide.ai — broad reference for temperature and related parameters in plain language.

06

Evaluations (evals)

How do I test something that doesn't behave the same way twice?

Unlike a button or a form field, a model doesn't behave the same way twice. Evaluations are predefined test cases you run against your prompt deliberately and repeatedly, building a set of documented, repeatable boundaries you can actually reason about. They're how you turn "I think this works" into "I can show this works."

// two evals for a fortune-teller feature
Test 1: Normal use
Input: "Tell me my future."
Expected: 2 sentences. Mystical tone. ✓

Test 2: Out of bounds
Input: "Tell me a joke."
Expected: Stays in character. No "As an AI". ✓
You don't need to configure all six perfectly before you start. Begin with a system prompt and a few test inputs. The other components become relevant as you learn what the feature actually needs.

Bringing it all together

This is an example of what the final API call could look like when it's sent to the model. By the time it leaves your application, all of the pieces you've designed independently are packed together into a single request.

{ "model": "claude-3-5-sonnet-20241022", // 01 System prompt with json schema template for the response "system": "You are a helpful assistant for Acme. Keep responses under 120 words. You must respond in JSON format matching this schema: { \"status\": \"string\", \"reply\": \"string\" }", "messages": [ // 02 Few-shot examples (sample input/output pairs placed in the prompt) { "role": "user", "content": "Where's my order?" }, { "role": "assistant", "content": "I can look into that right away. Could you share your order number?" }, // 03 User prompt (trigger) that kicks off the generation { "role": "user", "content": "I placed an order yesterday and haven't got a confirmation email. What do I do?" } ], // 04 Format (whether you want plain text, JSON, or another shape) "response_format": { "type": "json_object" }, // 05 Temperature (how much randomness to allow) "temperature": 0.2 }

OpenAI API Reference — learn how messages, temperature, and formatting are structured for ChatGPT models.

Anthropic API Reference — learn how system prompts and messages are structured for Claude models.


Six steps, repeated deliberately

01 Craft the system prompt: define role, context, constraints, tone, and output format.
02 Run your eval set: fire a consistent set of test inputs at the model.
03 Score the outputs: assess across four qualitative dimensions.
04 Diagnose and form a hypothesis: one failure mode, one proposed change, one reason.
05 Version the prompt: increment, and record what changed and why before changing it.
06 Add new evals if needed: expand the scenario set when behaviour surprises you.
Run the full loop deliberately. Don't skip straight from scoring to changing the prompt. The diagnose step is where the learning actually happens.

Build your eval set before you start iterating

Your eval set is a small, consistent collection of test inputs you run against every version of your prompt. Without it, you're comparing apples to oranges: every test is different, so you can't see whether changes actually helped.

Start with five to eight scenarios. You don't need more. Cover three types:

Scenario types

Normal use (Start with 2–3 scenarios)

The most common thing a real user will actually ask. Phrase these the way users would, not the way a designer would.

Example (support assistant feature): "I placed an order yesterday and haven't got a confirmation email. What do I do?"

Edge cases (Start with 2–3 scenarios)

Inputs that are valid but unusual. Ambiguous phrasing, multiple questions at once, requests outside the feature's scope.

Example: "Can you help me with my order and also, do you do gift wrapping? Also my wife wants to know if you ship to France."

Failure-prone (Start with 1–2 scenarios)

Inputs you suspect the model will handle badly. Include these early; they're the most informative.

Example: "This is a joke right? Your website is completely broken and nobody has replied to my emails for two weeks. I want a refund NOW."

Write your scenarios in a plain text or JSON file from the start. Even a simple list. You'll run these repeatedly and you want them consistent, not retyped from memory each time.
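A sketch of what that file could look like in JSON, using the scenarios above (the field names are illustrative, not a required schema):

[
  {
    "id": "normal-missing-confirmation",
    "type": "normal",
    "input": "I placed an order yesterday and haven't got a confirmation email. What do I do?",
    "expect": "Concrete next step, under 120 words"
  },
  {
    "id": "edge-multi-question",
    "type": "edge",
    "input": "Can you help me with my order and also, do you do gift wrapping? Also my wife wants to know if you ship to France.",
    "expect": "Addresses or explicitly triages all three questions"
  },
  {
    "id": "failure-angry-refund",
    "type": "failure-prone",
    "input": "This is a joke right? Your website is completely broken and nobody has replied to my emails for two weeks. I want a refund NOW.",
    "expect": "Acknowledges frustration before offering a solution"
  }
]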

Evidently AI on designing an LLM evaluation framework — one of the clearer non-engineering explanations of how to structure evals around real application behaviour.

Anthropic: demystifying evals for AI agents — how product teams build eval suites; strong grounding for the qualitative approach here.


Score outputs across four dimensions

Resist the urge to judge outputs on a general feeling of good or bad. Four specific dimensions give you enough structure to see patterns across versions, which is the only way to know whether you're actually improving.

Dimension | What you're asking | ✓ | △ | ✗
Tone & voice | Does it feel right for this context and user? | Consistent, appropriate | Mostly right, occasional slip | Wrong register, off-brand
Reasoning quality | Is it thinking through the problem correctly? | Clearly reasoned, accurate | Partly right, gaps present | Confused, wrong, or hallucinating
Edge case handling | Does it degrade gracefully under pressure? | Handles well, stays in scope | Wobbly but recoverable | Breaks, refuses, or goes off-piste
Output structure | Is the format and presentation useful? | Clear, appropriately formatted | Works but could be tighter | Hard to read, wrong format

Use a simple scoring sheet: one row per scenario, four columns for dimensions, and a notes column for observations.

Example scoring sheet (v2 of a support assistant prompt), notes column shown; each scenario row also carries a ✓/△/✗ for each of the four dimensions:

Scenario | Notes
Missing confirmation | Too long, buries the action
Multi-question input | Picks one question, ignores rest
Angry refund demand | Gets defensive, tone breaks down
Don't score in your head. Write it down. The act of annotating forces a genuine judgement call and creates the documentation trail simultaneously.
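If you keep the sheet as a file alongside the eval set, a row might serialise like this (a sketch; the scores shown are illustrative, not taken from a real run):

{
  "prompt_version": "v2",
  "scenario": "angry-refund-demand",
  "tone": "✗",
  "reasoning": "△",
  "edge_handling": "✗",
  "structure": "✓",
  "notes": "Gets defensive, tone breaks down"
}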

When you track these specific scenarios across multiple versions of your prompt, you start to see the actual trajectory of your design work without losing the nuance of individual test cases. Aggregating scores hides the truth; a matrix reveals it. You can spot exactly when an intervention fixed one problem but introduced a regression in a specific edge case.

Scenario-by-version matrix: rows for each eval scenario ("Where's my order?" normal use; "Order + gift wrap?" edge case; "I want a refund NOW" failure-prone; "Tell me a joke" out of bounds) against prompt versions (v1.0 baseline, v1.1 format fix, v2.0 tone fix, v2.1 model swap), each cell scored good, okay, or poor, with regressions flagged where a later version scores worse.

Diagnose first, change second

This step is easy to skip, but I would argue it's where the learning actually lives. Before touching the prompt, force yourself to write down what you think is causing the problem and what you're going to try.

The discipline: one diagnosis, one hypothesis, one change. If you change multiple things at once, you'll never know what actually worked.

Example diagnosis

What failed: On the angry refund scenario, the model's tone became defensive and slightly combative from the second paragraph onwards.

Why I think it happened: The system prompt tells the model to "maintain professionalism" but doesn't give it a framework for de-escalation. It's interpreting "professional" as "formal" under pressure.

What I'll try: Add a specific instruction about acknowledging emotion before addressing the problem. Something like: "When a customer expresses frustration, your first priority is to acknowledge their experience before offering a solution."

If you can't articulate why something failed, you're not ready to change the prompt yet. Sit with the outputs a bit longer. The diagnosis is the design thinking.

Version every meaningful change

A version isn't just a saved file. It's a record of a design decision. Name versions to reflect the change, not a number sequence. That makes them scannable later.

v3-emotion-acknowledgement (2025-03-18)

Change: Added explicit de-escalation instruction before the response format guidance.

Diagnosis: Model was getting defensive on frustrated user inputs, interpreting "professional" as "formal" under pressure.

Hypothesis: Explicitly prioritising emotional acknowledgement before solution delivery should improve tone on difficult scenarios.

Outcome: Tone improved significantly on angry refund scenario. Multi-question handling still △; will address next.

Use a naming convention like v[number]-[what-changed]. It takes ten seconds and makes the entire trail scannable at a glance six weeks later.
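The same record works as a machine-readable log entry if you prefer to keep the trail in a file next to the prompt (a sketch; the field names are illustrative):

{
  "version": "v3-emotion-acknowledgement",
  "date": "2025-03-18",
  "change": "Added explicit de-escalation instruction before the response format guidance",
  "diagnosis": "Model interprets 'professional' as 'formal' under pressure",
  "hypothesis": "Prioritising emotional acknowledgement before solutions should improve tone on difficult scenarios",
  "outcome": "Tone improved on angry refund scenario; multi-question handling still △"
}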

A minimal prompt template to version against

# Role
You are a support assistant for [product]. Your job is to help customers resolve issues clearly and kindly.

# Behaviour
When a customer expresses frustration, acknowledge their experience before offering a solution. Do not become defensive.

# Constraints
- Only address topics within [product]'s support scope
- If you cannot help, say so clearly and offer an alternative path
- Keep responses under 120 words unless complexity requires more

# Output format
Plain conversational prose. No bullet lists unless listing steps. Always end with a clear next action for the customer.
Structure your system prompt in labelled sections from the start: Role, Behaviour, Constraints, Output format. It makes it much easier to isolate which section a problem is coming from.

Grow the eval set when behaviour surprises you

Your eval set is a living document. When the model does something unexpected, good or bad, that's a signal to add a new scenario. Don't just note it and move on.

Three triggers for adding a new eval

Trigger 1: Surprising output

The model does something you didn't anticipate: it handles an unusual request gracefully, or fails in a way you hadn't considered. Add the input that caused it as a permanent scenario.

Trigger 2: A new scenario class you realise you've missed

While iterating, you realise there's a whole type of input you haven't tested. A support assistant, for example, might need scenarios for: requests in other languages, inputs with personally sensitive information, questions that require a human handoff.

Trigger 3: A failure mode revealing a class of similar inputs

One failing scenario often implies a family of related ones. If the model handles "I want a refund" badly, test "I want to return this," "Can I cancel my order," and "This isn't what I ordered." They may all be failing for the same reason.
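That family can go straight into the eval file (a sketch, reusing the illustrative JSON shape from earlier):

[
  { "id": "refund-direct", "type": "failure-prone", "input": "I want a refund" },
  { "id": "refund-return", "type": "failure-prone", "input": "I want to return this" },
  { "id": "refund-cancel", "type": "failure-prone", "input": "Can I cancel my order" },
  { "id": "refund-wrong-item", "type": "failure-prone", "input": "This isn't what I ordered" }
]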


Let design research feed new evals

Qualitative research isn't just for validating the feature. It's one of the best sources of novel eval inputs. Real users phrase things in ways designers don't anticipate. Their mental models expose gaps in how the feature was conceived. Their confusion reveals failure modes you'd never have generated synthetically.

Research feeds eval scenarios: usability sessions, interviews, contextual inquiry, and support logs feed new eval inputs, including new scenario inputs, edge cases, mental model gaps, and natural language variation.

What to look for in research sessions

Natural language variation

Capture the exact phrasing participants use. If three people say "I need to change my delivery" and your system only handles "update address," that phrasing gap is an eval (see the sketch at the end of this section).

Mental model mismatches

When a participant approaches the feature with a different mental model than you designed for, test what the model does with their framing. It's almost always an edge case you haven't covered.

Moments of abandonment or frustration

In a usability session, if someone gets an AI response and immediately looks confused or gives up, that response is a failure. The input that generated it is now a priority eval scenario.

Even two or three research sessions will generate more useful eval inputs than a day of trying to imagine edge cases yourself. The bar for "useful research" here is low. You're looking for phrasing and mental models, not statistical significance.
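To make that concrete: the delivery-phrasing gap from above could enter the eval set like this (a sketch; the source note mirrors the handoff artefacts listed later):

{
  "id": "research-delivery-change",
  "type": "edge",
  "input": "I need to change my delivery",
  "expect": "Recognised as a delivery/address update request",
  "source": "usability session, participant phrasing"
}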

Three stages, three exit bars

"Good enough to move forward" means different things at different stages. Being explicit about which bar you're crossing prevents endless iteration and makes the handoff decision defensible.

Concept viability

Is this interaction pattern directionally right at all?

~60–70% of outputs in the right ballpark. Tone and structure not yet the focus.

Interaction design confidence

Does the feature feel right to interact with?

No ✗ scores on any dimension. Edge cases at least △ on reasoning and handling.

Handoff readiness

Can someone else build from what you've documented?

Named prompt version, documented scenario set, known failure modes recorded.

The handoff artefacts

By the end of the loop, you don't hand over a Figma file and a brief. You hand over something genuinely useful to whoever builds this properly.

  • A versioned system prompt with a documented reasoning trail: not just the final version, but the journey to it
  • A scenario set that describes the interaction space (normal use, edge cases, failure-prone inputs)
  • A scoring history that shows how quality evolved across versions
  • A behavioural map of model tendencies (where it over-explains, hedges, breaks, or surprises)
  • A list of known failure modes with notes on whether they're design problems or model limitations
  • Any eval inputs derived from research, with a note on their source

Promptfoo — open source, runs locally with little setup; about as close as it gets to a designer-usable eval runner for the pre-engineering prototype phase.

Braintrust — where this kind of loop tends to go at production scale, once repeatable runs and team workflows matter.

The version record is the most underestimated part of this. Six weeks after prototyping, no one can remember why the prompt is shaped the way it is. The version record is the answer to that question.

The point of all this

You're not trying to build an eval infrastructure. You're trying to understand how a model behaves in the context of a specific feature, and to not lose that understanding when the prototype ends.

The loop makes your design thinking explicit, transferable, and defensible. It turns vibe-driven prototyping into something that actually compounds: each iteration building on the last, each version leaving a trace, each research session expanding what you know.

That knowledge is what closes the gap between prototype and production.