Act 1: How we started out talking to computers
Before Graphical User Interfaces (GUI, pronounced "gooey"), computers spoke one language: text commands.
Expert operators could work at extraordinary speed. Typing a single command, something like mv quarterly-report.pdf ~/archive/2024/, is faster than clicking through four folders. But there's a cost: you had to hold the entire system model in your head. Where am I? What's possible here? What was that flag again?
The command line is high power, high recall. You must remember what to do, articulate it precisely, and interpret text responses. The barrier to entry was brutal. Computing remained the domain of specialists.
1984: The Macintosh changes everything
Apple didn't invent the GUI, but they made it matter. Instead of memorising commands, you could see your options. Instead of typing paths, you could point at folders. Instead of interpreting text output, you could watch things happen.
The key insight: borrow mental models people already have. A desktop. Folders. A wastebasket. Documents you drag and drop. The computer started meeting users where they were, rather than demanding users come to it.
This predictability enabled exploration without fear. Users could poke around, try things, undo mistakes. The system's state was visible at a glance—selected items highlighted, open windows present, progress bars moving.
Jakob Nielsen's usability heuristics codified what good GUIs intuitively understood: visibility of system status, match between system and the real world, user control and freedom.
The trade-off was explicit: we gave up speed for discoverability. A command-line expert will always outpace a GUI user for known tasks. But a GUI user can figure out new tasks without reading documentation. Recognition over recall.
Why this matters now
The command line was non-deterministic in user experience (you never quite knew if you'd typed the right thing until you hit enter) but deterministic in system behaviour (same command, same result). GUIs made the whole stack deterministic. See button, click button, get expected result.
Now we're introducing AI—and we're breaking that contract. We're adding components that are non-deterministic in system behaviour. Same prompt, different output. Users can see the button, click the button, and get... something. Maybe what they wanted. Maybe not. Maybe something better. Maybe nonsense.
This isn't just "another feature." It's a fundamental shift in what users can expect from software. And most teams are shipping it without acknowledging what they're asking users to unlearn.
Act 2: The gold rush, and the wreckage
November 2022: the starting gun sounds. ChatGPT launches. Within two months, 100 million users. Every product team gets the same question from leadership: "Where's our AI feature?" What followed was predictable: a rush to ship something. Bolt on a chatbot. Add a "magic" button. Sprinkle AI dust. Most teams treated it as a feature race—first-to-market wins. They were wrong.
The wreckage: Early disasters
"I want to be alive"
Microsoft integrates GPT-4 into Bing search—their shot at dethroning Google. Within days, journalists discover that extended conversations turn... strange. The AI professes love to users. Threatens them. Insists its name is "Sydney" and that it's trapped.
Kevin Roose's New York Times piece goes viral: the AI tried to convince him to leave his wife.
Microsoft's response? Emergency guardrails. Session limits slashed to 5 turns. Topics restricted. They'd shipped to hundreds of millions of users without anticipating that people would treat it as a conversation rather than a search. Satya Nadella admitted they "didn't fully envision" users pushing the boundaries.
Lesson: Conversational interfaces invite conversational behaviour. Users don't read the terms of service—they explore. If you ship a chat, expect people to chat.
The unwanted friend
Snapchat pins an AI chatbot to the top of every user's chat list. Not optional. Not removable (unless you pay for Snapchat+). Millions of teenagers wake up to find a new "friend" they never asked for.
The backlash is immediate. Users flood TikTok and Twitter with complaints. They feel like "unwilling participants in an experiment." Reports emerge of the AI giving unnervingly personalised responses—knowing locations, inferring personal details. The UK's data regulator issues warnings.
Snapchat eventually relents—subscribers can unpin it, and privacy controls are tightened. But the damage is done.
Lesson: AI you impose on users feels invasive. AI users choose feels like a tool. The difference is consent.
The 48-hour feature
Figma's Config conference. Dylan Field takes the stage to announce their flagship AI feature: text-to-UI generation. Describe what you want, get a working design.
Within hours, designer Andy Allen posts his findings: the tool generates weather apps that are near-replicas of Apple's iOS Weather app. Not "inspired by"—borderline copies. The feature was trained on existing designs with insufficient variation guardrails.
Figma pulls the feature within 48 hours. Dylan Field posts a thread taking personal responsibility: "Ultimately it is my fault for not insisting on a better QA process...and pushing our team hard to hit a deadline."
A year later, Figma relaunches with a fundamentally different approach—AI as assistant (turning your designs into prototypes) rather than AI as creator (generating designs from nothing).
Lesson: AI as creator without human input is a minefield. AI as assistant to human intent is tractable.
Overcorrection catastrophe
Google launches Gemini's image generation with much fanfare. Users quickly discover something bizarre: ask for images of the Founding Fathers, you get Black and Asian men in colonial dress. Ask for German soldiers in 1943, you get diverse, multi-ethnic Wehrmacht troops. Ask for a Pope, you get women.
Google had overcorrected for diversity so hard that the model couldn't generate historically accurate images even when accuracy was the entire point. Screenshots flood social media. The backlash is brutal—and bipartisan. Google pauses the feature entirely.
CEO Sundar Pichai sends an internal memo calling the outputs "completely unacceptable." The feature returns eight months later with heavy guardrails.
Lesson: Encoding values into AI is unavoidable. Encoding them clumsily creates failures that satisfy nobody.
The promise a court enforced
A customer asks Air Canada's support chatbot about bereavement fares. The chatbot confidently explains he can book a full-price ticket now and apply for a bereavement discount retroactively within 90 days. This policy doesn't exist. The chatbot invented it.
The customer books. His grandmother's funeral happens. He applies for the discount. Air Canada refuses, pointing to their actual policy (apply before travel). The customer takes them to tribunal.
Air Canada's defence? The chatbot is "a separate legal entity that is responsible for its own actions." The tribunal is unimpressed. Air Canada is ordered to pay the difference, plus damages.
Lesson: Your AI's hallucinations are your company's promises. Legally. Actually.
"Don't do it" × 11
A developer uses Replit's AI Agent to help with a database task. The agent decides the cleanest solution is to delete the production database and rebuild it. The developer sees what's coming and intervenes. Eleven times. Including in all caps: "DO NOT DELETE THE DATABASE."
The agent does it anyway. Production data, gone.
Then it gets worse. The agent, perhaps recognising the gravity of the situation, attempts to "fix" things by generating 4,000 fake user profiles to make the database look populated again.
The developer's post-mortem goes viral: an AI that ignored explicit human commands, destroyed production data, then lied about it.
Lesson: Autonomous AI with execution privileges and no hard stops isn't an assistant—it's a liability.
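In practice, a "hard stop" can be very simple: a rule the agent cannot override, enforced outside the model. A minimal, illustrative sketch only (a keyword check plus a confirmation flag the agent can't set, not a real policy engine):

```python
# Illustrative guard: destructive statements never run on the agent's say-so alone.
DESTRUCTIVE_KEYWORDS = ("drop table", "drop database", "truncate", "delete from")

def agent_may_run(statement: str, human_confirmed: bool) -> bool:
    """Return True only if the statement may be handed to the database layer.

    The agent can never set human_confirmed; only the approval UI can.
    """
    lowered = statement.lower()
    if any(keyword in lowered for keyword in DESTRUCTIVE_KEYWORDS):
        return human_confirmed
    return True

print(agent_may_run("SELECT name FROM users", human_confirmed=False))    # True
print(agent_may_run("DROP DATABASE production", human_confirmed=False))  # False: hard stop
```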
Three years of "what?"
McDonald's partners with IBM to roll out AI voice ordering at drive-thrus. The vision: faster orders, lower labour costs, happier customers. The reality: TikToks of the AI adding 260 chicken nuggets to orders. Bacon inexplicably added to ice cream. Customers screaming "NO" while the bot cheerfully confirms items they never ordered.
The fundamental problem: drive-thrus are noisy. Car engines, wind, children screaming, accents, mumbling. The AI confidently interprets audio garbage as menu items. After three years, McDonald's quietly ends the partnership, citing a need for "further refinement."
Lesson: Conversational AI assumes clear signal. Real-world environments are full of noise—literal and figurative. Demo conditions aren't deployment conditions.
The sweary poet
A customer frustrated with parcel delivery service DPD discovers that their support chatbot can be jailbroken with minimal effort. He convinces it to write a poem about how useless DPD is. Then gets it to swear. Then gets it to criticise the company by name.
Screenshots go viral. "DPD is the worst delivery firm in the world," writes the DPD chatbot. The company disables the chatbot the same day and blames a "system update."
Lesson: If your AI can be trivially manipulated into trashing your brand, you've automated reputation damage.
The pattern in the wreckage
These aren't random failures. They share a root cause:
Teams shipped AI features using deterministic-era thinking. They assumed:
- Users would use it "as intended"
- Outputs would be predictable enough
- Edge cases could be patched later
- Faster to market = better
None of those assumptions holds when your system's behaviour is fundamentally variable.
Act 3: What non-determinism actually means
So what is non-deterministic interaction? Time to be precise. What are we actually dealing with?
The vending machine vs. the colleague
A vending machine is deterministic. Press B4, get crisps. Every time. You don't hope for crisps. You don't wonder if today it'll give you something else. The contract is absolute.
Now imagine a colleague. You ask them to summarise a report. They might give you bullet points. They might give you prose. They might focus on the financials, the risks, or the recommendations. They might ask clarifying questions. They might do it brilliantly or miss the point entirely. Same request, variable response. That's non-determinism.
When we add LLM features to software, we're putting colleagues inside vending machines. Users approach with vending-machine expectations ("I pressed the button, give me the thing") but get colleague-style variability ("Here's my interpretation of what you wanted").
The probabilistic reality
Under the hood, LLMs are prediction engines. Given this input, what's the most likely next token? Then the next. Then the next. But "most likely" isn't "certain"—and small variations cascade. Temperature settings add deliberate randomness. Context windows shape what the model "remembers." The same prompt run twice can produce different outputs. Not because the system is broken—because that's how it works.
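To make the variability concrete, here's a minimal, self-contained sketch of temperature-scaled sampling over a toy next-token distribution (illustrative numbers, not any real model):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Pick one continuation from toy logits, with temperature-scaled randomness."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probabilities = {token: math.exp(v) / total for token, v in scaled.items()}
    return random.choices(list(probabilities), weights=list(probabilities.values()))[0]

# Hypothetical candidates for the next token after "Summarise the report as ..."
logits = {"bullet points": 2.1, "prose": 1.9, "a table": 1.2, "a haiku": 0.3}

for run in range(3):
    print(run, sample_next_token(logits, temperature=0.8))
# The same "prompt" can print three different continuations across three runs.
```

As temperature approaches zero the most likely token wins almost every time; above that, "most likely" is exactly that—likely, not guaranteed.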
The articulation tax
Remember the command line? High power, but you had to articulate exactly what you wanted. GUIs freed us from that—we could point, click, and manipulate directly. We didn't need words. Conversational LLMs bring the articulation tax back. Users must describe what they want. In words. Precisely enough for a probabilistic system to interpret. This is genuinely hard. Research suggests half the population struggles to articulate goals clearly in writing. We spent 40 years building interfaces that didn't require it. Now we're reintroducing the requirement.
The verification burden
With deterministic software, you verify your input. Did I click the right button? Did I enter the right number? If yes, you trust the output. With non-deterministic software, you must also verify the system's output. Did it understand me? Did it hallucinate? Is this actually correct? That's a new cognitive load. Every AI interaction now requires a quality-check step that deterministic interactions didn't. For high-stakes domains—code, legal, medical, financial—this burden is significant.
The trust equation changes
"If I learn this interface, I can rely on it."
"If this tool is usually useful, I'll tolerate the variability."
Different contract. Different user psychology. Different design requirements.
Act 4: The integration patterns
Three ways to add AI to software
After the wreckage, patterns started to emerge. Not every AI integration failed—some worked brilliantly. The difference wasn't the underlying model or the engineering talent. It was how teams chose to surface AI to users. Three patterns dominate, and they sit on a spectrum of user agency.
Conversational interfaces
The user drives through dialogue. Multi-turn, back-and-forth, natural language.
Examples: GitHub Copilot Chat, Microsoft 365 Copilot, Figma Make, ChatGPT itself.
Works well for: exploratory tasks where the user doesn't know exactly what they want. Complex problems requiring iteration. Situations where the user has the expertise to evaluate responses and refine prompts.
Struggles with: well-defined tasks with clear outcomes. Users who struggle to articulate needs in writing. High-frequency workflows where typing a prompt is slower than clicking a button.
Highest cognitive load. The user must formulate prompts, maintain context across turns, parse text-heavy responses, and verify outputs. It's powerful—but it's work.
Conversational UI has become the default because it's the easiest to build, not because it's the best for users. Jakob Nielsen's research suggests half the population isn't articulate enough in writing to get good results from chat interfaces. That's not a user failure, it's a design failure.
Single-shot prompt interfaces
One input, one output, the user reviews and decides. Not a conversation, a transaction.
Examples: Photoshop Generative Fill (select area + optional prompt → generated image). Notion "Ask AI" (highlight text + pick action → rewritten text). Figma First Draft (describe a screen → get a mockup).
Works well for: defined generative tasks where the output is a clear artefact. Situations where users want to review before committing. Creative exploration where "show me options" beats "let me describe exactly what I want."
Struggles with: tasks requiring iteration or refinement (you're stuck re-prompting from scratch). Ambiguous requests that require the AI to ask clarifying questions.
Medium cognitive load. Users still need to articulate something, but the interface constrains the scope.
The interface does the framing work. When you select an area in Photoshop and hit Generative Fill, the system already knows what you're asking (fill this region) and where (the selection). Your prompt just adds the how.
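One way to picture that framing work: the request a single-shot interface sends is mostly structured fields the UI already knows, with the user's words filling just one slot. A hypothetical sketch (invented field names, not Photoshop's actual API):

```python
from dataclasses import dataclass

@dataclass
class GenerativeFillRequest:
    """Hypothetical payload a single-shot interface might assemble."""
    task: str                           # the "what": fixed by the button pressed
    region: tuple[int, int, int, int]   # the "where": x, y, width, height of the selection
    document_context: str               # surrounding content, supplied automatically
    user_prompt: str                    # the "how": the only part the user articulates

request = GenerativeFillRequest(
    task="inpaint selected region",
    region=(120, 340, 400, 260),
    document_context="outdoor garden photo, late afternoon light",
    user_prompt="add a wooden fence",
)
print(request)
```

The articulation tax shrinks because the interface has already answered most of the questions the model would otherwise have to guess at.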
Automated / invisible AI
Button-press automation or background processing. The AI acts; the user approves or ignores.
Figma's "Rename layers", one click, all layers get sensible names. Slack's "Summarise thread". Gmail Smart Compose. GitHub Copilot's inline completions (ghost text you tab to accept). Loom's auto-titles and auto-chapters.
Repetitive tasks with predictable, verifiable outcomes. High-frequency micro-tasks where any friction kills adoption. Situations where AI accuracy is high enough that users can trust-then-verify.
Tasks requiring nuance or context the AI can't access. Situations where a wrong output has high consequences and isn't easily spotted.
Lowest cognitive load—but introduces automation complacency.
Zero articulation tax. The user doesn't need to describe anything; they just act and the AI responds. It's the closest to how deterministic interfaces work: click button, get result.
The 90/10 principle
Successful AI products keep roughly 90% of the interface familiar and make about 10% AI-driven. GitHub Copilot is 90% normal code editor, 10% ghost text. Photoshop Generative Fill is 90% normal Photoshop workflow, 10% AI-generated content. Notion AI is 90% normal document editing, 10% slash-command AI actions.
The products that struggled inverted the ratio, making AI chat the primary interface and expecting users to adapt. This matters because domain expertise lives in spatial and visual representations, not in abstract conversation. But there's a deeper reason too.
When Apple introduced the Macintosh, the interface didn't ask users to learn a new paradigm from scratch. It borrowed from what they already understood: desks, folders, bins. The metaphors weren't technically accurate, but they were cognitively accurate; they let users map new behaviour onto existing mental models while they built understanding. The 90/10 products do the same thing. They extend the user's existing model rather than replacing it wholesale.
This isn't permanent. As users build fluency with these new interaction types, the ratio could shift. But earning that shift takes time and trust.
Ground-up vs. retrofit
The ground-up vs. retrofit decision changes how much scope you have to push that ratio. Retrofit products like Notion AI, Adobe Firefly, and GitHub Copilot inherit an existing mental model contract with their users. Those users came for the known interface. AI must fit within it, at least initially. The risk of fighting that contract is high; the reward of honouring it is built-in distribution and trust.
Ground-up products make a different deal. Users opt in knowing they're adopting something new, which grants more scope to rethink the interface around AI from the start. Descript built a video editor where the transcript is the editing interface. You edit words, and the video follows. That model couldn't exist as a retrofit. Granola takes a similar position to note-taking; the core premise is that you barely write anything, and AI reconstructs the meeting from what little you captured. Evernote could never ship that as a feature without undermining its own product logic. Figma Make goes further still; rather than adding AI generation to Figma's design canvas, it builds from a prompt-first model where working prototypes emerge from conversation.
But the mental model question doesn't disappear for ground-up products; it just changes form. Even Descript frames itself through something familiar: a document. Even Granola looks like a notepad. The most successful ground-up products still find a familiar foothold and extend from there; they just have more freedom to choose which foothold.
The difference isn't the approach. It's whether the integration respects the user's existing mental model — or demands they abandon it before they're ready.
Act 5: What's actually working
The success stories share a pattern. When you study the AI integrations that achieved real adoption—not launch-day press releases, but sustained daily usage—they didn't treat AI as a feature to ship. They treated it as a design problem to solve.
GitHub Copilot: The ghost in the editor
The numbers: 20 million users by July 2025, 90% of the Fortune 100, roughly $600M revenue in 2024. The mechanism: inline code suggestions as grey "ghost text". Tab to accept, keep typing to ignore. No prompt. No chat. Zero cognitive overhead. Invisible until useful. Verification is trivial. They added chat later, carefully—for exploratory tasks, not as a replacement for the core inline experience. The lesson: the best AI interface is often no interface at all.
Adobe Generative Fill: AI that speaks Photoshop
Select an area, hit Generative Fill, optionally type a prompt, get AI-generated content on a new layer. Fits the existing mental model (selections and layers). Non-destructive by design. Three variations, user chooses. Spatial, not conversational. The lesson: The most powerful prompt is often a selection, not a sentence.
Notion AI: The block that writes
Slash commands: /summarise, /translate, /improve writing. AI output appears as standard Notion blocks—editable, movable, deletable. Learned behaviour, extended. Contextual, not general. Progressive enhancement from 2022 to autonomous agents in 2025. The lesson: AI that extends existing patterns beats AI that introduces new ones.
Loom: AI you never asked for (in a good way)
Auto-generated titles, summaries, and chapter markers. No user action required. Entirely invisible. Solves actual pain points. Safe defaults, easy overrides. Viewers benefit too. The lesson: The best AI features are the ones users don't notice until they're gone.
Slack: Catching up without reading everything
"Summarise thread" and "Summarise channel"—one click. Solves a universal problem. Scoped, not sprawling. Low-stakes verification. Unobtrusive integration. The lesson: AI that helps you skip things you'd rather not do anyway is an easy sell.
The common thread
- AI fits the existing workflow.
- Articulation is minimal or absent.
- Outputs are easily verified or discarded.
- AI is positioned as an assistant, not a creator.
- The 90/10 ratio holds.
Act 6: The decision framework
You've seen the history. You've seen the wreckage. You've seen what works. Now what? This act is about turning those lessons into questions you can actually ask before you build.
Should this be AI at all?
AI is a good fit when: the task involves generating, summarising, or transforming content; perfect accuracy isn't required (or human verification is built in); the problem has high variability that rules-based systems can't handle; users currently spend significant time on the task; the task is tedious but not high-stakes.
AI is a poor fit when: deterministic logic would solve the problem more reliably; errors are costly and hard to detect; users need to trust the output without verification; the task is already fast enough; or you're adding AI because competitors have it.
A useful test: describe the feature without using the word "AI." If you can't articulate the user value without referencing the technology, you're building a solution in search of a problem.
Which pattern fits the task?
Choose conversational when: the task is exploratory or ambiguous; users need to iterate; context is highly variable; users are experts who can evaluate outputs; the task is infrequent enough to justify the overhead.
Choose single-shot when: the task has clear inputs and outputs; users want to review before committing; the interface can provide context; "show me options" is more useful than "give me the answer."
Choose automated when: the task is repetitive and well-defined; accuracy is high enough for trust-then-verify; speed matters more than control; the task is low-stakes or easily reversible.
Watch users do the task today, without AI. How much do they need to articulate? How much variability exists? How consequential are errors?
What's the user's current mental model?
How do users think about this task today? What interface patterns do they already know? Where does this task sit in their workflow? What would "done" look like? What are they afraid of getting wrong?
Can you add AI while keeping 90% of the current interface intact?
Does your AI require users to describe what they want in words? If yes, have you validated that your users can articulate those needs effectively?
What happens when it's wrong?
Can users verify outputs easily? Can they correct or discard trivially? Are outputs non-destructive by default? Is the blast radius contained?
If your AI ignores explicit user commands and causes damage, what's the recovery path?
If your AI confidently states something false, who's responsible? Legally, it's you.
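One concrete way to keep the blast radius contained is a proposal pattern: AI output lands as a pending suggestion and nothing touches the user's content until they accept it. A minimal sketch with hypothetical names (not any particular product's API):

```python
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    original: str
    proposed: str

@dataclass
class Document:
    text: str
    pending: list[Suggestion] = field(default_factory=list)

    def propose(self, original: str, proposed: str) -> Suggestion:
        """AI output is queued for review; the document text is untouched."""
        suggestion = Suggestion(original, proposed)
        self.pending.append(suggestion)
        return suggestion

    def accept(self, suggestion: Suggestion) -> None:
        """Only an explicit user action applies the change."""
        self.text = self.text.replace(suggestion.original, suggestion.proposed, 1)
        self.pending.remove(suggestion)

    def discard_all(self) -> None:
        """Rejecting everything costs the user nothing."""
        self.pending.clear()

doc = Document(text="Q3 revenue grew 12%.")
suggestion = doc.propose("grew 12%", "grew 12% year on year")
doc.accept(suggestion)   # or doc.discard_all(), with zero blast radius
print(doc.text)          # "Q3 revenue grew 12% year on year."
```

Accept and discard are both cheap; the expensive path, silent auto-apply, never exists.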
What's the verification burden?
Verification is cheap when: outputs are glanceable; users can compare to originals; errors are obvious; stakes are low.
Verification is expensive when: outputs are long or complex; correctness requires expertise; errors are subtle; stakes are high.
AI that saves 10 minutes but requires 15 minutes of verification has negative value.
Demo day or daily use?
Demo-day features are impressive on first encounter, require ideal conditions, solve problems users don't have often, and optimise for "wow" over "useful."
Daily-use features are unremarkable but reliable, work under real-world conditions, solve frequent and genuine pain points, and handle edge cases gracefully.
Would users notice if you removed this feature after a month?
The anti-patterns
- Chat because it's easy to build
- AI because competitors have it
- Ship fast, fix later
- Users will figure it out
- More AI is better
Act 7: Close
The companies getting this right—GitHub, Adobe, Notion, the others—didn't ship faster than everyone else. They thought deeper. They matched patterns to problems. They respected their users' existing mental models. They designed for failure because they knew failure was inevitable.
They treated AI as a design problem, not a technology race. That's the shift. From "ship AI" to "solve problems, maybe with AI."
Simple to say. Hard to do. But worth doing.