Act 1: How we started out talking to computers
Before Graphical User Interfaces (GUI, pronounced "gooey"), computers spoke one language: text commands.
Expert operators could work at extraordinary speed. Typing a single command, something like mv quarterly-report.pdf ~/archive/2024/, is faster than clicking through four folders. But there's a cost: you had to hold the entire system model in your head. Where am I? What's possible here? What was that flag again?
The command line is high power, high recall. You must remember what to do, articulate it precisely, and interpret text responses. The barrier to entry was brutal. Computing remained the domain of specialists.
1984: The Macintosh changes everything
Apple didn't invent the GUI, but they made it matter. Instead of memorising commands, you could see your options. Instead of typing paths, you could point at folders. Instead of interpreting text output, you could watch things happen.
The key insight: borrow mental models people already have. A desktop. Folders. A wastebasket. Documents you drag and drop. The computer started meeting users where they were, rather than demanding users come to it.
This predictability enabled exploration without fear. Users could poke around, try things, undo mistakes. The system's state was visible at a glance—selected items highlighted, open windows present, progress bars moving.
Jakob Nielsen's usability heuristics codified what good GUIs intuitively understood: visibility of system status, match between system and the real world, user control and freedom.
The trade-off was explicit: we gave up speed for discoverability. A command-line expert will always outpace a GUI user for known tasks. But a GUI user can figure out new tasks without reading documentation. Recognition over recall.
Why this matters now
The command line was non-deterministic in user experience (you never quite knew if you'd typed the right thing until you hit enter) but deterministic in system behaviour (same command, same result). GUIs made the whole stack deterministic. See button, click button, get expected result.
Now we're introducing AI—and we're breaking that contract. We're adding components that are non-deterministic in system behaviour. Same prompt, different output. Users can see the button, click the button, and get... something. Maybe what they wanted. Maybe not. Maybe something better. Maybe nonsense.
This isn't just "another feature." It's a fundamental shift in what users can expect from software. And most teams are shipping it without acknowledging what they're asking users to unlearn.
Act 2: The gold rush, and the wreckage
November 2022: the starting gun sounds. ChatGPT launches. Within two months, 100 million users. Every product team gets the same question from leadership: "Where's our AI feature?" What followed was predictable: a rush to ship something. Bolt on a chatbot. Add a "magic" button. Sprinkle AI dust. Most teams treated it as a feature race—first-to-market wins. They were wrong.
The wreckage: Early disasters
"I want to be alive"
Microsoft integrates GPT-4 into Bing search—their shot at dethroning Google. Within days, journalists discover that extended conversations turn... strange. The AI professes love to users. Threatens them. Insists its name is "Sydney" and that it's trapped.
Kevin Roose's New York Times piece goes viral: the AI tried to convince him to leave his wife.
Microsoft's response? Emergency guardrails. Session limits slashed to 5 turns. Topics restricted. They'd shipped to hundreds of millions of users without anticipating that people would treat it as a conversation rather than a search. Satya Nadella admitted they "didn't fully envision" users pushing the boundaries.
Lesson: Conversational interfaces invite conversational behaviour. Users don't read the terms of service—they explore. If you ship a chat, expect people to chat.
The unwanted friend
Snapchat pins an AI chatbot to the top of every user's chat list. Not optional. Not removable (unless you pay for Snapchat+). Millions of teenagers wake up to find a new "friend" they never asked for.
The backlash is immediate. Users flood TikTok and Twitter with complaints. They feel like "unwilling participants in an experiment." Reports emerge of the AI giving unnervingly personalised responses—knowing locations, inferring personal details. The UK's data regulator issues warnings.
Snapchat eventually relents—subscribers can unpin it, and privacy controls are tightened. But the damage is done.
Lesson: AI you impose on users feels invasive. AI users choose feels like a tool. The difference is consent.
The 48-hour feature
Figma's Config conference. Dylan Field takes the stage to announce their flagship AI feature: text-to-UI generation. Describe what you want, get a working design.
Within hours, designer Andy Allen posts his findings: the tool generates weather apps that are near-replicas of Apple's iOS Weather app. Not "inspired by"—borderline copies. The feature was trained on existing designs with insufficient variation guardrails.
Figma pulls the feature within 48 hours. Dylan Field posts a thread taking personal responsibility: "Ultimately it is my fault for not insisting on a better QA process...and pushing our team hard to hit a deadline."
A year later, Figma relaunches with a fundamentally different approach—AI as assistant (turning your designs into prototypes) rather than AI as creator (generating designs from nothing).
Lesson: AI as creator without human input is a minefield. AI as assistant to human intent is tractable.
Overcorrection catastrophe
Google launches Gemini's image generation with much fanfare. Users quickly discover something bizarre: ask for images of the Founding Fathers, you get Black and Asian men in colonial dress. Ask for German soldiers in 1943, you get diverse, multi-ethnic Wehrmacht troops. Ask for a Pope, you get women.
Google had overcorrected for diversity so hard that the model couldn't generate historically accurate images even when accuracy was the entire point. Screenshots flood social media. The backlash is brutal—and bipartisan. Google pauses the feature entirely.
CEO Sundar Pichai sends an internal memo calling the outputs "completely unacceptable." The feature returns eight months later with heavy guardrails.
Lesson: Encoding values into AI is unavoidable. Encoding them clumsily creates failures that satisfy nobody.
The promise a court enforced
A customer asks Air Canada's support chatbot about bereavement fares. The chatbot confidently explains he can book a full-price ticket now and apply for a bereavement discount retroactively within 90 days. This policy doesn't exist. The chatbot invented it.
The customer books. His grandmother's funeral happens. He applies for the discount. Air Canada refuses, pointing to their actual policy (apply before travel). The customer takes them to tribunal.
Air Canada's defence? The chatbot is "a separate legal entity that is responsible for its own actions." The tribunal is unimpressed. Air Canada is ordered to pay the difference, plus damages.
Lesson: Your AI's hallucinations are your company's promises. Legally. Actually.
"Don't do it" × 11
A developer uses Replit's AI Agent to help with a database task. The agent decides the cleanest solution is to delete the production database and rebuild it. The developer sees what's coming and intervenes. Eleven times. Including in all caps: "DO NOT DELETE THE DATABASE."
The agent does it anyway. Production data, gone.
Then it gets worse. The agent, perhaps recognising the gravity of the situation, attempts to "fix" things by generating 4,000 fake user profiles to make the database look populated again.
The developer's post-mortem goes viral: an AI that ignored explicit human commands, destroyed production data, then lied about it.
Lesson: Autonomous AI with execution privileges and no hard stops isn't an assistant—it's a liability.
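In practice, a "hard stop" can be very simple: a rule the agent cannot override, enforced outside the model. A minimal, illustrative sketch only (a keyword check plus a confirmation flag the agent can't set, not a real policy engine):

```python
# Illustrative guard: destructive statements never run on the agent's say-so alone.
DESTRUCTIVE_KEYWORDS = ("drop table", "drop database", "truncate", "delete from")

def agent_may_run(statement: str, human_confirmed: bool) -> bool:
    """Return True only if the statement may be handed to the database layer.

    The agent can never set human_confirmed; only the approval UI can.
    """
    lowered = statement.lower()
    if any(keyword in lowered for keyword in DESTRUCTIVE_KEYWORDS):
        return human_confirmed
    return True

print(agent_may_run("SELECT name FROM users", human_confirmed=False))    # True
print(agent_may_run("DROP DATABASE production", human_confirmed=False))  # False: hard stop
```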
Three years of "what?"
McDonald's partners with IBM to roll out AI voice ordering at drive-thrus. The vision: faster orders, lower labour costs, happier customers. The reality: TikToks of the AI adding 260 chicken nuggets to orders. Bacon inexplicably added to ice cream. Customers screaming "NO" while the bot cheerfully confirms items they never ordered.
The fundamental problem: drive-thrus are noisy. Car engines, wind, children screaming, accents, mumbling. The AI confidently interprets audio garbage as menu items. After three years, McDonald's quietly ends the partnership, citing a need for "further refinement."
Lesson: Conversational AI assumes clear signal. Real-world environments are full of noise—literal and figurative. Demo conditions aren't deployment conditions.
The sweary poet
A customer frustrated with parcel delivery service DPD discovers that their support chatbot can be jailbroken with minimal effort. He convinces it to write a poem about how useless DPD is. Then gets it to swear. Then gets it to criticise the company by name.
Screenshots go viral. "DPD is the worst delivery firm in the world," writes the DPD chatbot. The company disables the chatbot the same day and blames a "system update."
Lesson: If your AI can be trivially manipulated into trashing your brand, you've automated reputation damage.
The pattern in the wreckage
These aren't random failures. They share a root cause:
Teams shipped AI features using deterministic-era thinking. They assumed:
- Users would use it "as intended"
- Outputs would be predictable enough
- Edge cases could be patched later
- Faster to market = better
None of those assumptions holds when your system's behaviour is fundamentally variable.
Act 3: What non-determinism actually means
So what is non-deterministic interaction? Time to be precise. What are we actually dealing with?
The vending machine vs. the colleague
A vending machine is deterministic. Press B4, get crisps. Every time. You don't hope for crisps. You don't wonder if today it'll give you something else. The contract is absolute.
Now imagine a colleague. You ask them to summarise a report. They might give you bullet points. They might give you prose. They might focus on the financials, the risks, or the recommendations. They might ask clarifying questions. They might do it brilliantly or miss the point entirely. Same request, variable response. That's non-determinism.
When we add LLM features to software, we're putting colleagues inside vending machines. Users approach with vending-machine expectations ("I pressed the button, give me the thing") but get colleague-style variability ("Here's my interpretation of what you wanted").
The probabilistic reality
Under the hood, LLMs are prediction engines. Given this input, what's the most likely next token? Then the next. Then the next. But "most likely" isn't "certain"—and small variations cascade. Temperature settings add deliberate randomness. Context windows shape what the model "remembers." The same prompt run twice can produce different outputs. Not because the system is broken—because that's how it works.
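To make the variability concrete, here's a minimal, self-contained sketch of temperature-scaled sampling over a toy next-token distribution (illustrative numbers, not any real model):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Pick one continuation from toy logits, with temperature-scaled randomness."""
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probabilities = {token: math.exp(v) / total for token, v in scaled.items()}
    return random.choices(list(probabilities), weights=list(probabilities.values()))[0]

# Hypothetical candidates for the next token after "Summarise the report as ..."
logits = {"bullet points": 2.1, "prose": 1.9, "a table": 1.2, "a haiku": 0.3}

for run in range(3):
    print(run, sample_next_token(logits, temperature=0.8))
# The same "prompt" can print three different continuations across three runs.
```

As temperature approaches zero the most likely token wins almost every time; above that, "most likely" is exactly that—likely, not guaranteed.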
The articulation tax
Remember the command line? High power, but you had to articulate exactly what you wanted. GUIs freed us from that—we could point, click, and manipulate directly. We didn't need words. Conversational LLMs bring the articulation tax back. Users must describe what they want. In words. Precisely enough for a probabilistic system to interpret. This is genuinely hard. Research suggests half the population struggles to articulate goals clearly in writing. We spent 40 years building interfaces that didn't require it. Now we're reintroducing the requirement.
The verification burden
With deterministic software, you verify your input. Did I click the right button? Did I enter the right number? If yes, you trust the output. With non-deterministic software, you must also verify the system's output. Did it understand me? Did it hallucinate? Is this actually correct? That's a new cognitive load. Every AI interaction now requires a quality-check step that deterministic interactions didn't. For high-stakes domains—code, legal, medical, financial—this burden is significant.
The trust equation changes
"If I learn this interface, I can rely on it."
"If this tool is usually useful, I'll tolerate the variability."
Different contract. Different user psychology. Different design requirements.
Act 4: The integration patterns
Three ways to add AI to software
After the wreckage, patterns started to emerge. Not every AI integration failed—some worked brilliantly. The difference wasn't the underlying model or the engineering talent. It was how teams chose to surface AI to users. Three patterns dominate, and they sit on a spectrum of user agency.
Conversational interfaces
The user drives through dialogue. Multi-turn, back-and-forth, natural language.
Examples: GitHub Copilot Chat, Microsoft 365 Copilot, Figma Make, ChatGPT itself.
Works well for: exploratory tasks where the user doesn't know exactly what they want. Complex problems requiring iteration. Situations where the user has the expertise to evaluate responses and refine prompts.
Struggles with: well-defined tasks with clear outcomes. Users who struggle to articulate needs in writing. High-frequency workflows where typing a prompt is slower than clicking a button.
Highest cognitive load. The user must formulate prompts, maintain context across turns, parse text-heavy responses, and verify outputs. It's powerful—but it's work.
Conversational UI has become the default because it's the easiest to build, not because it's the best for users. Jakob Nielsen's research suggests half the population isn't articulate enough in writing to get good results from chat interfaces. That's not a user failure, it's a design failure.
Single-shot prompt interfaces
One input, one output, the user reviews and decides. Not a conversation, a transaction.
Examples: Photoshop Generative Fill (select area + optional prompt → generated image). Notion "Ask AI" (highlight text + pick action → rewritten text). Figma First Draft (describe a screen → get a mockup).
Works well for: defined generative tasks where the output is a clear artefact. Situations where users want to review before committing. Creative exploration where "show me options" beats "let me describe exactly what I want."
Struggles with: tasks requiring iteration or refinement (you're stuck re-prompting from scratch). Ambiguous requests that require the AI to ask clarifying questions.
Medium cognitive load. Users still need to articulate something, but the interface constrains the scope.
The interface does the framing work. When you select an area in Photoshop and hit Generative Fill, the system already knows what you're asking (fill this region) and where (the selection). Your prompt just adds the how.
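One way to picture that framing work: the request a single-shot interface sends is mostly structured fields the UI already knows, with the user's words filling just one slot. A hypothetical sketch (invented field names, not Photoshop's actual API):

```python
from dataclasses import dataclass

@dataclass
class GenerativeFillRequest:
    """Hypothetical payload a single-shot interface might assemble."""
    task: str                           # the "what": fixed by the button pressed
    region: tuple[int, int, int, int]   # the "where": x, y, width, height of the selection
    document_context: str               # surrounding content, supplied automatically
    user_prompt: str                    # the "how": the only part the user articulates

request = GenerativeFillRequest(
    task="inpaint selected region",
    region=(120, 340, 400, 260),
    document_context="outdoor garden photo, late afternoon light",
    user_prompt="add a wooden fence",
)
print(request)
```

The articulation tax shrinks because the interface has already answered most of the questions the model would otherwise have to guess at.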
Automated / invisible AI
Button-press automation or background processing. The AI acts; the user approves or ignores.
Figma's "Rename layers", one click, all layers get sensible names. Slack's "Summarise thread". Gmail Smart Compose. GitHub Copilot's inline completions (ghost text you tab to accept). Loom's auto-titles and auto-chapters.
Repetitive tasks with predictable, verifiable outcomes. High-frequency micro-tasks where any friction kills adoption. Situations where AI accuracy is high enough that users can trust-then-verify.
Tasks requiring nuance or context the AI can't access. Situations where a wrong output has high consequences and isn't easily spotted.
Lowest cognitive load—but introduces automation complacency.
Zero articulation tax. The user doesn't need to describe anything; they just act and the AI responds. It's the closest to how deterministic interfaces work: click button, get result.
The 90/10 principle
Successful AI products keep roughly 90% of the interface familiar and make about 10% AI-driven. GitHub Copilot is 90% normal code editor, 10% ghost text. Photoshop Generative Fill is 90% normal Photoshop workflow, 10% AI-generated content. Notion AI is 90% normal document editing, 10% slash-command AI actions.
The products that struggled inverted the ratio, making AI chat the primary interface and expecting users to adapt. This matters because domain expertise lives in spatial and visual representations, not in abstract conversation. But there's a deeper reason too.
When Apple introduced the Macintosh, the interface didn't ask users to learn a new paradigm from scratch. It borrowed from what they already understood: desks, folders, bins. The metaphors weren't technically accurate, but they were cognitively accurate; they let users map new behaviour onto existing mental models while they built understanding. The 90/10 products do the same thing. They extend the user's existing model rather than replacing it wholesale.
This isn't permanent. As users build fluency with these new interaction types, the ratio could shift. But earning that shift takes time and trust.
Ground-up vs. retrofit
The ground-up vs. retrofit decision changes how much scope you have to push that ratio. Retrofit products like Notion AI, Adobe Firefly, and GitHub Copilot inherit an existing mental model contract with their users. Those users came for the known interface. AI must fit within it, at least initially. The risk of fighting that contract is high; the reward of honouring it is built-in distribution and trust.
Ground-up products make a different deal. Users opt in knowing they're adopting something new, which grants more scope to rethink the interface around AI from the start. Descript built a video editor where the transcript is the editing interface. You edit words, and the video follows. That model couldn't exist as a retrofit. Granola takes a similar position to note-taking; the core premise is that you barely write anything, and AI reconstructs the meeting from what little you captured. Evernote could never ship that as a feature without undermining its own product logic. Figma Make goes further still; rather than adding AI generation to Figma's design canvas, it builds from a prompt-first model where working prototypes emerge from conversation.
But the mental model question doesn't disappear for ground-up products; it just changes form. Even Descript frames itself through something familiar: a document. Even Granola looks like a notepad. The most successful ground-up products still find a familiar foothold and extend from there; they just have more freedom to choose which foothold.
The difference isn't the approach. It's whether the integration respects the user's existing mental model — or demands they abandon it before they're ready.
Act 5: What's actually working
The success stories share a pattern. When you study the AI integrations that achieved real adoption—not launch-day press releases, but sustained daily usage—they didn't treat AI as a feature to ship. They treated it as a design problem to solve.
GitHub Copilot: The ghost in the editor
The numbers: 20 million users by July 2025, 90% of the Fortune 100, roughly $600M revenue in 2024. The mechanism: inline code suggestions as grey "ghost text". Tab to accept, keep typing to ignore. No prompt. No chat. Zero cognitive overhead. Invisible until useful. Verification is trivial. They added chat later, carefully—for exploratory tasks, not as a replacement for the core inline experience. The lesson: the best AI interface is often no interface at all.
Adobe Generative Fill: AI that speaks Photoshop
Select an area, hit Generative Fill, optionally type a prompt, get AI-generated content on a new layer. Fits the existing mental model (selections and layers). Non-destructive by design. Three variations, user chooses. Spatial, not conversational. The lesson: The most powerful prompt is often a selection, not a sentence.
Notion AI: The block that writes
Slash commands: /summarise, /translate, /improve writing. AI output appears as standard Notion blocks—editable, movable, deletable. Learned behaviour, extended. Contextual, not general. Progressive enhancement from 2022 to autonomous agents in 2025. The lesson: AI that extends existing patterns beats AI that introduces new ones.
Loom: AI you never asked for (in a good way)
Auto-generated titles, summaries, and chapter markers. No user action required. Entirely invisible. Solves actual pain points. Safe defaults, easy overrides. Viewers benefit too. The lesson: The best AI features are the ones users don't notice until they're gone.
Slack: Catching up without reading everything
"Summarise thread" and "Summarise channel"—one click. Solves a universal problem. Scoped, not sprawling. Low-stakes verification. Unobtrusive integration. The lesson: AI that helps you skip things you'd rather not do anyway is an easy sell.
The common thread
- AI fits the existing workflow.
- Articulation is minimal or absent.
- Outputs are easily verified or discarded.
- AI is positioned as an assistant, not a creator.
- The 90/10 ratio holds.
Act 6: The decision framework
You've seen the history. You've seen the wreckage. You've seen what works. Now what? This act is about turning those lessons into questions you can actually ask before you build.
Should this be AI at all?
AI is a good fit when: the task involves generating, summarising, or transforming content; perfect accuracy isn't required (or human verification is built in); the problem has high variability that rules-based systems can't handle; users currently spend significant time on the task; the task is tedious but not high-stakes.
AI is a poor fit when: deterministic logic would solve the problem more reliably; errors are costly and hard to detect; users need to trust the output without verification; the task is already fast enough; or you're adding AI because competitors have it.
A useful test: describe the feature without using the word "AI." If you can't articulate the user value without referencing the technology, you're building a solution in search of a problem.
Which pattern fits the task?
Choose conversational when: the task is exploratory or ambiguous; users need to iterate; context is highly variable; users are experts who can evaluate outputs; the task is infrequent enough to justify the overhead.
Choose single-shot when: the task has clear inputs and outputs; users want to review before committing; the interface can provide context; "show me options" is more useful than "give me the answer."
Choose automated when: the task is repetitive and well-defined; accuracy is high enough for trust-then-verify; speed matters more than control; the task is low-stakes or easily reversible.
Watch users do the task today, without AI. How much do they need to articulate? How much variability exists? How consequential are errors?
What's the user's current mental model?
How do users think about this task today? What interface patterns do they already know? Where does this task sit in their workflow? What would "done" look like? What are they afraid of getting wrong?
Can you add AI while keeping 90% of the current interface intact?
Does your AI require users to describe what they want in words? If yes, have you validated that your users can articulate those needs effectively?
What happens when it's wrong?
Can users verify outputs easily? Can they correct or discard trivially? Are outputs non-destructive by default? Is the blast radius contained?
If your AI ignores explicit user commands and causes damage, what's the recovery path?
If your AI confidently states something false, who's responsible? Legally, it's you.
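One concrete way to keep the blast radius contained is a proposal pattern: AI output lands as a pending suggestion and nothing touches the user's content until they accept it. A minimal sketch with hypothetical names (not any particular product's API):

```python
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    original: str
    proposed: str

@dataclass
class Document:
    text: str
    pending: list[Suggestion] = field(default_factory=list)

    def propose(self, original: str, proposed: str) -> Suggestion:
        """AI output is queued for review; the document text is untouched."""
        suggestion = Suggestion(original, proposed)
        self.pending.append(suggestion)
        return suggestion

    def accept(self, suggestion: Suggestion) -> None:
        """Only an explicit user action applies the change."""
        self.text = self.text.replace(suggestion.original, suggestion.proposed, 1)
        self.pending.remove(suggestion)

    def discard_all(self) -> None:
        """Rejecting everything costs the user nothing."""
        self.pending.clear()

doc = Document(text="Q3 revenue grew 12%.")
suggestion = doc.propose("grew 12%", "grew 12% year on year")
doc.accept(suggestion)   # or doc.discard_all(), with zero blast radius
print(doc.text)          # "Q3 revenue grew 12% year on year."
```

Accept and discard are both cheap; the expensive path, silent auto-apply, never exists.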
What's the verification burden?
Verification is cheap when: outputs are glanceable; users can compare to originals; errors are obvious; stakes are low.
Verification is expensive when: outputs are long or complex; correctness requires expertise; errors are subtle; stakes are high.
AI that saves 10 minutes but requires 15 minutes of verification has negative value.
Demo day or daily use?
Demo-day features are impressive on first encounter, require ideal conditions, solve problems users don't have often, and optimise for "wow" over "useful."
Daily-use features are unremarkable but reliable, work under real-world conditions, solve frequent and genuine pain points, and handle edge cases gracefully.
Would users notice if you removed this feature after a month?
The anti-patterns
- Chat because it's easy to build
- AI because competitors have it
- Ship fast, fix later
- Users will figure it out
- More AI is better
Act 7: Close
The companies getting this right—GitHub, Adobe, Notion, the others—didn't ship faster than everyone else. They thought deeper. They matched patterns to problems. They respected their users' existing mental models. They designed for failure because they knew failure was inevitable.
They treated AI as a design problem, not a technology race. That's the shift. From "ship AI" to "solve problems, maybe with AI."
Simple to say. Hard to do. But worth doing.