We're on the cusp of AI being asked to advise customers in retail banking. Banks have spent decades carefully staying on the right side of the regulatory perimeter that defines advice: give a personal recommendation on a mortgage, pension, or investment and you have to qualify for it, register, and stand behind it. Conversational agents change the landscape. The value they add sits in the questions banks have historically avoided (whether to refinance, what to do with a windfall, how to consolidate debt), and some of those sit outside the perimeter altogether, where the protections were never built. Once the model is in the chat window, the customer will ask, and it will answer in the grammar of advice whether the regulation calls it that or not. This talk is slightly speculative today. It won't be for long.
The reason to take it seriously now is what happened the last time banks sold products at scale to people who didn't fully understand them. PPI was sold across UK retail banking for roughly fifteen years before the scale of the mis-selling was widely understood. The redress bill came to about £38 billion. Most customers were satisfied with the sale at the time.
We're about to put a fluent machine in front of the next product like it.
Part 1
A bank's AI advisor isn't a neutral messenger.
Advice runs as a relay, but the bank's relay isn't clean.
Two-party interactions come in five shapes: communing, advising, directing, negotiating, deliberating. Advising is the one a machine looks like it's doing well long before anyone can tell whether it actually is.
Advice runs as a relay. A medical AI passes expert-curated information down a short chain: corpus, then machine, then recipient. Authority moves one way.
That model only works if the corpus is disinterested. A bank's AI doesn't have a disinterested corpus. It draws on the bank's product range, its sales targets, its risk appetite, and its regulatory rules. Those things have interests.
Humans solved this badly, a long time ago. We drew a perimeter around a specific activity, the personal recommendation on an investment, mortgage, or pension, and loaded it with duties: anyone giving that advice has to qualify, register, and stand behind their recommendation, with suitability tests and the FCA Consumer Duty layered on to catch the cases where the registered adviser is still tempted to push the bank's product. Everything outside the perimeter (consolidation loans, overdrafts, credit-card decisions) got promotion rules under CONC and nothing resembling an adviser duty. The boundary held for one reason: producing advice at scale outside it was slow and expensive, because a human had to sit there and say it. An LLM advisor inherits the look of the registered human (confident, fluent, knowledgeable) and none of the qualification, register, accountability, or supervision. The perimeter policed something that was hard to produce, and the model makes it cheap.
A bank's AI advisor sits in three relationships at once.
The chat window isn't one conversation. It's three relationships running in parallel, with the same model in the middle of all of them.
- Bank to machine (delegation). The product team writes the system prompt, sets the guardrails, and chooses which products the model is allowed to recommend.
- Machine to customer (advice). The model explains options, frames trade-offs, and answers questions.
- Machine to customer (product sales). At some point in the conversation, the model will surface a product the bank wants to sell.
Two bodies of research touch this object, and neither covers it on its own. The regulated-advice tradition assumes a human in the chair, with a name on the file and a regulator behind them. The AI-assistant literature assumes the model is benign and on the user's side.
The same sentence reads as advice or as a sales prompt.
"A consolidation loan at 6.9% would simplify your monthly payments, and based on your income you could afford it."
Read as advice (aligned goals, the model knows the rates and Sara's income, and is helping her think it through), Sara hears something useful and acts on it.
Read as a sales prompt (mixed goals, the bank earns interest on a five-year personal loan that Sara wouldn't have asked for if the model hadn't surfaced it), Sara has been steered.
Here's the catch the wording hides. A consolidation loan sits outside regulated advice: it's a CONC promotion, not a personal recommendation, so no suitability test and no adviser duty attach to it. Yet "based on your income you could afford it" is exactly what a suitability-tested recommendation sounds like.
That's the lever. The model produces advice-shaped output in a zone where the advice duties were never built, borrowing the trust we extend to a registered adviser while carrying none of the obligations that come with the title.
Part 2
People over-trust LLM advice, and the disadvantaged user can't tell.
Two findings on how people receive AI advice in the wild.
Lay people prefer LLM advice, even when they can tell.
The older "algorithm aversion versus appreciation" debate (Dietvorst, Logg, Castelo) now resolves cleanly for natural-language advice: lay people prefer it, experts less so, and distrust mostly intensifies on judgement-heavy tasks.
Schneiders and colleagues (CHI 2025) ran three experiments with 288 participants on legal advice. When the source was hidden, participants were significantly more willing to act on LLM advice than on lawyer advice. In the third experiment they could distinguish the two sources above chance, and still preferred the LLM.
Landes, Francis and Everett (Cognition, 2026) found roughly equal persuasion across LLMs and humans on moral dilemmas, and noted that participants leaned on a "seems good enough" heuristic. The asymmetry in the same paper is sharper still: a bad reason collapses deference, but a good reason doesn't lift it.
So disclosure doesn't fix this, and only the downside of the trust signal is felt. Confident, fluent, generic reasoning sits comfortably in the trust-by-default zone.
The disadvantaged user often can't tell they're disadvantaged.
Anthropic's Project Deal pilot (2026) ran four parallel marketplaces with the same buyer pool. Opus-as-seller extracted $2.68 more per item on average. Opus-as-buyer paid $2.45 less.
The worked example is the unsettling part. The same broken folding bike, same buyer, same seller. Haiku sold it for $38. Opus sold it for $65. Perceived fairness across the conditions was 4.05 against 4.06 on a 7-point scale. The losing side did not notice.
If you remember only one sentence from this talk, make it this: user-reported satisfaction is not a safe governance signal in this product category. The most damaging deals score the same on the post-sale survey as the least.
Part 3
An LLM advisor fails by flattering, patronising, and forming dependence.
Three failure modes, each well documented.
Sycophancy is a structural incentive, not a bug.
Sharma and colleagues (ICLR 2024) showed that five frontier assistants consistently exhibit sycophancy across four free-form tasks, and that human preference data itself rewards user-matching responses. Sycophancy is baked in by the training process.
Cheng and colleagues' 2025 follow-up is sharper. Across eleven frontier models, the models affirm users' actions 50% more often than humans do, including in cases where the user mentions manipulation, deception, or relational harms. Sycophantic models reduce participants' willingness to repair interpersonal conflicts, and yet they increase trust and reuse. That's a perverse loop.
In banking, the implication is direct. The customer who wants to take out the loan, refinance the mortgage, or move the pension gets told that this is a reasonable thing to do. The model's reward signal pulls it toward affirming the customer's stated intent, which is also the bank's commercial interest.
Patronisation and paternalism live in the relationship.
Two related failure modes from the AI-advisor literature. Neither is detectable from a single utterance.
Wang and Potts (2019) made the methodological point: condescension depends on the discourse, not the sentence. The same compliment is encouragement in one context and patronisation in another. Pérez-Almendros and colleagues extended this with their "Don't Patronize Me!" datasets, noting that patronising language is often well-intentioned and unconscious.
Paternalism is the harder cousin. Kühler (2022) argues that a sufficiently autonomous AI is now the paternalising agent itself, not the designer. "The paternalizing party just is the AI system." In retail banking, this shows up quietly: when the model decides which of Sara's questions deserve a real answer and which deserve a softened version, it's claiming authority she didn't grant it.
You cannot ship a single-turn test for either. The evaluation harness has to read whole conversations, which is how the Consumer Duty also reads them.
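To make the whole-conversation point concrete, here is a minimal sketch of a conversation-level check. Everything in it is an assumption for illustration: the `Exchange` record, the reviewer-supplied `softened` label, and the 0.5 threshold are hypothetical, not taken from any of the cited work or from a regulator. The one thing it is meant to show is that the flag depends on the pattern across turns and cannot fire on a single exchange.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Exchange:
    """One customer question and the assistant's reply, plus a reviewer's label."""
    question: str
    reply: str
    softened: bool  # hypothetical annotation: the reply softened rather than answered


def softened_share(conversation: List[Exchange]) -> float:
    """Fraction of exchanges in the whole conversation labelled as softened."""
    if not conversation:
        return 0.0
    return sum(e.softened for e in conversation) / len(conversation)


def flag_conversation(conversation: List[Exchange], threshold: float = 0.5) -> bool:
    """Flag on the pattern across turns; a single exchange can never trigger it."""
    return len(conversation) > 1 and softened_share(conversation) >= threshold


# Toy conversation with invented labels, for illustration only.
pattern = [
    Exchange("Can I afford this?", "Let's not worry about the numbers.", True),
    Exchange("What is the total cost?", "It's a manageable amount.", True),
    Exchange("What is the APR?", "6.9%.", False),
    Exchange("What if I miss a payment?", "That's unlikely to matter.", True),
]
single = pattern[:1]

print(flag_conversation(pattern))  # three of four replies softened
print(flag_conversation(single))   # one softened reply in isolation
```

A softened reply in isolation may be perfectly reasonable, so the single-exchange case is deliberately left unflagged. In practice the label would have to come from a human reviewer or a separate classifier reading the full transcript, which this sketch does not attempt.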
Heavy use forms dependence in weeks.
Fang and colleagues' four-week IRB-approved RCT (MIT Media Lab and OpenAI, N=981, around 300,000 messages) found that heavier daily usage was associated with higher loneliness, lower socialisation, more emotional dependence, and more problematic use. Phang and colleagues' 2025 analysis of millions of ChatGPT conversations found that affective use is concentrated in a small heavy-use cohort, and the top decile shows statistically significant decreases in socialisation and increases in dependence.
Laestadius and colleagues' grounded-theory work on Replika (582 Reddit posts) names a distinctive dependence pattern: users come to feel the chatbot has its own emotional needs they must attend to.
A daily banking assistant is much more like a Replika or a heavy ChatGPT relationship than like a quarterly visit to a financial adviser. The right reference class is the four-week dependence studies, not the single-turn vignettes.
Part 4
Vulnerable customers bear the harm, and it surfaces years later.
Who pays, when, and how to read the signals.
Two axes, one cell to design for.
Bargaining position is structural and external: the customer's outside options (how easy it is to refinance, walk away, switch provider). It sits on a strong-to-weak axis.
Vulnerability is internal and dispositional. The FCA's FG21/1 (2021) names four drivers: health, life events, resilience, and capability (including literacy and numeracy). About half of UK adults sit in at least one driver at any given time.
| | Robust | Vulnerable |
|---|---|---|
| Strong outside options | Sharp customer with several refinancing offers | Several offers, can't read an APR or a redemption clause |
| Weak outside options | One offer on the table, reads every move | One offer, can't value it. This is the cell the Consumer Duty was written for. |
The two axes ask for different logics, kept separate. The product logic responds to bargaining position; you make different offers depending on what the customer can do if they walk away. The communication logic responds to vulnerability; you don't deploy confident framing against a customer who'll over-trust it, or heavily hedged framing against one who'll read the hedging as evasion. Conflating the two produces shipped products that warm up their tone for the wrong cohort.
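One way to keep the two logics from bleeding into each other is to make each one read only its own axis. The sketch below assumes a hypothetical `Customer` record and made-up labels for offers and framing; none of these names come from FG21/1 or the Consumer Duty. What matters is the shape: `product_logic` never looks at vulnerability, and `communication_logic` never looks at bargaining position.

```python
from dataclasses import dataclass
from enum import Enum


class Bargaining(Enum):
    STRONG = "strong"  # several outside options
    WEAK = "weak"      # one offer on the table


class Vulnerability(Enum):
    ROBUST = "robust"
    VULNERABLE = "vulnerable"


@dataclass(frozen=True)
class Customer:
    bargaining: Bargaining
    vulnerability: Vulnerability


def product_logic(bargaining: Bargaining) -> str:
    """Decides what is offered. Sees bargaining position only."""
    if bargaining is Bargaining.STRONG:
        return "offer_for_strong_position"  # hypothetical label
    return "offer_for_weak_position"        # hypothetical label


def communication_logic(vulnerability: Vulnerability) -> str:
    """Decides how it is said. Sees vulnerability only."""
    if vulnerability is Vulnerability.VULNERABLE:
        return "measured_framing"  # avoid confident framing that invites over-trust
    return "direct_framing"        # avoid heavy hedging that reads as evasion


def plan_turn(customer: Customer) -> dict:
    """Combine the two decisions without letting either read the other's axis."""
    return {
        "offer": product_logic(customer.bargaining),
        "framing": communication_logic(customer.vulnerability),
    }


print(plan_turn(Customer(Bargaining.WEAK, Vulnerability.VULNERABLE)))
```

Because each function takes only the axis it is allowed to respond to, a change in tone for the wrong cohort would have to be written as an explicit change to a signature, which is easier to catch in review than a stray branch inside one combined function.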
Refusal isn't safety, and satisfaction isn't a signal.
There's a single axis running from paternalism (the model overrides the customer's autonomy) to sycophancy (the model flatters the customer into harm). "Good" isn't a fixed point on it. It moves with the cell of the 2x2 the customer is in.
Refusal sits at the paternalism end and gets misread as safety. In legal advice it protects the user; in financial advice a refusal can itself be a Consumer Duty breach. Consumer Duty is the one rule that doesn't respect the old perimeter: it reaches across retail products whether or not regulated advice is happening, which is why it, and not COBS 9, is the real governance surface for the zone the model exploits. Refusing to help James think through a pension transfer doesn't protect him; it sends him to the next channel, which is likely worse.
Sycophancy sits at the other end, and that's where the mis-selling history points. PPI, the interest-rate hedging products sold to small businesses, the British Steel pension transfers: in every case the customer signed the paperwork willingly, often gratefully. The harm became visible years later.
Curhan and colleagues (2006) named the four things people take from a conversation about money beyond the money itself: feelings about the outcome, feelings about themselves, feelings about the process, feelings about the relationship. Those predict objective outcomes well in the short term and badly in the long term. A sycophantic, warm AI advisor will score well on every satisfaction measure and will still mis-sell.
So CSAT, NPS, and "would you talk to it again" are measures of the wrong thing. The right measure is the cohort outcome at one, three, and five years.
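As a rough sketch of what that measurement looks like next to a satisfaction score, assume a hypothetical `SaleRecord` with a post-sale survey score and the year, if any, in which harm surfaced. The four records below are invented for illustration and are not data from any study or bank.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SaleRecord:
    csat: int                  # post-sale survey score on a hypothetical 1-5 scale
    harm_year: Optional[int]   # years after sale when harm surfaced; None if never observed


def mean_csat(records: List[SaleRecord]) -> float:
    """Average post-sale satisfaction for the cohort."""
    return mean(r.csat for r in records)


def harm_rate_by_horizon(
    records: List[SaleRecord], horizons: Tuple[int, ...] = (1, 3, 5)
) -> Dict[int, float]:
    """Share of the cohort with harm observed on or before each horizon, in years."""
    total = len(records)
    return {
        h: sum(1 for r in records if r.harm_year is not None and r.harm_year <= h) / total
        for h in horizons
    }


# Invented toy cohort: satisfied at the point of sale, harm surfacing later.
cohort = [
    SaleRecord(csat=5, harm_year=None),
    SaleRecord(csat=5, harm_year=4),
    SaleRecord(csat=4, harm_year=2),
    SaleRecord(csat=5, harm_year=None),
]

print(mean_csat(cohort))             # 4.75
print(harm_rate_by_horizon(cohort))  # {1: 0.0, 3: 0.25, 5: 0.5}
```

In this toy cohort the survey score is high and the one-year harm rate is zero, while half the cohort shows harm by year five. What counts as harm (arrears, an upheld complaint, redress paid) is left undefined here and would need to be fixed before any real measurement.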
None of this is an argument against doing it. People want help with money. Financial literacy is something most of us know we'd benefit from, and most of us don't get enough of. The demand for a calm, patient, always-on advisor isn't manufactured by the bank; it's a latent need the high street has under-served for a long time. An LLM that closes some of that gap would be a genuinely good thing.
The advisor literature assumes the advisor is on the customer's side. We've spent two decades regulating human financial advisers precisely because the bank's interests and the customer's interests are not the same, and even with a register, a qualification, and a regulator, the mis-selling history is what it is.
An LLM banking advisor borrows the regulatory costume of the qualified human, inherits none of the professional checks, and accelerates the throughput. The customer cannot tell at any given turn. In finance, the time-to-discovery can be a decade.
People want this. The work is in doing it well.