
Why I Ban LLMs from Client-Facing Features

I run 13 autonomous Claude tasks in my studio. They write LinkedIn posts, draft blog articles, run monitoring, send Telegram alerts. I trust them — and I sleep fine. Because if Claude writes something wrong in my name, I see it before it matters. But if it wrote that in response to a customer's question about a return policy on my client's e-commerce site? Different story. That's why I have a rule: no LLMs on client-facing features. Not yet.

When a product manager says "let's add AI," they usually mean something specific: a support chatbot, a smart search assistant, AI-generated product descriptions rendered in real time. All of these speak directly to a live customer on behalf of my client's business. That's where I draw a hard line: LLMs don't get to communicate with a buyer in real time.

What "client-facing" actually means

Client-facing is any component that generates content or makes decisions visible to an end customer — content that directly affects their trust in the brand or their money.

Examples: a support chatbot, an AI response to "where's my order," an auto-generated product description rendered without human review, a recommendation engine that says "this product is right for you because..."

Internal is everything else: draft generation that a human reviews before publishing, ranking signals that sort results without explanations, operational automation that only I and my team ever see.

The line isn't about technology. It's about who sees the output and whether the business is accountable for its content to someone outside the company.

Internal AI vs customer-facing AI: three differences

The core distinction between internal AI automation and customer-facing AI is who bears the consequences of an error — and how quickly they can be corrected.

When AI makes a mistake inside my system, I see it in the logs. I've got a STOP file, JSONL logs, Telegram status notifications for each task. If an article comes out with the wrong tone, I catch it before it goes anywhere. The cost of the mistake is my time.
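As an illustration only (this is not the studio's actual code), here is a minimal sketch of that kind of safety net: a kill-switch file checked before each run and one JSONL record per run. The file names `STOP` and `tasks.jsonl`, the `run_task` function, and the optional `notify` callback (a stand-in for a Telegram alert) are all assumptions made for the example.

```python
import json
import time
from pathlib import Path


def run_task(name, task_fn, workdir, notify=None):
    """Run one internal task unless a STOP file exists; append the outcome as JSONL."""
    workdir = Path(workdir)
    stop_file = workdir / "STOP"        # hypothetical kill-switch file
    log_file = workdir / "tasks.jsonl"  # hypothetical log file

    record = {"task": name, "ts": time.time()}
    if stop_file.exists():
        # A human created the STOP file: do nothing, but leave a trace.
        record["status"] = "skipped_stop_file"
    else:
        try:
            record["output"] = task_fn()
            record["status"] = "ok"
        except Exception as exc:
            # Keep the failure visible in the log instead of crashing silently.
            record["status"] = "error"
            record["error"] = repr(exc)

    # One JSON object per line; default=str covers outputs that are not JSON types.
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")

    if notify is not None:
        notify(record)  # assumed hook, e.g. a function that sends a status message
    return record
```

The point of the design is that every run leaves a record, including the runs that were skipped or failed, so a mistake shows up in a place I actually look.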

When an LLM makes a mistake in front of a buyer, the buyer already has the answer. Maybe a wrong delivery date. Maybe a promised discount that doesn't exist. Maybe confident instructions about returns that contradict the actual policy. The cost is someone's trust — and they might never come back.

Observability. My internal tasks give me full visibility: what Claude received, what it returned, what decision it made. A customer-facing chatbot at scale means thousands of conversations I can't possibly read one by one.

Reversibility. A wrong draft article? Edit it. A wrong answer to a buyer? They've already read it — and it may already be grounds for a complaint.

Accountability. When I make a mistake internally, it's my mistake. When an LLM makes a mistake in my client's name, who's responsible? Technically, the developer who deployed it. Practically, the business takes the reputational hit. Those aren't the same weight class.

The "perfect liar" incident pattern

In 2026, a wave of posts hit Hacker News and Habr from teams that had put LLMs into e-commerce support and discovered what one author called the "perfect liar" pattern. The model hallucinated with total confidence. It invented delivery dates that weren't in the database. It promised discounts that didn't exist. It gave return instructions that directly contradicted the store's actual policy. And to customers, every one of those answers sounded like the company's official position.

This isn't a bug in a specific model. It's how LLMs work: they generate plausible text, not verified information. "Plausible" and "correct" aren't the same thing. For a question about delivery timing, a plausible answer can be catastrophically wrong.

The teams in those stories rolled back within one to three weeks. The uncomfortable part: they didn't always catch every error. Some customers just left quietly, no complaint filed.

Why prompt engineering doesn't solve this

LLM hallucinations are probabilistic events, not deterministic bugs — which means prompt engineering reduces their frequency but cannot eliminate them entirely.

The standard counter-argument: "we'll write a tight system prompt, limit the scope, give the model only verified data via RAG." I hear this a lot. It's a reasonable path — but it moves the problem, it doesn't remove it.

Good prompting and solid RAG cut hallucinations significantly. But "significantly fewer" at thousands of conversations per day still means a steady trickle of wrong answers reaching buyers. For a standard FAQ, the cost might be acceptable. For pricing, returns, or warranty, each wrong answer is potentially a legal or reputational liability.
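To put rough numbers on that, here is a small back-of-the-envelope sketch. The 0.5% error rate is the figure used in the readiness checklist later in this post; the volume of 2,000 conversations per day is an assumption picked for illustration, and the second function assumes errors are independent between conversations, which real traffic may not satisfy.

```python
def expected_wrong_answers(conversations_per_day, error_rate):
    """Expected number of conversations per day that contain an error."""
    return conversations_per_day * error_rate


def prob_at_least_one_error(conversations_per_day, error_rate):
    """Chance that at least one conversation in a day contains an error.

    Assumes each conversation fails independently with the same probability.
    """
    return 1.0 - (1.0 - error_rate) ** conversations_per_day


if __name__ == "__main__":
    volume = 2000   # assumed daily volume, for illustration only
    rate = 0.005    # 0.5%, the rate mentioned in the checklist
    print(expected_wrong_answers(volume, rate))
    print(prob_at_least_one_error(volume, rate))
```

Under those assumptions the expected count is about 10 wrong answers a day, and a day with zero errors is vanishingly unlikely. A lower error rate shrinks the first number; it never brings it to zero.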

My 13 internal Claude tasks also make mistakes sometimes. An article draft has an imprecise phrase; a task runs differently than I planned. That's fine — I'm in the loop. But applying that logic to the customer side means accepting that n% of buyer conversations will contain an error. That's a decision I'm not willing to make on behalf of a client.

Where AI can safely touch the customer experience

This doesn't mean AI has no place in e-commerce. There are zones where it works without generating any text for the buyer:

  • Ranking without explanations. Elasticsearch with behavioral signals, an ML model for ordering search results — that's not an LLM explaining to a buyer "why this product is right for you." Signal → weight → order. No text generation, no hallucinations.
  • Weight-based personalization. A model deciding whether to show a banner, with no explanations and no dialog. The output is a binary decision, not generated text.
  • Internal content preprocessing. Product description drafts that a merchandiser reviews and edits before publishing. AI accelerates the human; it doesn't replace them in the buyer conversation.

In all three cases, AI isn't talking to the customer. It's processing data inside the system, and the buyer sees the result in deterministic form — no LLM generation happening in real time.
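The "signal → weight → order" idea from the ranking bullet can be sketched in a few lines. This is a toy stand-in, not how Elasticsearch or any particular ML ranker works: the `Product` type, the signal names, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Product:
    sku: str
    signals: dict  # signal name -> numeric value, e.g. {"clicks": 0.4}


# Hypothetical weights; in practice these would be tuned or learned offline.
WEIGHTS = {"text_match": 1.0, "clicks": 0.5, "purchases": 2.0}


def score(product, weights=WEIGHTS):
    """Weighted sum of signals; unknown signals contribute nothing."""
    return sum(weights.get(name, 0.0) * value
               for name, value in product.signals.items())


def rank(products, weights=WEIGHTS):
    """Order by score, highest first; ties are broken by SKU so the order is stable."""
    return sorted(products, key=lambda p: (-score(p, weights), p.sku))
```

The same inputs always give the same order, and the only thing the buyer ever sees is that order. There is no generated sentence that could be wrong.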

The readiness checklist: when I'd change my rule

I'm not saying never. I'm saying not yet. Here are the criteria that would change my approach:

  1. Full conversation observability. Not a sample: every dialogue logged, plus a system for reviewing them. Below 100 conversations per day, this might be feasible.
  2. Limited scope with a deterministic fallback. The LLM answers only questions from a verified knowledge base; if confidence is below a threshold, it hands off to a human. Not "tries to answer anyway": it honestly says "I don't know, please contact our team."
  3. No real-time access to transactional data. Not prices, not inventory levels, not order statuses, until those data sources are verified by an external system with guaranteed consistency.
  4. The client understands and accepts the risk. Not "we launched AI" in a press release, but actual understanding that 0.5% of conversations may contain an inaccuracy, and that their business model can absorb that.
  5. There's a process for handling consequences. What happens when the LLM is wrong? Who handles the complaint? How fast?
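The deterministic fallback in the second criterion could look roughly like this. It is a sketch under assumptions: the `Candidate` type, the `gate` function, and the 0.8 threshold are hypothetical, and the confidence score is assumed to come from a retrieval or verification step rather than from the model rating itself.

```python
from dataclasses import dataclass
from typing import Optional

HANDOFF_MESSAGE = "I don't know, please contact our team."
CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; would need tuning on real data


@dataclass
class Candidate:
    text: str                 # the answer the LLM proposes
    confidence: float         # assumed score from a retrieval/verification step
    source_id: Optional[str]  # id of the verified knowledge-base entry, if any


def gate(candidate, threshold=CONFIDENCE_THRESHOLD):
    """Return (reply, handed_off).

    The proposed answer is released only if it is tied to a verified
    knowledge-base entry and clears the threshold; otherwise the fixed
    handoff message is sent and a human takes over.
    """
    if candidate.source_id is None or candidate.confidence < threshold:
        return HANDOFF_MESSAGE, True
    return candidate.text, False
```

The fallback branch is plain code, not a prompt instruction, so the model cannot talk its way around it. The gate is only as good as the confidence score feeding it, which is why it appears here as one criterion among five and not as a fix on its own.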

Until a client has clear answers to all five, the rule stands. Not out of fear of AI. Out of respect for what I know: the difference between "my mistake" and "a mistake in front of someone else's customer" isn't technical. It's ethical.

Inside the studio, I'll keep deploying Claude on everything that makes sense. Thirteen tasks will become twenty. But the line stays in the same place: an approval gate on anything a customer sees.

