The bug Claude wrote. Seven days in production.
The code looked correct. Tests passed. CI was green.
I approved the pull request.
Seven days later, we found a bug in production — in exactly the code Claude wrote in eleven minutes.
This isn't a story about AI writing bad code. It's about how our attention shifts when AI writes the code instead of a human — and why that shift is more dangerous than it looks.
What Claude wrote — and what I approved
The task was specific: write a cursor-based pagination utility for the Bitrix REST API, used by a headless Next.js frontend. Cursor-based to avoid offset drift on a live catalog.
Claude produced it fast. The logic looked clean: accept cursor and page_size, return the next cursor and a result array, return null at the end. Tidy naming, proper types.
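Here is roughly what that contract looks like in TypeScript. This is a sketch, not the code from the pull request: the names `Page`, `FetchPage`, and `collectAll` are made up for illustration, the cursor is assumed to be a plain number, and an in-memory array stands in for the Bitrix REST API so the example runs on its own (a real client would be async).

```typescript
// Hypothetical result shape: one page of items plus the cursor for the next call.
interface Page<T> {
  items: T[];
  nextCursor: number | null; // null signals "nothing left to fetch"
}

type FetchPage<T> = (cursor: number, pageSize: number) => Page<T>;

// A caller keeps requesting pages until the utility reports null.
function collectAll<T>(fetchPage: FetchPage<T>, pageSize: number): T[] {
  const out: T[] = [];
  let cursor: number | null = 0;
  while (cursor !== null) {
    const page: Page<T> = fetchPage(cursor, pageSize);
    out.push(...page.items);
    cursor = page.nextCursor;
  }
  return out;
}

// In-memory stand-in for the catalog endpoint (25 items, illustrative only).
const catalog: string[] = Array.from({ length: 25 }, (_, i) => `product-${i + 1}`);
const fetchFromMemory: FetchPage<string> = (cursor, pageSize) => {
  const items = catalog.slice(cursor, cursor + pageSize);
  const next = cursor + items.length;
  return { items, nextCursor: next < catalog.length ? next : null };
};

const collected = collectAll(fetchFromMemory, 10);
console.log(collected.length); // 25
```

The whole contract hinges on that one `null`: every caller treats it as the only stop signal, so a page that should carry it and doesn't results in one extra request.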
The tests were also Claude's. Three scenarios: 25 items with page_size=10, 7 items with page_size=20, 100 items with page_size=30. All green.
I read the code. Cursor logic — correct. API calls — right. Types — aligned. I approved. Ticket closed.
How we found it: logs, not tests
A week later, a user complaint: they reached page 12 of the catalog and saw an empty product list instead of the last items.
I opened the logs. A cluster of requests hitting /catalog?page=13, all returning empty arrays. The frontend wasn't distinguishing "empty last page" from "no data" — it showed a "no results" placeholder.
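One way a frontend could keep those cases apart is to classify each response explicitly instead of treating every empty array the same. The function `classifyPage` and its state names are hypothetical, not what our frontend actually contains.

```typescript
type PageState = "has-items" | "empty-catalog" | "past-the-end";

// isFirstPage tells an empty catalog apart from a request that overshot the last page.
function classifyPage(itemCount: number, isFirstPage: boolean): PageState {
  if (itemCount > 0) return "has-items";
  return isFirstPage ? "empty-catalog" : "past-the-end";
}

console.log(classifyPage(20, false)); // has-items
console.log(classifyPage(0, true));   // empty-catalog
console.log(classifyPage(0, false));  // past-the-end
```

With a split like this, only "empty-catalog" would earn the "no results" placeholder; a "past-the-end" response could send the user back to the last real page.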
The client's catalog: 240 products, page_size=20. Exactly 12 pages. Not one product more.
The bug: the end condition was written as cursor >= total. When the cursor hit exactly 240, the condition triggered as written, but on that final step the function returned an empty array instead of null. The frontend had no way to tell an empty array that meant "end of catalog" from one that meant "no data".
If the catalog had 241 or 239 products, this never would have surfaced. The exact divisibility was the trigger.
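The original code isn't reproduced in this post, so what follows is a hypothetical reconstruction of one way a `cursor >= total` check can behave like this. It assumes a numeric cursor and an in-memory array in place of the API. `paginateBuggy`, `paginateFixed`, and `walk` are illustrative names.

```typescript
interface Page<T> {
  items: T[];
  nextCursor: number | null;
}

// Hypothetical reconstruction of the failure mode, not the original code.
function paginateBuggy<T>(all: readonly T[], cursor: number, pageSize: number): Page<T> {
  // The end check only looks at the incoming cursor...
  if (cursor >= all.length) return { items: [], nextCursor: null };
  const items = all.slice(cursor, cursor + pageSize);
  // ...and a short page is the only other end signal, so a full last page
  // still advertises a next cursor.
  return { items, nextCursor: items.length < pageSize ? null : cursor + pageSize };
}

// One possible fix: decide from the position after this page.
function paginateFixed<T>(all: readonly T[], cursor: number, pageSize: number): Page<T> {
  const items = all.slice(cursor, cursor + pageSize);
  const next = cursor + items.length;
  return { items, nextCursor: next < all.length ? next : null };
}

type Paginate = (all: readonly number[], cursor: number, pageSize: number) => Page<number>;

// Counts how many requests a client makes and how many of them come back empty.
function walk(paginate: Paginate, total: number, pageSize: number) {
  const all = Array.from({ length: total }, (_, i) => i);
  let cursor: number | null = 0;
  let requests = 0;
  let emptyPages = 0;
  while (cursor !== null) {
    const page: Page<number> = paginate(all, cursor, pageSize);
    requests += 1;
    if (page.items.length === 0) emptyPages += 1;
    cursor = page.nextCursor;
  }
  return { requests, emptyPages };
}

console.log(walk(paginateBuggy, 240, 20)); // { requests: 13, emptyPages: 1 }
console.log(walk(paginateFixed, 240, 20)); // { requests: 12, emptyPages: 0 }
console.log(walk(paginateBuggy, 241, 20)); // { requests: 13, emptyPages: 0 }
```

In this reconstruction, 239 and 241 items both end on a short page, so the flawed version terminates cleanly. Only a full last page produces the extra, empty thirteenth request.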
None of the three test scenarios used a multiple. 25, 7, 100 — all non-divisible. This wasn't random. The same model that wrote the code wrote the tests.
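The arithmetic is quick to check. The snippet below is only an illustration: it takes the three test scenarios and the production catalog as (total, page size) pairs and computes the remainder for each.

```typescript
// [total items, page size]: the three test scenarios, then the production catalog.
const scenarios: Array<[number, number]> = [
  [25, 10],
  [7, 20],
  [100, 30],
  [240, 20],
];

// A remainder of 0 means the last page is exactly full.
const remainders = scenarios.map(([total, pageSize]) => total % pageSize);
const exactMultiples = scenarios.filter(([total, pageSize]) => total % pageSize === 0);

console.log(remainders);     // [ 5, 7, 10, 0 ]
console.log(exactMultiples); // [ [ 240, 20 ] ]
```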
Why it passed review: the attention mode shift
AI-generated code changes how engineers review pull requests — not because the code is worse, but because it looks more polished than average human code.
When a colleague writes code, I know where to look. One person reliably misses off-by-one errors. Another tends to have async ordering issues. I don't just review "is this correct?" — I review "is this correct given this specific person's patterns?"
When Claude wrote the code, I reviewed it differently. The code was well-structured. Variables named clearly. No obvious dead weight. I checked whether the logic was sound, not whether the edge cases were covered.
My brain concluded: "looks right, probably right." And that conclusion went unchallenged, because the polish that makes AI-generated code pleasant to read is exactly what makes it more dangerous to review.
There's a name for this: automation bias. The tendency to trust automated systems more than they deserve. With AI-written code, we haven't accumulated decades of experience to calibrate appropriate skepticism. We're still pattern-matching on surface quality.
Logic vs. edge cases: a different review mindset for AI code
Standard code review checks two things: does the core logic work, and are the edge cases handled.
AI is good at the first one. The cursor pagination logic was correct for the typical path. API structure and types matched what the frontend expected.
But AI optimizes for the context it's given. The prompt didn't explicitly ask "what happens when total is a multiple of page_size?", so the model didn't generate a test for it. That isn't a model failure. It's a constraint you have to account for in how you review.
A good reviewer of AI-generated code shouldn't ask "is the logic correct?" They should ask "what inputs could break this correct-looking logic?"
That's a different question. And it has to be asked explicitly.
What we changed in the protocol
Reviewing AI-generated code safely takes two explicit steps that we never had to spell out when reviewing human-written code: a boundary condition checklist and a direct question to the model.
After this incident, we added both steps to our workflow.
First: an explicit edge case list before I approve. I literally write it out: empty array, zero items, maximum, exact multiple, single item, type overflow. I check that at least some are covered in tests — not because the AI wrote bad tests, but because I need to set the right context.
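For pagination, part of that list can be written down as data and run against any implementation. This is a sketch under assumptions: the rows, the invariants, and the names `checklist`, `failures`, `shortPageOnly`, and `positionBased` are mine for illustration, the cursor is a plain number, and "maximum" and "type overflow" are left out because they depend on the real API's limits.

```typescript
interface Page<T> { items: T[]; nextCursor: number | null; }
type Paginate = (all: readonly number[], cursor: number, pageSize: number) => Page<number>;

// Hypothetical checklist rows: [label, total items, page size].
const checklist: Array<[string, number, number]> = [
  ["zero items", 0, 20],
  ["single item", 1, 20],
  ["one short of a multiple", 239, 20],
  ["exact multiple", 240, 20],
  ["one past a multiple", 241, 20],
  ["page size of one", 5, 1],
];

// Returns the labels of scenarios where the paginator serves an empty page
// for a non-empty catalog, loses items, or never reports null.
function failures(paginate: Paginate): string[] {
  const failed: string[] = [];
  for (const [label, total, pageSize] of checklist) {
    const all = Array.from({ length: total }, (_, i) => i);
    let seen = 0;
    let ok = true;
    let cursor: number | null = 0;
    // The request cap keeps a paginator that never returns null from looping forever.
    for (let requests = 0; cursor !== null && requests < total + 2; requests++) {
      const page: Page<number> = paginate(all, cursor, pageSize);
      if (page.items.length === 0 && total > 0) ok = false;
      seen += page.items.length;
      cursor = page.nextCursor;
    }
    if (cursor !== null || seen !== total) ok = false;
    if (!ok) failed.push(label);
  }
  return failed;
}

// Treats a short page as the only end-of-data signal.
const shortPageOnly: Paginate = (all, cursor, pageSize) => {
  const items = all.slice(cursor, cursor + pageSize);
  return { items, nextCursor: items.length < pageSize ? null : cursor + pageSize };
};

// Decides from the position after the current page.
const positionBased: Paginate = (all, cursor, pageSize) => {
  const items = all.slice(cursor, cursor + pageSize);
  const next = cursor + items.length;
  return { items, nextCursor: next < all.length ? next : null };
};

console.log(failures(shortPageOnly)); // [ 'exact multiple', 'page size of one' ]
console.log(failures(positionBased)); // []
```

The "page size of one" row fails for the same reason as the exact multiple: every total is a multiple of 1, so the last page is always full.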
Second: an explicit question to the model. In the same chat, before closing: "List the edge cases that could break this code. Which ones aren't covered by the tests you wrote?" Claude answers honestly. It usually names the boundary conditions itself — if you ask directly.
These two steps add three to five minutes. They don't meaningfully slow development. But they keep a whole class of bugs from reaching production.
I still let AI write the tests. But I dictate the scenarios for anything critical.
When this is fine
The pagination bug wasn't severe. No data loss. Users saw an empty page instead of products. Found in seven days, fixed in thirty minutes.
AI-written code in production is normal at this point. The real question isn't "did a human or an AI write this?" It's "who owns the edge cases?"
When I handed Claude the task and approved without checking boundary conditions explicitly, the responsibility stayed with me. Not with the model. A tool behaves according to how you use it.
The rule I have now: AI writes the code, I specify the edge cases. It takes longer than just clicking Approve. But that's the actual review.
Seven days in production is a reasonable price for learning this once. The point is it doesn't happen again.
I write about working with AI in real production environments, headless architecture, and Bitrix on ivanpin.com. See "I let Claude write tests. I don't let it choose architecture." for where AI helps and where it doesn't.