AI Code Review Limitations: 3 Blind Spots After 3 Months

AI code review tools are good at what they can see — syntax, patterns, obvious errors. They consistently miss what requires system context: cache dependencies, business logic hidden in refactors, and technical debt that only shows at scale.

Three months ago I started running every pull request through AI code review before looking at it myself. Not because I trusted it. To figure out what to trust.

I kept a running log: what it caught, what it missed, what I found after it. The results were more interesting than I expected. "AI misses context" — that's obvious. The interesting part: it's blind specifically to changes that look like pure technical code but mean something at the system level.

The workflow — no illusions

I use Claude Code and ask it to review changes before I open the diff. Standard prompt: look at the changes, flag potential issues — syntax, logic, security.

This doesn't replace my review. It's a first pass that filters out the obvious.

Three months, 50+ PRs. Small fixes and component refactors across PHP/Bitrix backends and Next.js frontends.

What AI reviews well

An honest list:

Unclosed null-checks — consistently, almost never misses these
Obvious typos in conditions: if ($a = $b) instead of if ($a == $b)
Missing exception handling where it's clearly needed
Style violations (if you give it the codestyle context)
Deprecated method usage when it's visible from the code itself

This is useful. It frees me from the first mechanical pass. But that's the whole list. Beyond this, the blind spots start.

Blind spot 1 — Locally correct, systemically wrong

The most dangerous AI code review limitation: a change that's syntactically and logically valid in isolation, but breaks something at the infrastructure level that the tool can't see.

A developer edits a caching method in Bitrix. The change looks valid: BXClearCache was replaced with a more targeted cache tag. AI reviews it, no issues. Syntax is clean, logic has no obvious holes.

Three days after deploy, users report that prices on the site are "frozen" after a catalog update. The cause: Bitrix composite cache operates at the page level, and targeted tag invalidation doesn't touch the composite cache layer. The page keeps serving from the composite cache with stale data.

AI doesn't see this. Not because it's a bad tool. Because it doesn't have context about how Bitrix composite cache connects to a specific iblock. It sees the method. It doesn't see the infrastructure the method lives inside.

AI code review gave false confidence here. That's worse than no review at all — there's now an assumption someone checked it.

My rule for this zone: anything touching cache, invalidation, or queues is human-only territory. AI doesn't get a vote here.

Blind spot 2 — Business logic packed inside a technical diff

The change looks like refactoring. A variable renamed, a condition extracted into a separate method, a chain simplified. Technically clean. AI approves.

What's actually inside: the condition shifted slightly, and it now fires in a case that was previously handled differently. Specifically: orders with a zero price (gifts, test orders) landed in one status before the refactor, and a different status after. The code got cleaner. The behavior changed. AI didn't look at that.

This isn't an AI failure — it's an architectural blind spot. The tool checks whether it's written correctly. It doesn't check whether the behavior is correct.

The only thing that helps here: when reviewing PRs like this, I explicitly ask myself "did the behavior of this code path change for any edge case?" That question needs to come from me, not from the tool.

Blind spot 3 — Works now, breaks in a year

AI code review approves code that works today but creates technical debt that compounds with scale — hard-coded values, duplicated logic, undocumented dependencies. The code works. Tests pass. AI sees no problems.

But there's a pattern in the code that will hurt the team at scale. A hard-coded environment ID that seems like a local decision right now. Duplicated logic between two modules, each "owning" its piece. An initialization order dependency that nobody documented.

AI doesn't know the roadmap. It doesn't know that in six months this module will be reused in a different context. It sees code at a specific moment and can't evaluate it in motion.

Second-order debt isn't visible in the current diff. You only see it if you know where the system is going. AI doesn't know where the system is going.

I catch these by asking: "If someone has to extend this a year from now, what's going to break them?" AI doesn't ask that.

The rule I derived from three months

AI does the first pass on what it can see: syntax, patterns, obvious errors.

I do the second pass on what the change does to the system.

First question: "is this written correctly?" Second question: "does this behave correctly in the system I'm building?" AI answers the first. Only.

Practically: if AI code review found nothing — that doesn't mean the PR is clean. It means there are no obvious technical issues. My review starts exactly from that point.

Three zones I check immediately after the AI pass:

Anything touching cache, queues, composite layer — I review myself, no exceptions.
Any refactor that shifts a conditional — I check edge cases manually.
Any new hard-coded value — I ask "why here and not in config?"

The tool is useful. Exactly as useful as you understand what it can't see.

Earlier I wrote about why Claude writes tests but doesn't choose architecture — the same boundary, applied to the review process. And if you want to see it from the other side: the bug Claude wrote and didn't catch in review.