Smell (AI Smell)

An AI smell is a surface pattern in model-generated output that suggests the content was produced for plausibility rather than understanding; the word is what lets a reviewer name what they’re looking at before they can prove the output is wrong.

Concept

Vocabulary that names a phenomenon.

What It Is

An AI smell is a recognizable shape in model-generated text or code that hints the model was pattern-matching rather than reasoning about the specific problem in front of it. The output reads fluently. The structure looks right. The conventions are observed. Something is still off, and a practiced reader can feel it before they can articulate it. The smell is that sense of something being off: the diagnostic intuition that earns the reader’s next minute of scrutiny.

The lineage is direct. Kent Beck named code smell in the late 1990s as a deliberate metaphor: not “this code is broken” but “this code points at a deeper problem worth investigating.” A long method, deep nesting, duplicated literals: none of these are bugs on their own; each is a surface indication that something underneath is shaped wrong. AI smell extends that metaphor to model output. A plausible function name that doesn’t exist, a symmetric three-bullet list whose three bullets don’t actually distinguish three things, an error handler that catches and re-throws without doing anything: none of these prove the output is wrong, but each raises a flag worth investigating.

The phenomenon is new in one specific way. Code smells are about the structure a human author chose. AI smells are about the production process the model used. The shapes that show up (fluent prose with hedged commitments, parallel structures that are decorative rather than informative, confidently named identifiers the rest of the codebase has never heard of) are the visible residue of next-token prediction running over a training corpus. The reader who’s read a lot of AI output learns to spot the residue.

It pays to keep three nearby ideas separate, because they get conflated and the conflation is where bad reviews happen:

An AI smell is a property of the output. It says: this artifact looks like it was generated for plausibility. It motivates verification.
An AI tell is a property of the style — em-dashes, “in conclusion,” tripled adjectives, the particular cadence of model prose. Tells point at authorship, not correctness. A polished AI tell can sit on top of completely correct output.
An agent struggle signal is the inverse of the smell. When the model repeatedly fails on a particular module, the failure usually says more about the codebase than about the model. The struggle is a sign that the human-authored code resists local reasoning.

The three look similar at a glance and they need different responses. A smell asks for verification of this specific artifact. A tell asks (mostly) for editorial cleanup or for context about the author. A struggle signal asks for a refactor of the code the agent kept failing on.

Why It Matters

The reason the word AI smell exists is that fluency has gotten cheaper than correctness. A model that has read a billion lines of code can produce a thousand of them on demand, formatted to local convention, named like a senior engineer would name them. The result reads like a finished artifact. The reviewer’s job, the part that hasn’t been automated, is to notice when the finished artifact is finished-looking rather than finished. Without vocabulary for the noticing, the team falls into one of two failure modes: they trust the output uniformly (and ship subtly wrong work into production) or they distrust it uniformly (and lose the productivity the model was supposed to deliver). Neither is the calibrated middle the team needs.

The cost of not having the vocabulary is concrete. A developer reviews an agent-generated API client, sees clean type annotations and a coherent class layout, and clicks approve. Three of the endpoints don’t exist, two of the request bodies are missing required fields, and the auth header is in a format the API doesn’t support. The reviewer’s eye registered “professional-looking client” and stopped there. With the smell vocabulary they would have registered “plausible references,” a specific category that triggers a specific check (compare every endpoint URL and field name against the actual documentation), and the bad merge would likely have been caught in review. The cost came not from generating bad output but from lacking a quick mental tag for the kind of bad output.

For teams the vocabulary is also social. A reviewer who pushes back on an AI-assisted PR needs language that doesn’t sound like “I think your prompt was bad” or “you should have written this yourself.” AI smell is exactly that language. It names a property of the output, not of the author. It says: this particular artifact has a shape that asks for verification, regardless of who or what produced it. Teams that adopt the word find the conversation gets easier. The reviewer isn’t accusing the author of being lazy; the reviewer is naming a smell and asking for the verification the smell calls for.

The deeper work the word does is to mark the limit of self-review. The most reliable AI smells are exactly the ones the model is least equipped to spot in its own output, because the smell is the model’s default mode. Asking the model to check whether its work pattern-matches from training is asking it to use the same machinery that produced the work; the second pass will tend to be as confident as the first. The reviewer who detects AI smells is doing work the agent can’t do for itself, and that’s why the role exists. Treating AI smell detection as a human capability, like code smell detection, is what makes the verification loop load-bearing rather than ceremonial.

How to Recognize It

You’re looking at an AI smell whenever the output’s confidence and the output’s evidence don’t match: when the writing or code reads more sure of itself than the underlying material would warrant. The most useful smells have specific shapes a practiced reader can pick out fast.

Plausible but fabricated references. The model names a function, library, configuration option, command-line flag, or API endpoint that follows the naming conventions of real ones but doesn’t actually exist. The smell signal is that you can’t immediately recall the thing being referenced, and a quick search comes up empty or returns a similarly-named-but-different thing. This is the canonical hallucination shape and the easiest to verify: grep, search the docs, run --help. The check costs a minute and catches the majority of fabricated references before they reach a teammate.
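For Python identifiers, the "does this name resolve?" check described above can be automated in a few lines. What follows is a minimal sketch, not a complete tool: the helper name `resolves` is hypothetical, it only covers importable dotted names (not CLI flags or HTTP endpoints), and it imports the modules it checks, so it should only be pointed at packages you already trust.

```python
import importlib


def resolves(dotted_name: str) -> bool:
    """Return True if a dotted name such as "json.loads" can be imported and looked up."""
    parts = dotted_name.split(".")
    # Try the longest module prefix first, then walk the remaining parts as attributes.
    for split in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except (ImportError, ValueError):
            continue  # not a module at this prefix length; try a shorter one
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


for name in ("json.loads", "json.parse", "os.path.join"):
    print(name, resolves(name))
```

Here `json.parse` is the plausible-but-fabricated case: it follows the naming conventions of a real function (JavaScript has `JSON.parse`), but Python's `json` module has no such attribute, so the check reports it as unresolved.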

Symmetry without substance. The output produces a beautifully parallel structure — three bullets, each with the same template — but the three items don’t actually illustrate three different things. They illustrate the shape of three different things, which is a different claim. The smell signal is that you can swap two of the bullets without changing the section’s meaning. Real lists have an order that matters. Decorative lists don’t.

Confident hedging. Phrases like “this is generally considered best practice,” “most developers agree,” “in most cases this approach is preferred,” or “industry consensus is that…” — language that sounds authoritative but commits to nothing falsifiable. The smell signal is that you can’t replace the hedge with a specific person, study, or context without the sentence losing its claim. Real authority names the source. Confident hedging averages across training data.

Cargo-cult patterns. The output applies a design pattern (dependency injection, observer, middleware chain, repository, adapter) because the pattern is common in similar codebases, not because the current problem requires it. The smell signal is that the pattern is structurally present but doesn’t seem to be solving any problem the simpler form wouldn’t already solve. See YAGNI and Speculative Generality: the agent’s version of these traps is the same trap, generated faster.
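As an illustration of the shape, the sketch below puts a repository-plus-injected-service layout next to the plain function that does the same job. Every name in it (`GreetingRepository`, `GreetingService`, `greet`) is hypothetical and chosen only for the example; the assumption is that there is exactly one data source and one caller, which is the situation where the extra layers solve nothing.

```python
from abc import ABC, abstractmethod


# Pattern-heavy version: an abstract repository, one implementation, an injected service.
class GreetingRepository(ABC):
    @abstractmethod
    def get(self, lang: str) -> str: ...


class InMemoryGreetingRepository(GreetingRepository):
    def __init__(self) -> None:
        self._data = {"en": "Hello", "fr": "Bonjour"}

    def get(self, lang: str) -> str:
        return self._data[lang]


class GreetingService:
    def __init__(self, repo: GreetingRepository) -> None:
        self._repo = repo

    def greet(self, lang: str, name: str) -> str:
        return f"{self._repo.get(lang)}, {name}"


# Simpler form: same behavior, no indirection.
GREETINGS = {"en": "Hello", "fr": "Bonjour"}


def greet(lang: str, name: str) -> str:
    return f"{GREETINGS[lang]}, {name}"


print(GreetingService(InMemoryGreetingRepository()).greet("fr", "Ada"))
print(greet("fr", "Ada"))
```

Both calls produce the same string. The question to ask in review is whether a second repository implementation or a second caller actually exists or is planned; if not, the abstraction is structurally present without a problem to solve.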

Shallow error handling. The output wraps operations in try/catch blocks, adds error-return paths, or attaches .catch() handlers, but the handling logic is generic: log the error and re-throw, return a default value that’s never going to be correct, swallow the exception silently. The smell signal is that the error handling tells you nothing about what error was anticipated or how the system should recover. Real error handling is specific: this exception means this thing went wrong, here’s why we expected it, here’s the recovery. Generic handling suppresses errors rather than handling them.
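The contrast is easier to see side by side. The sketch below is illustrative only: the function names, the `DEFAULT_CONFIG` value, and the policy (a missing file is expected, a corrupt one is not) are assumptions made for the example, not taken from any particular codebase.

```python
import json
from pathlib import Path

DEFAULT_CONFIG = {"retries": 3}  # hypothetical default, for illustration


def load_config_shallow(path: str) -> dict:
    # Smell: catches everything, names no anticipated failure, and returns a
    # default that hides both a missing file and a corrupt one.
    try:
        return json.loads(Path(path).read_text())
    except Exception:
        return {}


def load_config_specific(path: str) -> dict:
    # A missing file is expected on first run: fall back to the defaults.
    try:
        text = Path(path).read_text()
    except FileNotFoundError:
        return dict(DEFAULT_CONFIG)
    # A corrupt file is not expected: fail loudly and say which file is bad.
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"config file {path} is not valid JSON: {exc}") from exc
```

The shallow version returns `{}` for a missing file and for a corrupt one, so the caller cannot tell the two apart. The specific version encodes a decision about each failure, and a reviewer can agree or disagree with that decision because it is written down.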

Tests that mirror the implementation. The output writes tests that look thorough — multiple cases, good coverage on paper — but the tests pass because they assert the same logic the implementation runs, not the requirement the implementation was supposed to meet. The smell signal is that the tests would also pass if the implementation were subtly wrong, as long as the wrongness were consistent. Real tests anchor to a specification or expected behavior the reader can articulate without looking at the implementation. Mirror tests anchor to the code, and verify only that the code does what the code does.
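A small sketch makes the difference concrete. The requirement here is hypothetical ("orders of 100 units or more get 10% off"), as are the function names; the implementation deliberately contains an off-by-one so the two styles of test can be compared.

```python
def order_total_cents(unit_price_cents: int, quantity: int) -> int:
    """Hypothetical requirement: orders of 100 units or more get 10% off."""
    total = unit_price_cents * quantity
    if quantity > 100:  # subtle bug: the requirement says "100 or more"
        return total * 90 // 100
    return total


def mirror_test() -> bool:
    # Recomputes the expected value with the same logic as the implementation,
    # so it inherits the same off-by-one and passes anyway.
    for quantity in (1, 100, 101):
        total = 250 * quantity
        expected = total * 90 // 100 if quantity > 100 else total
        if order_total_cents(250, quantity) != expected:
            return False
    return True


def spec_test() -> bool:
    # Expected values worked out by hand from the requirement, not from the code.
    cases = {1: 250, 100: 22_500, 101: 22_725}
    return all(order_total_cents(250, q) == want for q, want in cases.items())


print("mirror test passes:", mirror_test())
print("spec test passes:", spec_test())
```

The mirror test passes and the spec test fails, both against the same buggy implementation. The only difference is where the expected values came from: the boundary case at exactly 100 units was computed from the requirement (25,000 cents less 10% is 22,500) rather than from the code.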

Unreviewed output passed straight to a teammate. A developer takes whatever the agent produced, glances at it for ten seconds, and opens a pull request. The teammate on the receiving end now has to understand code the author never understood. This is a team smell, not an output smell — the code might even be correct — but the author can’t answer a single question about why it’s structured the way it is, and the reviewer’s time absorbs the cost the author didn’t pay. You’re the agent’s editor before you’re anyone else’s author; don’t pass on work you wouldn’t vouch for.

Agent struggle as a code-quality signal

The smells above are properties of the agent’s output. There’s an inverse worth knowing, because it works in the other direction. When the model repeatedly struggles with a particular module (keeps misunderstanding the control flow, keeps introducing the same class of bug, keeps asking clarifying questions about the same area), the struggle itself is a signal about the code, not about the model.

Modules with poor Local Reasoning properties (hidden state, implicit conventions, tangled dependencies, mysteriously named variables, naming that survived three refactors past the system that motivated it) trip up new team members and trip up agents in the same way. The new team member at least has a Slack channel to ask in; the agent guesses, and the guess goes into a PR. A codebase where agents perform consistently well is usually a codebase where humans perform well too. A codebase where the agent keeps failing in one specific place is signaling that the place needs a refactor, not that the model is bad.

This reframes the post-mortem on a failed agent task. Instead of asking “why is the agent so bad at this?” the question becomes “what is it about this code that resists being worked on?” The first question has no good answer (the model is what it is). The second question has many good answers, and most of them are improvements the codebase should have had anyway.

Warning

The most dangerous AI smell is code that works perfectly for the tests the agent generated alongside it. The tests were written from the same understanding the implementation was written from, so any blind spot in the implementation is mirrored in the test suite. Always anchor at least a few tests yourself, written before the agent runs, against the requirement the feature is supposed to meet. Those are the tests the agent can’t accidentally satisfy with mirrored work.

How It Plays Out

A developer asks an agent to integrate with a third-party billing API. The agent produces a clean client class with methods for every endpoint, type definitions for every request and response body, and a tasteful retry wrapper around the network calls. The developer skims the diff, notices the structure matches their team’s house style, and approves. A week later the on-call engineer is debugging why no charges have actually posted. The base URL was for the API’s marketing-microsite domain, not the API itself. Two endpoints don’t exist. The auth header is sent as X-Api-Key instead of Authorization: Bearer. None of it was malicious or even particularly hard to spot; the developer’s eye registered “professional-looking client” and didn’t look further. The smell that was present and missed: plausible-but-fabricated references. The fix in the codebase was a half-day refactor; the fix in the team’s process was to add “compare every endpoint URL and field name against the API documentation tab” to the AI-assisted-PR review checklist.
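The checklist item the team added can be as simple as a set difference. In the sketch below, every method, path, and name is hypothetical; the assumption is that a human copies the documented endpoints from the vendor's documentation by hand, so the reference list does not come from the same model that wrote the client.

```python
# Endpoints copied by hand from the vendor's documentation (hypothetical values).
DOCUMENTED = {
    ("POST", "/v1/charges"),
    ("GET", "/v1/charges/{id}"),
    ("POST", "/v1/refunds"),
}

# Endpoints the generated client actually calls (hypothetical values).
CLIENT_CALLS = {
    ("POST", "/v1/charges"),
    ("GET", "/v1/charges/{id}"),
    ("POST", "/v1/charges/{id}/capture_all"),
    ("GET", "/v1/invoices/summary"),
}


def undocumented(calls: set, documented: set) -> list:
    """Return the calls with no matching documentation entry, sorted for stable output."""
    return sorted(calls - documented)


for method, path in undocumented(CLIENT_CALLS, DOCUMENTED):
    print(f"not in docs: {method} {path}")
```

This flags the two client calls that have no counterpart in the documentation. It does not check the base URL, request fields, or the auth header format; each of those needs its own comparison against the docs.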

A team adopts an agent for documentation. Every function in the public API gets a docstring overnight. The docstrings are fluent, well-formatted, and follow the same template: “This function takes X and returns Y. It handles Z errors gracefully.” A senior engineer reads a few of them and notices something: the docstrings restate the function signatures and add no information. “Returns the list of users” tells the reader nothing they couldn’t derive from def get_users() -> list[User]. The smell present and named: symmetry without substance. The agent produced text that satisfied the structural requirement (every function has a docstring) without satisfying the substantive requirement (the docstring should tell a reader something the signature doesn’t). The team’s correction wasn’t to ban agent-written docs; it was to add a specific check to the prompt, “every docstring must include either an example, an edge case, or a non-obvious precondition that the signature alone doesn’t reveal,” and to spot-check that the check held.
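To make the docstring contrast concrete, here is a sketch with the same function body documented two ways. The `User` fields, the sample data, and the filtering behavior are assumptions invented for the example; only the `get_users() -> list[User]` signature comes from the scenario above.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    active: bool


_USERS = [User("ada", True), User("bob", False)]  # hypothetical sample data


def get_users_restated() -> list[User]:
    """Returns the list of users."""
    return [u for u in _USERS if u.active]


def get_users() -> list[User]:
    """Return active users only, in insertion order.

    Deactivated accounts are filtered out, so the result can be empty even
    when the underlying store is not.
    """
    return [u for u in _USERS if u.active]


print([u.name for u in get_users()])
```

The first docstring restates the signature, and here it is also misleading, because the function does not return every user. The second states the non-obvious precondition a caller would otherwise have to discover by reading the body.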

A platform team notices that the agent keeps producing broken code in their billing module specifically. Every modification needs multiple correction cycles. The instinctive read is that the agent is bad at billing. A new hire reports the same experience independently. The team investigates the module itself and finds: a configuration value whose meaning changes depending on which day-of-month it’s read, three implicit couplings to systems that were retired in different years and aren’t mentioned in any documentation, a function named process_invoice that does five distinct things depending on a flag in the third argument. The signal that was being read backwards: the agent’s struggle was an agent-struggle signal, not an output smell. The agent was a fast reader of code that the codebase had quietly decided to make unreadable. The work was a refactor of the billing module (eight days, two engineers), and after the refactor the agent produced clean billing changes on the first try. The agent didn’t get better. The code did.

Example Prompt

“Review the API client you just produced. For every endpoint, request field, and authentication header in your code, confirm that the documentation I shared explicitly mentions it. Flag every value that you inferred from naming conventions rather than read from the docs, and either remove it or annotate it as unverified.”

Consequences

Naming the AI smell as a category of signal, distinct from the broader “model is bad at this” judgment, changes what the team’s review investment is for. Reviewers stop reading agent-assisted PRs as if they were normal human PRs (where the question is “do I trust this author’s judgment?”) and start reading them as if they were submissions to a journal (where the question is “what claims are being made, and which ones have been verified?”). The two reading modes ask for different things, and naming the smell is what flips the mode.

Benefits. A team that has internalized AI smells reviews faster, not slower. The smell vocabulary gives the reviewer specific things to look for, and the specific things have specific checks — grep for the function name, scan the test for what it actually asserts, read the error handler and ask what concrete error it expected. The reviewer doesn’t have to re-read the whole PR with maximum suspicion; they have to run the checks the smells call for. The signal-to-noise of review time goes up. Authors of AI-assisted PRs learn to run the same checks before they open the PR, so the smells get caught in the author’s own pass and the review backlog shrinks. The team’s calibration between trusting and verifying becomes a written discipline rather than a felt mood, and over months the felt mood follows the discipline.

Liabilities. Smell detection has a cost the agent’s speed advantage was supposed to retire. A team that’s serious about review pays back some of the productivity the model delivers. The right calibration isn’t zero — a team that does no smell-checking ships subtle bugs into production and pays for them downstream, which is always more expensive than catching them in review. But the right calibration also isn’t to read every line as if it were a hostile submission. Teams that overshoot the discipline end up doing manual review of every character the agent wrote, which is roughly equivalent to having written it themselves, only with worse motivation. The investment that pays is in which smells the team checks for and how fast the checks are. Cheap, specific checks (does this function name resolve? do these endpoint URLs exist?) earn their cost on every PR. Expensive, vague checks (does this design feel right?) burn time without producing decisions.

There’s a social dimension that doesn’t go away. The norm a team needs is that questioning AI output is part of the job, not a failure of the prompter. AI smells are inherent to how models work; finding one isn’t evidence of bad prompting, just as finding a code smell isn’t evidence of a bad engineer. But the author of a change still owns it. The agent isn’t a co-author the team can blame when a review goes badly; it’s a tool whose output passes through the author. A team where “the agent wrote it” becomes an excuse for unreviewed code is already past the point where the smell vocabulary will save them, because the smell vocabulary works only when somebody is actually reading.

For agentic coding the shape of the discipline matters. The smells aren’t a static list; the model’s output evolves, new shapes show up, old ones get suppressed by post-training. A team that treats the smell vocabulary as a living document, revisited periodically to check what new shapes have shown up in the last quarter’s reviews, will keep the vocabulary current. A team that learned the smells once in 2025 and never revisited the list will, by 2027, be checking for the wrong things while a different category of smell ships past them. The investment is small (a half-day every few months); the savings are large.

Sources

The word smell — as in “surface symptom that points at a deeper problem worth investigating” — is owed to Kent Beck, who coined code smell in the late 1990s; the canonical write-up is the smells chapter in Martin Fowler’s Refactoring: Improving the Design of Existing Code (1999), where Fowler attributes the metaphor to Beck and catalogs the original list (Long Method, Large Class, Duplicated Code, and the rest). This article extends that metaphor to model-generated output, but the move — naming surface symptoms before naming root causes, so the reviewer has something to look at before they have a diagnosis — is Beck’s and Fowler’s.
Wikipedia editors compiled “Signs of AI Writing” (2025), a working field guide built up across thousands of edits and review threads. Many of the specific shapes named here — confident hedging, symmetry without substance, fluent-but-generic prose — align with patterns documented there. The guide’s contribution is the catalog of prose tells; this article borrows the catalog and extends it to code output.
Adam Tornhill and the CodeScene team published “AI-Ready Code: How Code Health Determines AI Performance” (2026), reporting empirically that AI agents produce more defects in unhealthy code than in healthy code, and that the difference is large. Their measurements support the “agent struggle as code-quality signal” framing: when agents fail repeatedly in a module, the module’s structural health is usually the root cause, and the agent’s struggle is a faster-to-observe readout of debt that experienced engineers had quietly learned to work around. The remedy their results recommend — refactor the resistant module rather than blame the model — is the same remedy this article names.
The underlying production framing — that a language model’s output is the residue of next-token prediction running over a training corpus, and that the shapes of the residue are diagnostic — runs through the broader practitioner conversation around production-grade agent loops. There’s no single originating work; the discipline of reading model output as production residue rather than as authored prose is something a generation of reviewers picked up empirically.
