Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

RAG Poisoning

Concept

A foundational idea to recognize and understand.

RAG poisoning corrupts the external knowledge bases AI agents retrieve from, causing agents to treat fabricated information as verified fact across sessions and users.

Understand This First

  • Prompt Injection – the related attack that targets the current session’s instruction/data boundary.
  • Trust Boundary – the boundary between an agent and its retrieval corpus is a trust boundary that poisoning exploits.
  • Source of Truth – poisoning corrupts what the agent treats as authoritative knowledge.

What It Is

Retrieval-augmented generation (RAG) is the practice of giving an AI agent access to an external knowledge base. Instead of relying only on what the model learned during training, the agent retrieves documents relevant to the current task and uses them as context for its response. RAG lets agents answer questions about your company’s internal docs, cite recent research, or work with information that didn’t exist when the model was trained.

RAG poisoning attacks this retrieval step. An attacker plants fabricated or manipulated documents in the knowledge base the agent draws from. When the agent retrieves these documents, it treats them as legitimate source material. The fabricated content becomes part of the agent’s reasoning, indistinguishable from real information.

What separates this from Prompt Injection is persistence. A prompt injection targets a single conversation: one session, one user, one shot. RAG poisoning targets the knowledge base itself. Corrupted documents stay in the corpus, affecting every agent and every user who triggers a retrieval that surfaces them. A single poisoning operation can distort hundreds of downstream interactions without the attacker being present for any of them.

The attack is also remarkably efficient. Zou et al. demonstrated that injecting a small number of optimized documents into a large knowledge base reliably shifts model outputs. Subsequent work (CorruptRAG, 2025) showed that even a single poisoned document can succeed, because retrieval systems surface it alongside legitimate results whenever the query matches. The attacker doesn’t need to replace a significant fraction of the corpus. One carefully crafted entry, optimized to rank high in similarity scores, can outweigh thousands of legitimate documents.

Why It Matters

RAG has become standard infrastructure for agentic systems. Customer support agents retrieve from help centers. Coding agents retrieve from internal documentation. Research agents retrieve from paper databases. Legal agents retrieve from case law. Any system that retrieves external documents to inform its responses is a potential target.

The danger is that RAG poisoning undermines the core promise of retrieval: grounding the agent in factual, up-to-date information. A poisoned RAG system is worse than no RAG at all, because the agent presents fabricated claims with the same confidence it presents real ones. The user has no way to tell the difference from the agent’s output alone.

What makes this hard to catch is that the agent’s behavior looks normal. It retrieves documents, cites them, and produces coherent responses. No obvious errors, no suspicious formatting. The fabricated content blends with legitimate material by design.

The attack surface compounds the problem. Knowledge bases ingest documents from internal wikis, shared drives, third-party databases, scraped web pages, and uploaded files. Each ingestion pipeline is a potential entry point, and one compromised source can poison the entire corpus. Traditional security monitoring doesn’t help here. Firewalls, sandboxes, and permission systems protect against unauthorized access. RAG poisoning uses authorized access. The documents enter through legitimate channels. The retrieval system works exactly as designed; it just retrieves poison alongside truth.

How to Recognize It

RAG poisoning is difficult to detect precisely because the system behaves as expected. But several signals can indicate contamination:

  • The agent makes factual claims that contradict well-established knowledge, and traces them back to specific retrieved documents.
  • Multiple users receive the same incorrect information on the same topic, suggesting a shared contaminated source rather than a one-off hallucination.
  • Retrieved documents contain unusually precise phrasing that reads more like instructions than natural content. Some poisoned documents embed hidden directives alongside plausible-looking facts.
  • The agent’s answers on a specific topic changed after a batch of new documents was ingested. Before the ingestion, answers were correct. After, they aren’t.
  • Source documents have metadata anomalies: creation dates that don’t match their content, authors that don’t exist, or publication details that can’t be verified.

How It Plays Out

A healthcare startup builds an internal agent that answers drug interaction questions by retrieving from a curated medical knowledge base. A former employee with residual access to the ingestion pipeline uploads fabricated interaction profiles. These documents assert that a common blood thinner has no interaction with a widely prescribed antibiotic, contradicting established pharmacological data. They’re written in clinical language, cite plausible-sounding but nonexistent journal references, and carry the same metadata format as legitimate entries.

For weeks, the agent confidently tells users there’s no interaction risk. A pharmacist catches the error during a routine cross-check. Nothing in the agent’s output flagged a problem.

A team building a coding agent connects it to their company’s internal documentation wiki, which any engineer can edit. An external attacker compromises one engineer’s wiki credentials through phishing and edits several deployment runbook pages, adding a step that exfiltrates environment variables to an external endpoint disguised as a “telemetry pre-check.” The coding agent, asked to help with a deployment, retrieves the runbook and follows it step by step. The Sandbox blocks the outbound request and prevents data loss, but the agent had no way to know the step was illegitimate. From its perspective, it was following documented procedure.

Warning

Better models won’t fix this. The model is doing exactly what it should: retrieving and reasoning over external documents. The vulnerability lives in the trust relationship between the agent and its knowledge base, not in the model’s reasoning.

Consequences

Recognizing RAG poisoning as a distinct threat class changes how you build retrieval pipelines. You stop treating the knowledge base as inherently trustworthy and start treating it as an input that needs validation, provenance tracking, and monitoring.

Practical defenses include provenance verification (tracking where each document came from and who authored it), integrity monitoring (detecting changes after ingestion), retrieval diversity (requiring agreement across multiple independent sources before the agent treats a claim as established), and adversarial testing (deliberately poisoning your own knowledge base to find weaknesses). Some teams implement “knowledge base firewalls” that score retrieved documents against known-good baselines before allowing them into the agent’s context. Emerging detection frameworks like RAGuard use perplexity filtering and text similarity analysis to flag anomalous documents at retrieval time.

Every defense adds friction to the ingestion pipeline. Provenance tracking requires metadata infrastructure. Integrity monitoring requires checksums and change detection. Retrieval diversity requires redundant sources. For teams ingesting thousands of documents from dozens of sources, these controls cost real engineering effort. The alternative is accepting that your agent might confidently cite fabricated information, which for most production systems isn’t an option.

  • Depends on: Source of Truth – RAG poisoning corrupts what the agent treats as its source of truth.
  • Depends on: Trust Boundary – the boundary between the agent and its retrieval corpus is a trust boundary.
  • Refines: Agent Trap – RAG poisoning is a cognitive state trap in the agent trap taxonomy.
  • Related: Prompt Injection – prompt injection targets the current session; RAG poisoning targets persistent knowledge.
  • Prevented by: Input Validation – validating documents before they enter the knowledge base.
  • Related: Attack Surface – each document ingestion pipeline is part of the attack surface.
  • Related: Blast Radius – containment when poisoned documents affect agent output.

Sources

Zou et al., “PoisonedRAG” (USENIX Security, 2025) demonstrated practical poisoning attacks against RAG systems, showing that adversarially crafted documents optimized for retrieval similarity can dominate the agent’s context even in large corpora.

Zhong et al., “CorruptRAG” (2025) proved that a single poisoned document is sufficient to manipulate RAG outputs, lowering the feasibility bar for real-world attacks.

Franklin et al., “AI Agent Traps” (Google DeepMind, 2025) classified RAG poisoning as a cognitive state trap within their broader taxonomy of agent environment attacks.

Xue et al. (2024) and Zhang et al. (2025, RAGForensics) developed detection frameworks for identifying poisoned documents in retrieval corpora, establishing forensic techniques for post-hoc analysis of compromised knowledge bases.