Retrieval
Retrieval lets an agent pull relevant information from an external corpus at query time, so it can work with knowledge that isn’t baked into its training weights.
Also known as: RAG (Retrieval-Augmented Generation), Knowledge Retrieval
Understand This First
- Context Window – retrieval’s job is to fill a finite window with the right information.
- Context Engineering – retrieval is one technique within the broader discipline of managing what the model sees.
- Source of Truth – retrieval only works when the corpus is authoritative.
Context
At the agentic level, retrieval is the mechanism that lets an agent answer questions and perform tasks using information it was never trained on. A model knows what it learned during training. Everything that appeared after the training cutoff, everything private to your organization, everything too specific to show up in public datasets — all of it is invisible unless you bring it into the context window.
Retrieval bridges that gap. You maintain a corpus of documents and let the agent fetch relevant pieces at the moment it needs them, instead of retraining the model (expensive, slow, and overkill for most use cases). The agent’s knowledge grows and changes without touching its weights.
Problem
How do you give an agent access to knowledge it wasn’t trained on, without retraining the model or stuffing the entire corpus into the context window?
A developer asks their coding agent to generate a client for an internal API. The model has never seen this API. It can guess at plausible endpoints based on common patterns, but those guesses are hallucinations dressed up as code. The API spec exists in the company’s docs. The model doesn’t know that, and even if it did, the full spec might not fit in the context window alongside everything else the agent needs.
Forces
- Training data has a cutoff. Models don’t know about events, documents, or APIs that appeared after their last training run.
- Private knowledge stays private. Internal documentation, proprietary codebases, and customer data never made it into any training set.
- Context windows are finite. You can’t preload everything the agent might need. You have to pick what matters for the current task.
- Retraining is expensive and slow. Fine-tuning a model on new information takes time, money, and expertise that most teams don’t have for every knowledge update.
- Agents guess when they lack information. A model without the right context doesn’t refuse to answer. It generates something plausible. Plausible is dangerous when it’s wrong.
Solution
Give the agent a way to search an external corpus and pull relevant documents into its context before generating a response. This is retrieval-augmented generation (RAG), and it follows a three-step cycle: retrieve, augment, generate.
Retrieve. When the agent receives a query or encounters a task, the system searches the corpus for documents relevant to the current need. The most common approach is embedding-based search: documents are pre-processed into numerical vectors that capture their meaning, stored in an index, and matched against the query’s vector by similarity. Hybrid search combines this with keyword matching for terms that embeddings handle poorly, like product names or error codes. A re-ranking step can follow, scoring the initial results by finer-grained relevance before passing them forward.
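The retrieve step can be sketched in a few lines. This is a toy illustration, not a production retriever: the bag-of-words `embed` function stands in for a real embedding model, and the `keyword_boost` heuristic stands in for a proper hybrid-search scoring scheme. All names here are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' -- a stand-in for a real
    embedding model (e.g., a sentence transformer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2, keyword_boost=0.5):
    """Hybrid search sketch: embedding similarity plus a flat
    bonus for exact term overlap -- the kind of terms (product
    names, error codes) that embeddings handle poorly."""
    q_vec = embed(query)
    q_terms = set(query.lower().split())
    scored = []
    for doc in corpus:
        score = cosine(q_vec, embed(doc))
        if q_terms & set(doc.lower().split()):
            score += keyword_boost  # exact keyword match bonus
        scored.append((score, doc))
    scored.sort(reverse=True)  # a re-ranking model could refine this ordering
    return [doc for _, doc in scored[:k]]

corpus = [
    "Orders API: POST /orders creates an order; auth via bearer token.",
    "Payments API: POST /payments charges a card.",
    "The platform README covers authentication for all services.",
]
print(retrieve("how do I authenticate against the Orders API?", corpus))
```

A real pipeline would pre-compute and index the document vectors rather than embedding the corpus on every query; the structure of the scoring loop is the same.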
Augment. The retrieved documents are inserted into the agent’s context window alongside the original task. Placement matters: the retrieved text should appear where the model will treat it as reference material, typically after the system instructions and before the specific request. If retrieval returns too much, truncate or summarize to preserve window space for the agent’s own reasoning. Three highly relevant paragraphs outperform twenty loosely related pages.
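A minimal sketch of the augment step, under the placement and budget advice above. For simplicity it budgets in characters; a real pipeline would budget in tokens. The prompt layout and function names are assumptions for illustration.

```python
def build_prompt(system, task, docs, budget_chars=2000):
    """Place retrieved text after the system instructions and
    before the specific request, truncating to a budget so the
    agent keeps room to reason."""
    reference, used = [], 0
    for doc in docs:
        if used + len(doc) > budget_chars:
            doc = doc[: budget_chars - used]  # truncate the overflowing chunk
        reference.append(doc)
        used += len(doc)
        if used >= budget_chars:
            break  # budget exhausted; drop remaining chunks
    return "\n\n".join([
        system,
        "Reference material:\n" + "\n---\n".join(reference),
        "Task: " + task,
    ])

prompt = build_prompt(
    "You are a coding assistant. Prefer the reference material over guesses.",
    "Write a client for the Orders API.",
    ["Orders API spec: POST /orders, GET /orders/{id} ..."],
)
```

Truncating the lowest-ranked chunks first (as here, by iterating in rank order) is the simplest policy; summarizing overflow instead of cutting it is a common refinement.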
Generate. The model produces its response using both its training knowledge and the retrieved material. When retrieval works well, the model cites or draws from the retrieved documents rather than falling back on training-data generalizations. This is grounding: the response is anchored in specific, verifiable source material rather than the model’s parametric memory.
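The grounding described above can be encouraged mechanically: number each retrieved chunk so the model can cite it, and instruct it to answer only from the cited material. This sketch stubs out the model call with a lambda; `generate_grounded` and its prompt wording are illustrative, not any particular library's API.

```python
def generate_grounded(llm, system, task, docs):
    """Assemble a grounded prompt: retrieved chunks are numbered
    so the model can cite them as [n]. `llm` is any callable
    that maps a prompt string to a response -- stubbed here."""
    sources = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, 1))
    prompt = (
        f"{system}\n\nSources:\n{sources}\n\n"
        f"Answer the task using only the sources above, citing [n].\n"
        f"Task: {task}"
    )
    return llm(prompt)

# Stub model for illustration: echoes the final line of the prompt.
echo = lambda p: p.splitlines()[-1]
print(generate_grounded(echo, "You are a support agent.", "Explain refunds.",
                        ["Refunds are issued within 5 business days."]))
```

Citation markers like `[n]` also make verification cheap: a checker can confirm that every cited source actually exists in the retrieved set.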
When building a retrieval pipeline for a coding agent, index your project’s documentation, API specs, and architecture decision records separately from general-purpose knowledge. A small, focused corpus with high relevance beats a massive one where the signal drowns in noise.
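One way to realize the separate-index advice is to scope each search to the indices relevant to the task, so project documentation is never drowned out by general-purpose content. The index names, documents, and the trivial substring retriever below are all hypothetical; any retriever with the same `(query, corpus, k)` shape would slot in.

```python
# Hypothetical setup: one small, focused index per knowledge source.
indices = {
    "api_specs": ["Orders API spec ...", "Payments API spec ..."],
    "adrs": ["ADR-12: Orders uses eventual consistency ..."],
    "general": ["HTTP status codes overview ..."],
}

def substring_search(query, corpus, k):
    """Trivial placeholder retriever: keep docs sharing any query term."""
    terms = query.lower().split()
    return [d for d in corpus if any(t in d.lower() for t in terms)][:k]

def search_scoped(query, scopes, search_fn, k=3):
    """Search only the named indices, merging their results.
    `search_fn(query, corpus, k)` is any retriever function."""
    hits = []
    for scope in scopes:
        hits.extend(search_fn(query, indices[scope], k))
    return hits[:k]

print(search_scoped("Orders client", ["api_specs", "adrs"], substring_search))
```

Which scopes to search can be chosen per task type (code generation hits `api_specs` and `adrs`; a general question falls back to `general`), keeping each query against a corpus where the signal is dense.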
How It Plays Out
A team maintains a microservices platform with 40 internal APIs. They index the OpenAPI specs, README files, and architecture decision records for each service into a retrieval system. When a developer asks their coding agent to write a client for the Orders service, the agent retrieves the Orders API spec, the authentication requirements from the platform README, and an ADR that explains why the service uses eventual consistency. The generated client handles pagination, authentication, and retry logic correctly on the first pass, because the agent worked from the actual spec rather than pattern-matching against public API conventions.
Consider a different case: a customer-facing agent connected to the company’s help center. A customer asks about a billing discrepancy. The agent retrieves the three most relevant support articles, identifies the one that matches the customer’s situation, and responds with the specific steps from that article, including a link to the source. Without retrieval, the agent would have generated generic billing advice that might not apply to this company’s systems at all.
Consequences
Retrieval shifts the knowledge problem from “does the model know the answer?” to “does the corpus contain the right information, and does the retriever surface it?” That’s a different failure mode, and a more tractable one. You can inspect, update, and version a corpus. Training weights are opaque.
Benefits:
- Knowledge stays current without retraining. Update the corpus, and the agent sees the changes on its next query.
- Private and domain-specific information becomes accessible without exposing it during training.
- Responses can be grounded in specific, citable documents. Verifiability goes up.
Liabilities:
- Retrieval quality depends on the indexing pipeline. Poor chunking, stale documents, or a weak embedding model produce irrelevant results, and the model may incorporate them anyway.
- The retrieval corpus becomes a trust boundary. If an attacker can plant documents in the corpus, they can control what the agent retrieves. This is RAG Poisoning.
- Retrieval adds latency. The search step happens before generation, and for large corpora with re-ranking, the delay can be noticeable.
- Developers sometimes treat retrieval as a substitute for good context engineering. Retrieval fetches information; it doesn’t organize, prioritize, or compress it. You still need to manage the context window.
Related Patterns
- Uses: Context Window – retrieved documents compete for space in the window.
- Uses: Context Engineering – retrieval is one technique within context engineering’s toolkit.
- Uses: Source of Truth – the retrieval corpus must be authoritative for results to be trustworthy.
- Contrasts with: Memory – memory persists agent-specific learnings across sessions; retrieval fetches from a shared corpus at query time.
- Contrasts with: Tool – tools take actions and produce side effects; retrieval is read-only information gathering.
- Enables: Verification Loop – retrieved documents provide ground truth the agent can verify its output against.
- Threatened by: RAG Poisoning – the attack that corrupts the retrieval corpus.
- Threatened by: Prompt Injection – retrieved text can contain injected instructions.
- Related: Agent Trap – poisoned retrieval results can lure agents into traps.
Sources
Patrick Lewis and colleagues at Facebook AI Research introduced retrieval-augmented generation in their 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” establishing the retrieve-then-generate pattern as an alternative to ever-larger parametric models.
Anthropic’s contextual retrieval guidance documented practical improvements to the chunking and re-ranking stages, showing that adding context to individual chunks before embedding them significantly improves retrieval accuracy over naive chunking approaches.
The LlamaIndex and LangChain frameworks popularized RAG as a standard building block for agent applications, providing abstractions for the indexing, retrieval, and augmentation pipeline that made the pattern accessible to teams without specialized information retrieval expertise.