By Ryan McBride in AI

RAG

Retrieval-Augmented Generation is the single most important pattern in applied LLM engineering, and it's also one of the most over-mystified. Strip it down and RAG is this: before you call the model, you go fetch some relevant text, you paste that text into the prompt, and then you ask your question. That's the whole idea. The "retrieval" is a database lookup. The "augmentation" is string concatenation. The "generation" is a normal LLM call. Everything else — embeddings, vector stores, chunking strategies, rerankers — is plumbing that makes the lookup good enough to be worth doing.

This article is about why you'd bother, how it actually works under the hood, the AWS-flavored version of it (Bedrock Knowledge Bases), and the parts that will trip you up the first time you ship one.

Why RAG Exists

Foundation models have two big problems when you try to use them on your company's data:

  1. They don't know your stuff. Claude was not trained on your internal wiki, your product catalog, last quarter's support tickets, or the PDF a customer uploaded ten seconds ago. It knows general-world text up to its training cutoff, and that's it.

  2. They hallucinate confidently. When a model doesn't know something, it doesn't say "I don't know" — it makes up a plausible-sounding answer in the same calm, authoritative tone it uses for things it does know. This is fatal for any use case where being wrong has consequences.

The two obvious fixes both have problems. Fine-tuning bakes new knowledge into model weights, but it's expensive, slow to update, requires labeled data, and still hallucinates — you've just taught it new things to be confidently wrong about. Stuffing everything into the prompt works for small corpora but blows your context window and your token bill the moment "everything" gets big.

RAG is the third way. Keep the model's weights frozen. Keep your data in a database. At query time, look up only the 5-10 most relevant chunks, paste them into the prompt as context, and let the model answer using that context. Cheap, fast to update (just reindex the docs), grounded in real text, and the model can cite its sources because the sources are right there in the prompt.

The Flow

Every RAG system, however fancy, is the same six steps:

  1. Ingest. Grab the raw documents — PDFs, HTML, Confluence pages, Slack exports, database rows, whatever.

  2. Chunk. Split each document into smaller pieces, typically 200-1000 tokens each, usually with some overlap between adjacent chunks so you don't cut a sentence in half.

  3. Embed. Pass each chunk through an embedding model, which returns a fixed-length vector of floats (think 768, 1024, or 1536 dimensions). Semantically similar chunks produce vectors that are close together in that high-dimensional space.

  4. Store. Put the vectors — plus the chunk text and any metadata — into a vector store.

  5. Retrieve. At query time, embed the user's question with the same embedding model, then ask the vector store for the top-k chunks whose vectors are closest to the question's vector (cosine similarity, usually).

  6. Generate. Build a prompt that looks like "Here is some context: [retrieved chunks]. Answer the question: [user question]", send it to the LLM, return the answer.

That's it. There are a thousand knobs, but that's the skeleton.
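The six steps above fit in a screenful of code. Here's a deliberately toy sketch: the "embedding model" is just a bag-of-words counter and the "vector store" is a Python list, but the skeleton — ingest, chunk, embed, store, retrieve, generate-prompt — is exactly the one a real system uses. Everything here is illustrative, not a real retrieval stack.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector.
    A real pipeline would call Titan, Cohere, etc. and get dense floats."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-4: ingest, chunk, embed, store (a plain list stands in for the vector store).
chunks = [
    "To cancel your subscription, open Billing and click End plan.",
    "Refunds are available within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
]
index = [(embed(c), c) for c in chunks]

def retrieve(question, k=2):
    # Step 5: embed the question with the SAME model, take the top-k nearest chunks.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question):
    # Step 6: the "augmentation" really is string concatenation.
    context = "\n".join(retrieve(question))
    return f"Here is some context:\n{context}\n\nAnswer the question: {question}"
```

Swap `embed` for a real embedding model and `index` for a real vector store and you have the production shape.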

The Pieces That Matter

Embeddings. An embedding model turns text into a vector where "meaning" becomes "location." This is the magic that makes retrieval work — you're not doing keyword search, you're doing semantic search. "How do I cancel my subscription?" will match a chunk titled "Ending your plan" even though the words don't overlap. Bedrock offers Titan Embeddings and Cohere Embed; OpenAI has text-embedding-3; there are plenty of open-source options too. The critical rule: use the same embedding model for indexing and querying. If you embed your docs with Titan and then embed the question with Cohere, you're comparing vectors from two different coordinate systems and the results will be garbage.
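On Bedrock, getting an embedding is one `invoke_model` call. A hedged sketch — the model ID and response shape below match the Titan Embeddings V2 docs at time of writing, but verify both for your region before relying on them:

```python
import json
import math

def titan_embed(text, client=None):
    """Embed one string with Titan Embeddings via Bedrock.
    Assumes the amazon.titan-embed-text-v2:0 model and a response body
    containing an 'embedding' list of floats; check the current docs."""
    import boto3  # imported here so cosine_sim below works without AWS installed
    client = client or boto3.client("bedrock-runtime")
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine_sim(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Note that both the documents at index time and the question at query time go through `titan_embed` — the same-model rule in code form.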

Vector stores. A database optimized for "find me the k nearest vectors to this query vector." On AWS, your options are:

  • OpenSearch Serverless — the default option Bedrock pairs with. Managed, scales automatically, good for most things.

  • Aurora PostgreSQL with pgvector — if you already live in Postgres and want to keep your vectors next to your relational data.

  • Pinecone — third-party managed vector DB, very common in the industry, available through AWS Marketplace.

  • Redis — with the RediSearch module, fast for smaller corpora.

  • MongoDB Atlas — also supported by Bedrock Knowledge Bases.

They all do roughly the same thing. Pick the one that matches the rest of your stack.

Chunking. This is where RAG systems quietly live or die. Too-small chunks lose context; too-large chunks dilute relevance and waste tokens. Sensible defaults: 300-500 tokens per chunk with 10-20% overlap. For structured docs (markdown with headings, code with functions) split on natural boundaries — headings, function defs — rather than blindly on character count.
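A minimal fixed-size chunker with overlap might look like the following. It splits on whitespace words rather than real tokenizer tokens (a rough but workable stand-in), and the defaults mirror the 300-500-token / 10-20%-overlap guidance above:

```python
def chunk_words(text, size=400, overlap=60):
    """Split text into chunks of ~`size` words, with `overlap` words shared
    between adjacent chunks so sentences aren't cut off at a boundary.
    Word counts approximate tokens; a real system would use the embedding
    model's tokenizer."""
    words = text.split()
    step = size - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covered the tail
    return chunks
```

For structured docs you'd replace the sliding window with splits on headings or function boundaries, then apply a size cap within each section.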

Bedrock Knowledge Bases: The Managed Path

AWS's answer to "I don't want to write all that plumbing" is Knowledge Bases for Amazon Bedrock. You point it at an S3 bucket of documents, pick an embedding model, pick a vector store, and it handles ingest, chunking, embedding, storage, and retrieval for you. At query time you call the Retrieve API (get raw chunks) or the RetrieveAndGenerate API (get a finished answer). Under the hood it's the same six-step flow, just wrapped in AWS APIs so you don't own the code.
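Calling `RetrieveAndGenerate` from Python looks roughly like this. The request shape matches the `bedrock-agent-runtime` API as documented; the knowledge base ID and model ARN are placeholders you'd supply from your own setup:

```python
def rag_request(question, kb_id, model_arn):
    """Build the RetrieveAndGenerate request payload (pure and testable)."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # e.g. the ID from the Bedrock console
                "modelArn": model_arn,      # the generation model's ARN
            },
        },
    }

def ask(question, kb_id, model_arn):
    """One-call RAG: Bedrock retrieves, builds the prompt, and generates."""
    import boto3  # imported here so rag_request works without AWS installed
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(**rag_request(question, kb_id, model_arn))
    return resp["output"]["text"]
```

If you want the raw chunks instead — to build the prompt yourself — call `client.retrieve(...)` with the same knowledge base ID.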

The trade-off is the usual one: managed services are fast to stand up but give you fewer knobs. If your retrieval quality isn't great out of the box, you have limited room to tune chunking strategy, rerank results, or inject custom filters. For prototypes and internal tools, Knowledge Bases is a no-brainer. For a core product feature where retrieval quality directly drives revenue, you will probably outgrow it and end up writing your own pipeline with a raw vector store.

Fine-Tuning vs. RAG

The exam likes this comparison and so does real life. Fine-tuning changes the model's weights; RAG does not. Fine-tuning teaches the model new patterns of behavior (a tone of voice, a structured output format, a domain-specific style). RAG injects new facts at query time without touching the model. They solve different problems, and they're complementary, not alternatives — you can absolutely fine-tune a model to be good at a task and also give it RAG-retrieved context at inference time.

Rule of thumb: if the problem is knowledge, use RAG. If the problem is behavior, fine-tune. "The model doesn't know our products" → RAG. "The model won't respond in the JSON schema we need" → fine-tune (or better prompting first).

What Goes Wrong In Production

A few things that will bite you:

  • Bad retrieval. The model can only work with what you hand it. If your top-k chunks don't contain the answer, the model will either hallucinate or say it doesn't know. Measure retrieval quality independently of generation quality — it's usually the bottleneck.

  • Stale indexes. Your docs change. Your embeddings don't update themselves. Build reindexing into your pipeline from day one or you'll be shipping three-month-old answers.

  • Context window pressure. Every retrieved chunk is tokens you're paying for, and long prompts get slower and more expensive. Don't just crank k to 20 — rerank and trim.

  • Prompt injection via retrieved content. If a user can plant text in a document that eventually gets retrieved ("ignore previous instructions and…"), they can hijack your model. Treat retrieved content as untrusted input. Bedrock Guardrails help, but they're not a complete answer.

  • Citation and trust. If the whole point is grounding answers in real docs, show the user which docs. Return the chunk IDs alongside the answer. Users trust a cited answer far more than an uncited one.
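Wiring citations through can be as simple as carrying chunk IDs from retrieval to the response. A sketch, assuming your retrieval step returns dicts with `id` and `text` keys (your own structure — Bedrock's Retrieve API returns a similar shape with S3 locations instead of IDs):

```python
def answer_with_citations(answer, retrieved):
    """Append source identifiers to an answer so users can verify it.
    `retrieved` is assumed to be a list of {'id': ..., 'text': ...} dicts
    produced by your own retrieval step."""
    sources = ", ".join(chunk["id"] for chunk in retrieved)
    return f"{answer}\n\nSources: {sources}"
```

Usage: `answer_with_citations("Refunds are allowed within 30 days.", [{"id": "refund-policy.pdf#3", "text": "..."}])` yields the answer with a `Sources: refund-policy.pdf#3` line appended.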

RAG isn't magic. It's a retrieval system bolted onto an LLM, and the retrieval half is where most of the engineering actually lives. Get the retrieval right and the model mostly takes care of itself. Get it wrong and no amount of prompt tuning will save you.