Retrieval-Augmented Generation (RAG)

AI & MACHINE LEARNING

Quick Definition

Retrieval-Augmented Generation (RAG) is a pattern that combines a language model with a retrieval system. Before the model answers a question, an external search step finds relevant documents (using embeddings for semantic similarity, keyword search, or both), and those documents are injected into the prompt as context. The model then generates an answer grounded in the retrieved content rather than relying solely on what it learned during training.
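A minimal sketch of the prompt-injection step described above. The `retrieve` and `llm_complete` names in the usage comment are hypothetical stand-ins for a retriever and a model API, not functions from any particular library:

```python
# Sketch of the RAG prompt-assembly step: retrieved documents are injected
# into the prompt as numbered context before the model is called.

def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Concatenate retrieved documents and the question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical usage, with retrieve() and llm_complete() as placeholders:
#   documents = retrieve(question)
#   answer = llm_complete(build_rag_prompt(question, documents))
```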

How it works

A typical RAG pipeline has three stages: (1) Indexing, where documents are chunked, embedded, and stored in a vector database; (2) Retrieval, where a query is embedded and used to find the top-K most similar chunks; (3) Generation, where the retrieved chunks are concatenated with the query and sent to the LLM. The output usually includes citations back to the source documents.
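The toy sketch below walks through all three stages in one place. The hash-based `embed` function is a placeholder for a real embedding model, the in-memory list stands in for the vector database, and the LLM call itself is omitted:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: a deterministic unit vector per bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs are already unit vectors

# Stage 1: indexing — chunk, embed, store (here: an in-memory list).
chunks = [
    "RAG retrieves documents before generation.",
    "Embeddings map text to vectors for similarity search.",
    "BM25 is a sparse keyword-ranking function.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: retrieval — embed the query, take the top-K most similar chunks.
query = "How does similarity search work?"
q_vec = embed(query)
top_k = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

# Stage 3: generation — concatenate context and query for the LLM (call omitted).
context = "\n".join(chunk for chunk, _ in top_k)
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```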

Quality depends heavily on the chunking strategy (chunk size, overlap, semantic vs. fixed splitting), the choice of embedding model, and reranking. Hybrid retrieval, which combines dense embedding similarity with sparse BM25 keyword search, typically outperforms either method alone.
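One common way to merge the two result lists is reciprocal rank fusion (RRF), which works directly on ranks and so avoids normalizing incompatible dense and BM25 scores. A minimal sketch, assuming each retriever returns document IDs sorted best-first:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc3", "doc1", "doc7"]   # from embedding similarity
sparse_ranking = ["doc1", "doc9", "doc3"]  # from BM25 keyword search
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them highly.
```

RRF rewards documents that appear near the top of both lists, so a chunk that is both semantically close and keyword-matched wins; k=60 is the conventional default smoothing constant.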

Why it matters

RAG mitigates two major problems with LLMs: stale training data (the model's knowledge is frozen at its training cutoff) and hallucination (the model confidently making things up). With retrieval in the loop, the model can ground each answer in current, verifiable sources instead of relying on memory alone. Many production AI applications use some form of RAG.

Where you'll see this on TerminalFeed

TerminalFeed's premium agent-context endpoint is essentially a one-call RAG primitive: it returns a curated, citation-rich snapshot of the current world (markets, news, sentiment) in a format ready to drop into a system prompt.
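A usage sketch under loud assumptions: the URL, auth header, and plain-text response shape below are illustrative placeholders, not TerminalFeed's documented API; the point is only the shape of the pattern, fetch the snapshot, then drop it into the system prompt:

```python
# Hypothetical sketch only: endpoint URL, auth header, and response format
# are placeholder assumptions, not a documented TerminalFeed API.

import urllib.request

req = urllib.request.Request(
    "https://terminalfeed.example/v1/agent-context",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
with urllib.request.urlopen(req) as resp:
    snapshot = resp.read().decode("utf-8")

# The snapshot plays the role of the retrieved context in the RAG pattern.
system_prompt = f"You are a market-aware assistant. Current context:\n{snapshot}"
```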