Context caching (also called prompt caching) lets you store a long, stable portion of a prompt on the inference provider's side. On subsequent calls that reuse that same prefix, you pay a fraction of the normal input-token cost (as low as 10% with some providers) and get faster time-to-first-token. Anthropic, OpenAI, and Google all support some form of caching, with cache lifetimes typically ranging from 5 minutes to an hour.
When a request includes a prefix marked as cacheable (system prompt, retrieved documents, tool definitions), the provider hashes it, runs the expensive prefill computation once, and stores the resulting key/value attention state. The next request with the same prefix skips the prefill and starts generating from the cached state. The discount applies only to the cached tokens; new tokens (the user's specific question) are billed at normal rates.
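As one concrete sketch, here is roughly what marking a cacheable prefix looks like with the Anthropic Python SDK, where a `cache_control` marker denotes the end of the cacheable prefix (OpenAI caches eligible prefixes automatically and Google uses explicit cached-content objects, so details differ; the model ID and prompt below are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # several thousand tokens of stable instructions and documents

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID; use whichever model you target
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the end of the cacheable prefix: everything up to here is hashed,
            # prefilled once, and its key/value state stored for reuse.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        # Dynamic content comes after the cached prefix and is billed at normal rates.
        {"role": "user", "content": "Summarize today's filings for ACME Corp."}
    ],
)

# Usage metadata reports tokens written to vs. read from the cache on this call.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```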
Cache hits depend on byte-for-byte prefix matching. A single character change at the start of a long system prompt invalidates the cache. Best practice: pin stable content at the front, dynamic content at the back.
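To illustrate the ordering rule, a hypothetical prompt builder (all names here are illustrative, not from any SDK) that keeps the expensive content in an identical leading block and pushes per-request details to the end:

```python
# Stable, rarely-changing content: identical bytes on every call.
STABLE_SYSTEM = (
    "You are a research assistant. Follow the style guide below.\n"
    "...several thousand tokens of instructions and reference material..."
)

def build_messages(user_question: str, todays_date: str) -> list[dict]:
    """Keep the stable block byte-for-byte identical so the prefix match succeeds;
    anything that varies per request (dates, questions, fresh retrievals) goes last."""
    return [
        {"role": "system", "content": STABLE_SYSTEM},
        # Even a timestamp inside the system prompt would break the prefix match,
        # so per-request details ride along with the user turn instead.
        {"role": "user", "content": f"Today is {todays_date}.\n\n{user_question}"},
    ]
```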
For agents that run with large system prompts or calls stuffed with retrieved (RAG) context, caching can cut input cost by 75% or more and roughly halve time-to-first-token. It is often the single highest-leverage optimization available for production LLM workloads.
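A back-of-the-envelope calculation under assumed numbers (a 50,000-token cached prefix, a 500-token question, cache reads billed at 10% of the base input rate; it ignores output tokens and any cache-write surcharge, and real prices vary by provider and model):

```python
# Assumed figures for illustration only; substitute your provider's actual rates.
PRICE_PER_MTOK = 3.00      # dollars per million input tokens at the base rate
CACHED_DISCOUNT = 0.10     # cache reads billed at 10% of the base rate

prefix_tokens = 50_000     # stable system prompt + retrieved documents
dynamic_tokens = 500       # the user's specific question

def input_cost(cached: bool) -> float:
    prefix_rate = PRICE_PER_MTOK * (CACHED_DISCOUNT if cached else 1.0)
    return (prefix_tokens * prefix_rate + dynamic_tokens * PRICE_PER_MTOK) / 1_000_000

uncached = input_cost(False)  # ~$0.1515 per call
cached = input_cost(True)     # ~$0.0165 per call
print(f"savings: {1 - cached / uncached:.0%}")  # ~89% on input tokens for this mix
```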
Premium TerminalFeed endpoints that compose multi-source data also benefit: when consumers reuse their responses as a stable prefix in agent prompts, upstream context caching applies to that content on every repeated call.