Context caching (also called prompt caching) lets you store a long, stable portion of a prompt on the inference provider's side. On subsequent calls that reuse that same prefix, you pay a fraction of the normal input-token cost (as low as 10% with some providers) and get faster time-to-first-token. Anthropic, OpenAI, and Google all support some form of caching, with cache lifetimes typically ranging from 5 minutes to an hour.
When a request includes a prefix marked as cacheable (system prompt, retrieved documents, tool definitions), the provider hashes it, runs the expensive prefill computation once, and stores the resulting key/value attention state. The next request with the same prefix skips the prefill and starts generating from the cached state. The discount applies only to the cached tokens; new tokens (the user's specific question) are billed at normal rates.
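As one concrete sketch, here is roughly what marking a cacheable prefix looks like with the Anthropic Python SDK, where a `cache_control` marker denotes the end of the cacheable prefix (OpenAI caches eligible prefixes automatically and Google uses explicit cached-content objects, so details differ; the model ID and prompt below are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # several thousand tokens of stable instructions and documents

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID; use whichever model you target
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the end of the cacheable prefix: everything up to here is hashed,
            # prefilled once, and its key/value state stored for reuse.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        # Dynamic content comes after the cached prefix and is billed at normal rates.
        {"role": "user", "content": "Summarize today's filings for ACME Corp."}
    ],
)

# Usage metadata reports tokens written to vs. read from the cache on this call.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```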
Cache hits depend on byte-for-byte prefix matching. A single character change at the start of a long system prompt invalidates the cache. Best practice: pin stable content at the front, dynamic content at the back.
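To illustrate the ordering rule, a hypothetical prompt builder (all names here are illustrative, not from any SDK) that keeps the expensive content in an identical leading block and pushes per-request details to the end:

```python
# Stable, rarely-changing content: identical bytes on every call.
STABLE_SYSTEM = (
    "You are a research assistant. Follow the style guide below.\n"
    "...several thousand tokens of instructions and reference material..."
)

def build_messages(user_question: str, todays_date: str) -> list[dict]:
    """Keep the stable block byte-for-byte identical so the prefix match succeeds;
    anything that varies per request (dates, questions, fresh retrievals) goes last."""
    return [
        {"role": "system", "content": STABLE_SYSTEM},
        # Even a timestamp inside the system prompt would break the prefix match,
        # so per-request details ride along with the user turn instead.
        {"role": "user", "content": f"Today is {todays_date}.\n\n{user_question}"},
    ]
```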
For agents that run with large system prompts or calls stuffed with retrieved (RAG) context, caching can cut input cost by 75% or more and roughly halve time-to-first-token. It is often the single highest-leverage optimization available for production LLM workloads.
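A back-of-the-envelope calculation under assumed numbers (a 50,000-token cached prefix, a 500-token question, cache reads billed at 10% of the base input rate; it ignores output tokens and any cache-write surcharge, and real prices vary by provider and model):

```python
# Assumed figures for illustration only; substitute your provider's actual rates.
PRICE_PER_MTOK = 3.00      # dollars per million input tokens at the base rate
CACHED_DISCOUNT = 0.10     # cache reads billed at 10% of the base rate

prefix_tokens = 50_000     # stable system prompt + retrieved documents
dynamic_tokens = 500       # the user's specific question

def input_cost(cached: bool) -> float:
    prefix_rate = PRICE_PER_MTOK * (CACHED_DISCOUNT if cached else 1.0)
    return (prefix_tokens * prefix_rate + dynamic_tokens * PRICE_PER_MTOK) / 1_000_000

uncached = input_cost(False)  # ~$0.1515 per call
cached = input_cost(True)     # ~$0.0165 per call
print(f"savings: {1 - cached / uncached:.0%}")  # ~89% on input tokens for this mix
```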
Premium TerminalFeed endpoints that compose multi-source data also benefit: when consumers reuse their responses as a stable prefix in agent prompts, upstream context caching applies to that content on every repeated call.