Inference is the runtime phase of a machine learning model: the model takes an input and produces an output. Training is the offline process that produces the model's weights; inference is what happens every time anyone uses the model afterwards. For language models, inference is generating text in response to a prompt. For image models, it is producing an image from a description. For embedding models, it is turning text into a vector.
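To make "prompt in, text out" concrete, here is a minimal sketch using the Hugging Face transformers library; the model name ("gpt2") is just a stand-in for any causal language model.

```python
# Minimal sketch of LLM inference: a prompt goes in, generated text comes out.
# Assumes the Hugging Face `transformers` library; "gpt2" is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
# generate() runs the forward pass repeatedly, producing one new token per step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```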
Inference for large models is dominated by GPU memory bandwidth and compute. A 70B-parameter model needs about 140GB of memory in FP16 for its weights alone, which usually means multiple GPUs. Inference servers (vLLM, TensorRT-LLM, llama.cpp) drive cost down with techniques like KV caching (storing attention keys and values so earlier tokens are not recomputed at every step), continuous batching (packing requests into shared GPU batches as they arrive and complete), quantization (storing weights at lower precision), and speculative decoding (a smaller draft model proposes several tokens that the large model verifies in one pass).
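The 140GB figure is just parameters times bytes per parameter, and the KV cache grows on top of it with context length and concurrency. A back-of-envelope sketch of that arithmetic, where the layer and head counts are illustrative assumptions rather than any specific model's configuration:

```python
# Rough memory estimate for serving a large model.
# The model dimensions below are illustrative assumptions, not vendor specs.

def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (FP16 is 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: float = 2.0) -> float:
    """KV cache: one key and one value vector per layer, per token, per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

# 70B parameters in FP16 -> roughly 140 GB just for the weights.
print(f"weights:  {weight_memory_gb(70e9):.0f} GB")

# Hypothetical 70B-class config (80 layers, 8 KV heads of dim 128),
# serving 32 concurrent requests at 4k tokens of context each.
print(f"kv cache: {kv_cache_gb(80, 8, 128, 4096, 32):.0f} GB")
```

Dropping bytes_per_param to 0.5 (4-bit quantization) cuts the weight footprint by 4x, which is why quantization matters so much for fitting models onto fewer GPUs.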
Hosted inference (Anthropic, OpenAI, Together, Anyscale) hides this complexity. Self-hosted inference gives you control but requires real GPU expertise.
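One practical note when weighing the two: many hosted providers and self-hosted servers (vLLM, for example, ships an OpenAI-compatible server) speak the same chat-completions HTTP API, so switching is often little more than a base-URL change. A hedged sketch, with placeholder endpoints and model name:

```python
# Many hosted providers and self-hosted servers expose an OpenAI-compatible
# chat-completions API, so the client code barely changes between them.
# The base URLs and model name below are placeholders.
from openai import OpenAI

hosted = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")
self_hosted = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for client in (hosted, self_hosted):
    resp = client.chat.completions.create(
        model="placeholder-model-name",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)
```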
Inference cost is what you pay every time someone uses your AI feature. Training is a one-time cost; inference is recurring. As model usage scales, inference is where the bills add up. The choice between hosted and self-hosted is mostly about scale, predictability, and engineering capacity.
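To make "scale and predictability" concrete, the comparison can be sketched as a break-even calculation; every price, rate, and volume below is an assumed placeholder, not real provider pricing.

```python
# Back-of-envelope break-even sketch for hosted vs self-hosted inference.
# All figures are placeholder assumptions -- substitute your own provider
# pricing, GPU rates, and measured throughput.

def hosted_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Pay-per-token: cost scales directly with usage."""
    return tokens_per_month / 1e6 * price_per_million

def self_hosted_monthly_cost(gpu_count: int, gpu_hourly_rate: float,
                             hours_per_month: float = 730) -> float:
    """Fixed GPU spend; ignores engineering time, often the biggest hidden cost."""
    return gpu_count * gpu_hourly_rate * hours_per_month

tokens = 2e9  # assumed monthly token volume
print(f"hosted:      ${hosted_monthly_cost(tokens, price_per_million=3.0):,.0f}/month")
print(f"self-hosted: ${self_hosted_monthly_cost(gpu_count=4, gpu_hourly_rate=2.5):,.0f}/month")
```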
The AI Agent Tracker lists inference providers and self-hostable models you can deploy yourself.