Inference is the runtime phase of a machine learning model: the model takes an input and produces an output. Training is the offline process that produces the model's weights; inference is what happens every time anyone uses the model afterwards. For language models, inference is generating text in response to a prompt. For image models, it is producing an image from a description. For embedding models, it is turning text into a vector.
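To make "prompt in, text out" concrete, here is a minimal sketch using the Hugging Face transformers library; the model name ("gpt2") is just a stand-in for any causal language model.

```python
# Minimal sketch of LLM inference: a prompt goes in, generated text comes out.
# Assumes the Hugging Face `transformers` library; "gpt2" is a placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
# generate() runs the forward pass repeatedly, producing one new token per step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```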
Inference for large models is dominated by GPU memory bandwidth and compute. A 70B-parameter model needs about 140GB of memory in FP16 for its weights alone, which usually means multiple GPUs. Inference servers (vLLM, TensorRT-LLM, llama.cpp) drive cost down with techniques like KV caching (storing attention keys and values so earlier tokens are not recomputed at every step), continuous batching (packing requests into shared GPU batches as they arrive and complete), quantization (storing weights at lower precision), and speculative decoding (a smaller draft model proposes several tokens that the large model verifies in one pass).
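The 140GB figure is just parameters times bytes per parameter, and the KV cache grows on top of it with context length and concurrency. A back-of-envelope sketch of that arithmetic, where the layer and head counts are illustrative assumptions rather than any specific model's configuration:

```python
# Rough memory estimate for serving a large model.
# The model dimensions below are illustrative assumptions, not vendor specs.

def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (FP16 is 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: float = 2.0) -> float:
    """KV cache: one key and one value vector per layer, per token, per sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

# 70B parameters in FP16 -> roughly 140 GB just for the weights.
print(f"weights:  {weight_memory_gb(70e9):.0f} GB")

# Hypothetical 70B-class config (80 layers, 8 KV heads of dim 128),
# serving 32 concurrent requests at 4k tokens of context each.
print(f"kv cache: {kv_cache_gb(80, 8, 128, 4096, 32):.0f} GB")
```

Dropping bytes_per_param to 0.5 (4-bit quantization) cuts the weight footprint by 4x, which is why quantization matters so much for fitting models onto fewer GPUs.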
Hosted inference (Anthropic, OpenAI, Together, Anyscale) hides this complexity. Self-hosted inference gives you control but requires real GPU expertise.
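One practical note when weighing the two: many hosted providers and self-hosted servers (vLLM, for example, ships an OpenAI-compatible server) speak the same chat-completions HTTP API, so switching is often little more than a base-URL change. A hedged sketch, with placeholder endpoints and model name:

```python
# Many hosted providers and self-hosted servers expose an OpenAI-compatible
# chat-completions API, so the client code barely changes between them.
# The base URLs and model name below are placeholders.
from openai import OpenAI

hosted = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")
self_hosted = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for client in (hosted, self_hosted):
    resp = client.chat.completions.create(
        model="placeholder-model-name",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)
```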
Inference cost is what you pay every time someone uses your AI feature. Training is a one-time cost; inference is recurring. As model usage scales, inference is where the bills add up. The choice between hosted and self-hosted is mostly about scale, predictability, and engineering capacity.
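To make "scale and predictability" concrete, the comparison can be sketched as a break-even calculation; every price, rate, and volume below is an assumed placeholder, not real provider pricing.

```python
# Back-of-envelope break-even sketch for hosted vs self-hosted inference.
# All figures are placeholder assumptions -- substitute your own provider
# pricing, GPU rates, and measured throughput.

def hosted_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Pay-per-token: cost scales directly with usage."""
    return tokens_per_month / 1e6 * price_per_million

def self_hosted_monthly_cost(gpu_count: int, gpu_hourly_rate: float,
                             hours_per_month: float = 730) -> float:
    """Fixed GPU spend; ignores engineering time, often the biggest hidden cost."""
    return gpu_count * gpu_hourly_rate * hours_per_month

tokens = 2e9  # assumed monthly token volume
print(f"hosted:      ${hosted_monthly_cost(tokens, price_per_million=3.0):,.0f}/month")
print(f"self-hosted: ${self_hosted_monthly_cost(gpu_count=4, gpu_hourly_rate=2.5):,.0f}/month")
```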
The AI Agent Tracker lists inference providers and self-hostable models you can deploy yourself.