Mixture of Experts (MoE)

AI & MACHINE LEARNING

Quick Definition

Mixture of Experts (MoE) is a model architecture where the network is divided into many "experts" (subnetworks), and a small router selects a few experts to activate for each input token. DeepSeek-V3, for example, has 671 billion total parameters but activates only about 37 billion per token. The result: the capacity benefits of a huge model with the inference cost of a much smaller one. The open-source Mixtral and DeepSeek families use MoE, and frontier GPT and Claude models are widely reported to use it as well.

How it works

A typical MoE layer has N experts (often 8-256) and a gating network (the router) that picks the top-K (often 2-8) for each token. Only those K experts run; the rest sit idle. This is sparse activation: total parameters scale linearly with N, but compute per token scales with K. The router is trained jointly with the experts so it learns to send each token to the experts that handle it best. A minimal sketch of such a layer follows.
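
The snippet below is an illustrative top-K MoE layer in PyTorch, not the implementation of any particular model; the dimensions, expert MLP shape, and variable names are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sketch of a sparse top-K MoE layer (illustrative, not a real model's code)."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (n_tokens, d_model)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # pick K experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the chosen K
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() == 0:
                continue                                 # idle expert: no compute spent
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# Example: 16 tokens of width 512, each flowing through 2 of the 8 experts.
y = MoELayer()(torch.randn(16, 512))
```

Production systems batch tokens per expert and dispatch across devices rather than looping in Python, but the routing logic is the same: compute gate scores, keep the top K, and weight each chosen expert's output by its gate probability.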

MoE introduces engineering complexity: load balancing across experts (so a few experts do not end up handling most of the tokens), expert parallelism across GPUs, and stability during training. Load balancing is usually encouraged with an auxiliary loss, sketched below. The payoff is a better cost/quality curve than a dense model trained with the same compute budget.
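
As a rough illustration of the load-balancing piece, here is a Switch-Transformer-style auxiliary loss; the input names (`router_logits`, `expert_idx`) are assumed to come from a layer like the one above, and the exact formulation varies between models.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    # router_logits: (n_tokens, n_experts) raw router scores
    # expert_idx:    (n_tokens, top_k) indices of the experts chosen per token
    probs = F.softmax(router_logits, dim=-1)
    # f[e]: fraction of tokens dispatched to expert e
    f = F.one_hot(expert_idx, n_experts).float().sum(dim=1).mean(dim=0)
    # p[e]: mean router probability assigned to expert e
    p = probs.mean(dim=0)
    # Minimized when routing is uniform; added to the main loss with a small weight.
    return n_experts * torch.sum(f * p)
```

The loss penalizes the router when both the dispatch counts and the routing probabilities concentrate on a few experts, nudging traffic toward an even spread without dictating which expert handles which token.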

Why it matters

MoE is one of the main reasons frontier models keep getting better while inference cost grows sublinearly with capability. It is the architectural shift that made trillion-parameter models economically viable.

Where you'll see this on TerminalFeed

The AI Agent Tracker lists models that use MoE, including the open-source Mixtral and DeepSeek families.