Reiner Pope
The math behind how LLMs are trained and served
Predictions (5)
Sparse attention will become a more widely adopted architecture at frontier labs, with DeepSeek's published mechanism pointing the direction.
Nvidia's Rubin generation will ship with a scale-up domain of ~500+ GPUs, roughly 4x Blackwell's 72, unlocking larger MoE models in one interconnect domain.
Frontier context lengths will stay roughly in the 100K-200K token range because memory bandwidth is the hard wall and HBM is not improving fast enough.
Optimally trained models will serve roughly as many inference tokens as they saw in pre-training, implying current frontier models are ~100x over-trained relative to Chinchilla-optimal.
The one-hour KV-cache pricing tier on frontier APIs likely corresponds to spinning disk; the drain-time math points to it.
Where they disagreed
Mental models (4)
Almost all LLM inference economics (latency floors, cost minimums, batch optima, context-length pricing) follow from two hardware numbers (memory bandwidth, FLOPs) and two model numbers (total params, KV bytes per token).
Unlike weights (distributable by pipelining) and compute (amortizable by batching), KV-cache memory resists both, making it the binding constraint on context length.
Providers must price near marginal cost, so public API pricing tiers directly encode internal architecture choices like KV bytes-per-token and which memory tier backs each cache duration.
Ciphers and neural nets independently converged on the same structural motifs (alternating mixing, residual connections, layered nonlinearity) because both must propagate information across all inputs, despite having opposite goals.
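The first two mental models can be sketched as a roofline-style estimate: one decode step takes as long as the slower of reading memory and doing arithmetic. This is a minimal sketch under stated assumptions, not the episode's exact derivation. The class names, `bytes_per_param`, the "2 FLOPs per parameter per token" rule, the dense-model simplification, and all numeric values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    mem_bandwidth: float  # bytes per second
    flops: float          # floating-point operations per second


@dataclass
class Model:
    total_params: float        # parameter count
    bytes_per_param: float     # assumption: storage precision, e.g. 1.0 for 8-bit weights
    kv_bytes_per_token: float  # KV-cache bytes stored per token of context


def decode_step_seconds(hw, model, batch, context_len):
    """Estimate the time of one decode step for `batch` concurrent sequences."""
    # Memory term: weights are read once per step and shared by the whole batch,
    # but every sequence brings its own KV cache, so that part grows with batch.
    weight_bytes = model.total_params * model.bytes_per_param
    kv_bytes = batch * context_len * model.kv_bytes_per_token
    memory_time = (weight_bytes + kv_bytes) / hw.mem_bandwidth
    # Compute term: roughly 2 FLOPs per parameter per generated token.
    # Dense-model simplification; an MoE model would use active params here.
    compute_time = 2 * model.total_params * batch / hw.flops
    return max(memory_time, compute_time)


# Hypothetical numbers, chosen so that flops / mem_bandwidth = 300.
hw = Hardware(mem_bandwidth=3e12, flops=9e14)
model = Model(total_params=1e11, bytes_per_param=1.0, kv_bytes_per_token=1e5)

for batch in (1, 64, 1024):
    step = decode_step_seconds(hw, model, batch, context_len=100_000)
    print(f"batch={batch}: {step:.4f} s per step, {step / batch:.6f} s per sequence")
```

With no context, a larger batch spreads the weight reads over more sequences until compute takes over. With long contexts, the KV term scales with batch, so batching no longer amortizes it, which is the sense in which KV-cache memory becomes the binding constraint.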
Claims (4)
DeepSeek V3 has roughly 37 billion active and ~670 billion total parameters.
The FLOPs-to-memory-bandwidth ratio on modern accelerators is ~300 FLOPs per byte and has stayed roughly stable across A100, H100 and B100.
Google's Gemini API was serving on the order of hundreds of millions of tokens per second globally about a year before the recording.
Using ~100B active params and ~150T pre-training tokens, frontier models look ~100x over-trained versus Chinchilla-optimal.
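The over-training claim can be checked with a few lines of arithmetic. This sketch assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter. That ratio and the function name are assumptions that do not appear in the summary above.

```python
# Assumption: Chinchilla-optimal training uses roughly 20 tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(active_params, training_tokens,
                        tokens_per_param=CHINCHILLA_TOKENS_PER_PARAM):
    """Ratio of actual training tokens to the Chinchilla-optimal token count."""
    optimal_tokens = active_params * tokens_per_param
    return training_tokens / optimal_tokens


# Figures from the claim: ~100B active params, ~150T pre-training tokens.
factor = overtraining_factor(100e9, 150e12)
print(f"{factor:.0f}x")  # prints "75x"
```

Under that assumption the factor comes out at 75x, which matches the quoted ~100x only at order-of-magnitude precision.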