2026-05-22

Reiner Pope

The math behind how LLMs are trained and served

Predictions (5)

Sparse attention will become a more widely adopted architecture at frontier labs, with DeepSeek's published mechanism pointing the direction.

Pending · Scale-up domain · 00:32:09

Nvidia's Rubin generation will ship with a scale-up domain of ~500+ GPUs, roughly 4x Blackwell's 72, unlocking larger MoE models in one interconnect domain.

Pending · Context length · 01:33:02

Frontier context lengths will stay roughly in the 100-200K range because memory bandwidth is the hard wall and HBM is not improving fast enough.

Pending · Over-training · 01:18:59

Optimally trained models will serve roughly as many inference tokens as they saw in pre-training, implying current frontier models are ~100x over-trained relative to Chinchilla-optimal.

Pending · KV cache storage · 01:33:02

The one-hour KV-cache pricing tier on frontier APIs likely corresponds to spinning disk; the drain-time math points to it.

Where they disagreed

Could a 'slow mode' where users wait minutes make inference dramatically cheaper?

Dwarkesh: If users accepted much longer latency, inference cost could fall a lot, maybe toward zero.

Reiner Pope: No. KV-cache reads and compute are per-sequence costs that batching cannot amortize; only the weight reads are shared across the batch. Past the weight-amortization point (a batch size of roughly 300-2400) there is a hard cost floor.

Can sparse attention break the ~200K context ceiling set by memory bandwidth?

Dwarkesh: Sparse attention's square-root scaling could open the path to the 100M-token contexts needed for in-context learning to replace continual learning.

Reiner Pope: Skeptical. It helps, but context lengths have stagnated at 100-200K for two years, suggesting the gains may already be priced in, and attention that is too sparse degrades quality.

Mental models (4)

Roofline analysis · 00:00:00

Almost all LLM inference economics (latency floors, cost minimums, batch optima, context-length pricing) follow from two hardware numbers (memory bandwidth, FLOPs) and two model numbers (total params, KV bytes per token).
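As a rough illustration of how those four numbers combine, here is a minimal roofline sketch in Python. It assumes one decode step costs the larger of its compute time and its memory-read time, that every weight is read once per step, and that attention FLOPs are negligible. The class and function names are hypothetical, the hardware figures are placeholders chosen to give a FLOPs-to-bandwidth ratio near 300, and `kv_bytes_per_token` is an invented value; only the 37B/670B parameter counts come from the Claims section of this page.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops_per_s: float      # peak FLOP/s (placeholder, not a vendor spec)
    mem_bytes_per_s: float  # memory bandwidth in bytes/s (placeholder)


@dataclass
class Model:
    active_params: float       # parameters used per token
    total_params: float        # parameters read from memory each step (assumed: all)
    bytes_per_param: float     # 1.0 assumes 8-bit weights
    kv_bytes_per_token: float  # KV-cache bytes per context token (hypothetical)


def decode_step_seconds(hw: Hardware, m: Model, batch: int, context_len: int) -> float:
    """Roofline estimate for one decode step: max(compute time, memory time)."""
    compute = 2.0 * m.active_params * batch / hw.flops_per_s
    weight_bytes = m.total_params * m.bytes_per_param
    kv_bytes = batch * context_len * m.kv_bytes_per_token
    memory = (weight_bytes + kv_bytes) / hw.mem_bytes_per_s
    return max(compute, memory)


def seconds_per_token(hw: Hardware, m: Model, batch: int, context_len: int) -> float:
    """Accelerator time per generated token, a proxy for cost per token."""
    return decode_step_seconds(hw, m, batch, context_len) / batch


# Illustrative inputs only.
HW = Hardware(flops_per_s=1e15, mem_bytes_per_s=3.3e12)
MODEL = Model(active_params=37e9, total_params=670e9,
              bytes_per_param=1.0, kv_bytes_per_token=70e3)

if __name__ == "__main__":
    for batch in (1, 64, 512, 4096):
        t = seconds_per_token(HW, MODEL, batch, context_len=32_000)
        print(f"batch={batch:5d}  seconds/token={t:.2e}")
```

Under these assumptions the per-token time falls quickly as the batch grows, because the weight reads are shared, and then flattens toward a floor set by the per-sequence KV-cache reads and compute.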

KV cache wall · 01:03:37

Unlike weights (distributable by pipelining) and compute (amortizable by batching), KV-cache memory resists both, making it the binding constraint on context length.
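To make the scaling concrete, the sketch below sizes a KV cache using the standard formula for multi-head or grouped-query attention: two tensors (keys and values) per layer, each of size `n_kv_heads * head_dim` per token. The layer count, head count, and head dimension are hypothetical and do not describe any particular model; architectures that compress the cache (for example latent-attention variants) would need a different formula.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per context token (keys + values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_len: int, batch: int, **config) -> int:
    """Total KV cache: linear in both context length and batch size."""
    return batch * context_len * kv_bytes_per_token(**config)


# Hypothetical configuration, for illustration only.
CONFIG = dict(n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    for context_len in (8_000, 200_000, 1_000_000):
        gb = kv_cache_bytes(context_len, batch=1, **CONFIG) / 1e9
        print(f"context={context_len:>9,}  KV cache per sequence = {gb:.1f} GB")
```

Because the total is a plain product of batch size and context length, adding sequences to the batch adds cache rather than sharing it, which is the sense in which this cost resists amortization.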

Pricing as a signal · 01:33:02

Providers must price near marginal cost, so public API pricing tiers directly encode internal architecture choices like KV bytes-per-token and which memory tier backs each cache duration.

Convergent evolution · 02:04:02

Despite opposite goals, ciphers and neural nets independently converged on the same structural motifs (alternating mixing, residual connections, layered nonlinearity), since both need to propagate information across all inputs.

Claims (4)

≈ Approx · DeepSeek V3 · 00:00:00

DeepSeek V3 has roughly 37 billion active and ~670 billion total parameters.

≈ Approx · Arithmetic intensity · 00:00:00

The FLOPs-to-memory-bandwidth ratio on modern accelerators is ~300 and has stayed roughly stable across A100, H100 and B100.
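One way to see why this ratio matters: it sets the batch size at which weight reads stop dominating a decode step. The sketch below assumes about 2 FLOPs per active parameter per token and that every weight is read once per step, then solves for the batch where compute time equals weight-read time. The function name is hypothetical, the bytes-per-parameter values are assumptions, and this is not necessarily how the batch range quoted elsewhere on this page was derived.

```python
def balance_batch(flops_per_byte: float, bytes_per_param: float,
                  total_params: float, active_params: float) -> float:
    """Batch size where compute time equals weight-read time.

    Solves 2 * active * batch / FLOPS == total * bytes_per_param / BW
    for batch, with flops_per_byte = FLOPS / BW.
    """
    return flops_per_byte * bytes_per_param * total_params / (2.0 * active_params)


if __name__ == "__main__":
    # Dense model with 16-bit weights: total == active.
    print("dense, 2 bytes/param:", balance_batch(300, 2, 1e9, 1e9))
    # Sparse MoE with 8-bit weights, using the ~37B active / ~670B total figures.
    print("MoE, 1 byte/param:  ", round(balance_batch(300, 1, 670e9, 37e9)))
```

Under these assumptions a sparser model needs a proportionally larger batch to reach the balance point, because it reads far more weights than it computes with.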

? Unverified · Throughput · 00:00:00

Google's Gemini API was serving on the order of hundreds of millions of tokens per second globally about a year before the recording.

≈ Approx · Over-training math · 01:18:59

Using ~100B active params and ~150T pre-training tokens, frontier models look ~100x over-trained versus Chinchilla-optimal.
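The arithmetic can be checked in a few lines. The sketch assumes the commonly quoted Chinchilla heuristic of about 20 training tokens per parameter; that constant is not stated on this page, and the function name is hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # assumed heuristic, not from the source


def overtraining_factor(active_params: float, train_tokens: float) -> float:
    """Ratio of actual training tokens to the Chinchilla-optimal token count."""
    optimal_tokens = CHINCHILLA_TOKENS_PER_PARAM * active_params
    return train_tokens / optimal_tokens


if __name__ == "__main__":
    factor = overtraining_factor(active_params=100e9, train_tokens=150e12)
    print(f"over-training factor: {factor:.0f}x")
```

With these inputs the factor comes out at 75, which is consistent with the order-of-magnitude "~100x" quoted above.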