The Dwarkesh Reference
2026-05-22

Reiner Pope

The math behind how LLMs are trained and served


Predictions (5)

Where they disagreed

Could a "slow mode" where users wait minutes make inference dramatically cheaper?
Dwarkesh: If users accepted much longer latency, inference cost could fall a lot, perhaps toward zero.
Reiner Pope: No. KV-cache reads and compute are per-sequence costs, so they cannot be amortized across the batch. Only the weight loads can, and that amortization saturates at a batch size of roughly 300-2400, leaving a hard cost floor.

Can sparse attention break the ~200K context ceiling set by memory bandwidth?
Dwarkesh: Sparse attention's square-root scaling could open a path to the 100M-token contexts needed for in-context learning to replace continual learning.
Reiner Pope: Skeptical. It helps, but context lengths have stagnated at 100-200K for two years, which suggests the gains may already be priced in, and attention that is too sparse degrades quality.
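
Both disagreements turn on the same arithmetic, which a toy roofline model can make concrete. The sketch below is illustrative only and is not taken from the episode. Every constant is a hypothetical placeholder. It assumes one decode step reads the weights once for the whole batch, reads each sequence's KV cache separately, and spends about 2 FLOPs per parameter per token. It also reads "square-root scaling" as sparse attention touching about sqrt(n) cached tokens per step, which is an assumption about what the phrase means.

```python
import math

# All numbers below are hypothetical placeholders, not measurements.
PARAMS = 100e9              # model parameters (assumed)
BYTES_PER_PARAM = 2         # 16-bit weights (assumed)
KV_BYTES_PER_TOKEN = 100e3  # KV-cache bytes per context token (assumed)
MEM_BW = 3e12               # memory bandwidth in bytes/s (assumed)
PEAK_FLOPS = 1e15           # peak compute in FLOP/s (assumed)


def kv_tokens_read(context_len, sparse=False):
    """Cached tokens read per decode step for one sequence.

    Dense attention reads the whole context. The sparse case assumes
    roughly sqrt(n) tokens are read, one reading of 'square-root scaling'.
    """
    return math.isqrt(context_len) if sparse else context_len


def seconds_per_token(batch, context_len, sparse=False):
    """Estimated seconds per generated token, per sequence, at a given batch size."""
    weight_bytes = PARAMS * BYTES_PER_PARAM  # read once, shared by the batch
    kv_bytes = kv_tokens_read(context_len, sparse) * KV_BYTES_PER_TOKEN  # per sequence
    mem_time = (weight_bytes + batch * kv_bytes) / MEM_BW
    # Matmul FLOPs only; attention FLOPs are ignored in this sketch.
    compute_time = batch * 2 * PARAMS / PEAK_FLOPS
    return max(mem_time, compute_time) / batch


def cost_floor(context_len, sparse=False):
    """Limit of seconds_per_token as batch grows: the per-sequence terms that remain."""
    kv_bytes = kv_tokens_read(context_len, sparse) * KV_BYTES_PER_TOKEN
    return max(kv_bytes / MEM_BW, 2 * PARAMS / PEAK_FLOPS)


if __name__ == "__main__":
    for batch in (1, 30, 300, 3000, 30000):
        print(batch, seconds_per_token(batch, 200_000), cost_floor(200_000))
```

In this toy model the per-token time falls as the batch grows but never drops below `cost_floor`, which is the shape of the cost-floor argument: waiting longer to fill a bigger batch only amortizes the weight reads. Passing `sparse=True` shrinks the per-sequence KV term, which is why sparse attention matters for long contexts, though the model says nothing about the quality cost of attending to fewer tokens.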

Mental models (4)

Claims (4)