The Dwarkesh Reference

Mental model

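Weights can be distributed across devices by pipelining, and compute can be amortized by batching. KV-cache memory resists both: every sequence in a batch carries its own cache, and that cache grows with every token of context. That makes it the binding constraint on context length.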
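A rough back-of-the-envelope sketch of the batching half of this claim. The model dimensions below are hypothetical, chosen only to give round numbers, and are not taken from the talk. The formula assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. Pipelining is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # All values are illustrative assumptions, not figures from the source.
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2          # e.g. a 16-bit float
    n_params: int = 8_000_000_000


def kv_cache_bytes(shape: ModelShape, context_len: int, batch_size: int) -> int:
    """Total KV-cache size: keys and values for every layer, head, token, and sequence."""
    per_token = (
        2  # one key vector plus one value vector
        * shape.n_layers
        * shape.n_kv_heads
        * shape.head_dim
        * shape.bytes_per_value
    )
    return per_token * context_len * batch_size


def weight_bytes(shape: ModelShape) -> int:
    """Weight memory is fixed: it does not depend on batch size or context length."""
    return shape.n_params * shape.bytes_per_value


def per_sequence_bytes(shape: ModelShape, context_len: int, batch_size: int):
    """Memory attributable to each sequence: (share of weights, own KV cache)."""
    weights_share = weight_bytes(shape) / batch_size
    kv_share = kv_cache_bytes(shape, context_len, batch_size) / batch_size
    return weights_share, kv_share


if __name__ == "__main__":
    shape = ModelShape()
    for batch in (1, 8, 64):
        w, kv = per_sequence_bytes(shape, context_len=8192, batch_size=batch)
        print(f"batch={batch:3d}  weights/seq={w / 2**30:6.2f} GiB  kv/seq={kv / 2**30:6.2f} GiB")
```

Under these assumed dimensions, each token costs 128 KiB of cache, so one 8,192-token sequence needs 1 GiB. Growing the batch shrinks each sequence's share of the weights, but the per-sequence KV figure does not move, and doubling the context doubles it.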

Who: Reiner Pope
Topic: KV cache wall
Source: Reiner Pope — The math behind how LLMs are trained and served (01:03:37)