2026-05-15

Eric Jang

Building AlphaGo from scratch

Predictions (5)

Pending · Forward search in LLMs · 01:45:47

Forward search and simulation to estimate value will make a comeback in LLMs, even if not in AlphaGo's exact MCTS form.

Today's models can't reliably pick the next experiment or do lateral 'return to first principles' thinking; successor models may close this gap.

Pending · Games as training loop · 02:22:16

A verifiable game like Go could be the outer-loop environment for training automated AI researchers, with skills transferring to harder domains.

Pending · Algorithmic multipliers · 02:22:16

Many of KataGo's algorithmic compute multipliers will become irrelevant as GPUs improve; any given multiplier's benefit is transitory.

Pending · RL efficiency · 02:12:02

As RL tasks become longer-horizon, the number of samples per FLOP falls, making naive policy-gradient RL increasingly inefficient. This is a structural problem for agentic training.
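The arithmetic behind this prediction can be sketched in a few lines. This is an illustration only: it assumes a single scalar reward at the end of each episode and a constant compute cost per step, and the function name and the FLOP figure are hypothetical, not numbers from the episode.

```python
def rewards_per_flop(horizon_steps, flops_per_step):
    """Reward samples obtained per FLOP, assuming one reward per episode."""
    if horizon_steps <= 0 or flops_per_step <= 0:
        raise ValueError("horizon_steps and flops_per_step must be positive")
    return 1.0 / (horizon_steps * flops_per_step)


# Hypothetical cost of generating one step of a rollout.
FLOPS_PER_STEP = 1e9

short_task = rewards_per_flop(10, FLOPS_PER_STEP)      # 10-step episode
long_task = rewards_per_flop(10_000, FLOPS_PER_STEP)   # 10,000-step episode

# How many times more reward signal per FLOP the short task provides.
efficiency_ratio = short_task / long_task
print(f"short/long reward samples per FLOP: {efficiency_ratio:.0f}x")
```

Under these assumptions the learning signal per unit of compute shrinks in direct proportion to the horizon. Real agentic rollouts may be worse, since per-step cost often grows with context length.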

Where they disagreed

Does understanding AlphaGo in detail make it less impressive?

Dwarkesh: Yes. Explicit tree search and hand-tuned heuristics make it look engineered rather than emergent; simple RLVR on LLMs is more surprising.

Eric Jang: No. The profundity is that a 10-layer network compresses an intractable search into one forward pass, which is genuinely mysterious.

Is a verifiable outer loop like Go win-rate enough to drive meaningful AI self-improvement?

Dwarkesh: Skeptical. A win-rate loop doesn't capture paradigm-shifting discoveries like scaling laws; we improve only what we measure.

Eric Jang: More optimistic. Go encapsulates many research sub-problems and backstops reward hacking; skills should transfer to harder domains.

Mental models (4)

MCTS as policy improvement · 01:00:33

AlphaGo never has to solve the zero-reward exploration problem: given an accurate value function, MCTS yields a strictly better action label on each move, so training remains supervised learning on improved targets.
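A minimal sketch of that idea, assuming the AlphaGo Zero style of target in which root visit counts are normalized into a move distribution and the policy is trained toward it with cross-entropy. The function names, the four-move toy position, and the visit counts are all hypothetical.

```python
import math


def visits_to_policy_target(visit_counts, temperature=1.0):
    """Normalize MCTS root visit counts into a training target over moves."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    if total == 0:
        raise ValueError("search produced no visits")
    return [p / total for p in powered]


def cross_entropy(target, predicted, eps=1e-12):
    """Supervised loss pulling the raw policy toward the search target."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, predicted))


# Toy position with four legal moves (hypothetical numbers).
raw_policy = [0.25, 0.25, 0.25, 0.25]   # network output before search
visit_counts = [10, 60, 25, 5]          # where search spent its simulations

target = visits_to_policy_target(visit_counts)
loss = cross_entropy(target, raw_policy)
print(target, loss)
```

The search output serves as the label, so every position yields a dense supervised target even when the game outcome is still far away.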

Compressing search · 01:00:33

AlphaGo and AlphaFold show that problems that look NP-hard can be approximately solved by a 10-layer network, suggesting our understanding of hardness is incomplete for problems with exploitable structure.

Bits per sample · 02:12:02

Soft labels (the MCTS visit distribution, or teacher logits) carry far more bits per sample than one-hot RL rewards, explaining why distillation works and why AlphaGo trains on the full visit-count distribution.
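A rough way to see the gap is to compare upper bounds on information per game. The sketch below is a capacity argument, not a measurement: it assumes a 19x19 board with 361 points plus a pass move, and a hypothetical game length of 200 moves.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A win/loss outcome carries at most 1 bit, reached at a 50% win rate.
outcome_bits_per_game = entropy_bits([0.5, 0.5])

# Naming a single best move among 362 options carries at most log2(362) bits.
NUM_ACTIONS = 361 + 1  # board points plus pass (assumption: 19x19 Go)
one_hot_bits_per_move = math.log2(NUM_ACTIONS)

# Hypothetical game length; each move gets its own label.
MOVES_PER_GAME = 200
label_bits_per_game = MOVES_PER_GAME * one_hot_bits_per_move

print(outcome_bits_per_game, one_hot_bits_per_move, label_bits_per_game)
```

A soft label goes beyond the one-hot bound, because a full visit distribution also encodes how the remaining moves rank against each other.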

Research method · 01:45:47

Study scaling laws only after the system works and is bug-free; scaling plots on a broken system describe the scaling of bad data, not the phenomenon.

Claims (4)

≈ Approx · KataGo · 00:00:00

KataGo (2020, David Wu) achieved roughly a 40x reduction in compute to train a strong Go bot tabula rasa versus prior state of the art.

✓ Verified · Network architecture · 01:00:33

AlphaGo Lee (2016) used two separate networks for policy and value; every later variant (Zero, AlphaZero, MuZero) merged them into one network with two heads.

✓ Verified · Value estimation · 00:32:04

AlphaGo Lee mixed the value-network estimate with a real Monte Carlo rollout, a technique dropped in all later AlphaGo variants.
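The mixing step can be sketched as a weighted average of the two leaf estimates. The function name is hypothetical, and the default weight of 0.5 reflects the equal weighting described in the 2016 AlphaGo paper; treat it as an assumption rather than a detail from this episode.

```python
def mixed_leaf_value(value_net_estimate, rollout_outcome, mixing=0.5):
    """Blend a value-network estimate with a rollout result at a search leaf.

    mixing=0.0 uses the value network alone, as later variants do;
    mixing=1.0 uses the rollout outcome alone.
    """
    if not 0.0 <= mixing <= 1.0:
        raise ValueError("mixing must lie in [0, 1]")
    return (1.0 - mixing) * value_net_estimate + mixing * rollout_outcome


# Hypothetical leaf: the network slightly favors the player, the rollout was a win.
leaf_value = mixed_leaf_value(value_net_estimate=0.3, rollout_outcome=1.0)
network_only = mixed_leaf_value(0.3, 1.0, mixing=0.0)
print(leaf_value, network_only)
```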

? Unverified · Replication cost · 01:45:47

Jang received ~$10K of compute from Prime Intellect and spent ~$7K replicating AlphaGo, work that once took a DeepMind team and millions.