Eric Jang
Building AlphaGo from scratch
Predictions (5)
Forward search and simulation to estimate value will make a comeback in LLMs, even if not in AlphaGo's exact MCTS form.
Today's models can't reliably pick the next experiment or do lateral 'return to first principles' thinking; successor models may close this gap.
A verifiable game like Go could be the outer-loop environment for training automated AI researchers, with skills transferring to harder domains.
Many of KataGo's algorithmic compute multipliers will become irrelevant as GPUs improve; any given multiplier's benefit is transitory.
As RL tasks become longer-horizon, samples per FLOP fall, making naive policy-gradient RL increasingly inefficient: a structural problem for agentic training.
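A back-of-the-envelope sketch of the samples-per-FLOP prediction above. It assumes each episode yields a single scalar reward and that every generated token costs a fixed number of FLOPs. The function name, the model size, and the horizons are illustrative assumptions, not figures from the episode.

```python
def reward_samples_per_flop(horizon_tokens: int, flops_per_token: float) -> float:
    """Reward samples obtained per FLOP when each episode yields one scalar reward.

    Assumption (not from the episode): a rollout costs
    horizon_tokens * flops_per_token, and the only learning signal
    is a single end-of-episode reward.
    """
    if horizon_tokens <= 0 or flops_per_token <= 0:
        raise ValueError("horizon_tokens and flops_per_token must be positive")
    return 1.0 / (horizon_tokens * flops_per_token)


# Hypothetical cost: roughly 2 FLOPs per parameter per token, 1e9 parameters.
FLOPS_PER_TOKEN = 2 * 1e9

for horizon in (100, 10_000, 1_000_000):
    rate = reward_samples_per_flop(horizon, FLOPS_PER_TOKEN)
    print(f"horizon={horizon:>9,} tokens -> {rate:.2e} reward samples per FLOP")
```

Under these assumptions the rate falls in direct proportion to the horizon: a task 100 times longer delivers 100 times fewer reward samples for the same compute.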
Where they disagreed
Mental models (4)
AlphaGo never has to solve the zero-reward exploration problem: with an accurate value function, MCTS yields a strictly better action label at each move, so training stays supervised learning on improved targets.
AlphaGo and AlphaFold show that problems that look NP-hard can be approximately solved by a 10-layer network, suggesting our understanding of hardness is incomplete for problems with exploitable structure.
Soft labels (the MCTS visit distribution, or teacher logits) carry far more bits per sample than one-hot RL rewards, explaining why distillation works and why AlphaGo trains on the full visit-count distribution.
Study scaling laws only after the system works and is bug-free; scaling plots on a broken system describe the scaling of bad data, not the phenomenon.
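The soft-label point above can be illustrated with a minimal sketch. It compares the gradient a softmax policy receives from a full visit-count distribution with the gradient from a one-hot label on the most-visited move. The visit counts and helper names are hypothetical, and the one-hot case is a simplified stand-in for a sparse signal, not a model of any specific RL algorithm.

```python
import math


def visit_distribution(visit_counts):
    """Normalize MCTS visit counts into a soft policy target that sums to 1."""
    total = sum(visit_counts)
    if total <= 0:
        raise ValueError("need at least one visit")
    return [n / total for n in visit_counts]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def logit_gradient(target, logits):
    """Gradient of cross-entropy(target, softmax(logits)) w.r.t. the logits.

    For a softmax policy and a target that sums to 1, this is
    predicted minus target.
    """
    predicted = softmax(logits)
    return [p - t for p, t in zip(predicted, target)]


# Hypothetical visit counts for four candidate moves after one search.
counts = [60, 25, 10, 5]
soft_target = visit_distribution(counts)
best = counts.index(max(counts))
one_hot_target = [1.0 if i == best else 0.0 for i in range(len(counts))]

uniform_logits = [0.0, 0.0, 0.0, 0.0]  # an untrained policy
soft_grad = logit_gradient(soft_target, uniform_logits)
hard_grad = logit_gradient(one_hot_target, uniform_logits)

print("soft target gradient:   ", [round(g, 3) for g in soft_grad])
print("one-hot target gradient:", [round(g, 3) for g in hard_grad])
```

With the one-hot target, the three non-chosen moves receive identical gradients, so the sample says nothing about how they rank against each other. The soft target gives each move a different gradient, which is one way to see the extra information per sample.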
Claims (4)
KataGo (2020, David Wu) achieved roughly a 40x reduction in compute to train a strong Go bot tabula rasa versus prior state of the art.
AlphaGo Lee (2016) used two separate networks for policy and value; every later variant (Zero, AlphaZero, MuZero) merged them into one network with two heads.
AlphaGo Lee mixed the value-network estimate with a real Monte Carlo rollout, a technique dropped in all later AlphaGo variants.
Jang received ~$10K of compute from Prime Intellect and spent ~$7K replicating AlphaGo, work that once took a DeepMind team and millions of dollars.
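The AlphaGo Lee claim above about mixing the value-network estimate with a rollout can be sketched as a simple weighted average. The function and parameter names are hypothetical. The 0.5 default is the mixing weight reported in the original AlphaGo paper, not a figure from this episode. Setting the weight to 0 approximates later variants, which rely on the value head alone.

```python
def mixed_leaf_value(value_net_estimate: float,
                     rollout_outcome: float,
                     lam: float = 0.5) -> float:
    """Blend a value-network estimate with a Monte Carlo rollout result.

    lam = 0.0 uses only the value network (as in later AlphaGo variants);
    lam = 1.0 uses only the rollout outcome.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * value_net_estimate + lam * rollout_outcome


# Hypothetical leaf: the network predicts a slight edge (+0.2), the rollout ends in a win (+1).
print(mixed_leaf_value(0.2, 1.0))           # equal-weight blend of the two signals
print(mixed_leaf_value(0.2, 1.0, lam=0.0))  # value network only
```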