The Dwarkesh Reference
Mental model

AlphaGo never has to solve the zero-reward exploration problem: with an accurate value function, MCTS yields a strictly better action label at each move, so training remains supervised learning on improved targets.
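
The idea can be illustrated with a minimal sketch, which is not AlphaGo's actual implementation. It assumes a single root state with three actions, a hypothetical table `values` standing in for an accurate value function, and a root-only PUCT-style selection rule (the `+ 1` under the square root is a convenience for the first simulation). The function names `puct_visit_counts` and `cross_entropy_step` are made up for this example. The search turns a uniform prior into a visit-count distribution concentrated on the best action, and that distribution then serves as an ordinary supervised target, in the style of AlphaGo Zero's visit-count targets.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def puct_visit_counts(prior, values, num_sims=200, c_puct=1.5):
    """Root-only PUCT search. values[a] plays the role of an accurate
    value function evaluated after taking action a (an assumption)."""
    n = [0] * len(prior)    # visit counts per action
    w = [0.0] * len(prior)  # summed backed-up values per action
    for _ in range(num_sims):
        total = sum(n)

        def score(a):
            q = w[a] / n[a] if n[a] else 0.0
            u = c_puct * prior[a] * math.sqrt(total + 1) / (1 + n[a])
            return q + u

        a = max(range(len(prior)), key=score)
        n[a] += 1
        w[a] += values[a]
    return n

def cross_entropy_step(logits, target, lr=0.5):
    """One gradient step on cross-entropy(target, softmax(logits)).
    The gradient with respect to the logits is probs - target."""
    probs = softmax(logits)
    return [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]

# Hypothetical toy setup: an untrained (uniform) policy and three actions.
logits = [0.0, 0.0, 0.0]
values = [-0.2, 0.6, 0.1]  # assumed accurate values; action 1 is best

prior = softmax(logits)
counts = puct_visit_counts(prior, values)
target = [c / sum(counts) for c in counts]  # search-improved action label

# Supervised update toward the improved target; no reward signal is used.
new_logits = cross_entropy_step(logits, target)
new_policy = softmax(new_logits)

print("prior: ", prior)
print("target:", target)
print("policy after one step:", new_policy)
```

In this toy case the target places more probability on the highest-value action than the prior does, and a single cross-entropy step moves the policy in that direction. The improvement depends entirely on `values` being accurate, which is the condition the statement above relies on.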

Who
Eric Jang
Topic
MCTS as policy improvement