Mental model

AlphaGo never has to solve the zero-reward exploration problem: given a reasonably accurate value function, MCTS usually produces a better action label than the raw policy at each move, so training remains supervised learning on improved targets.
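
A minimal sketch of what "supervised learning on improved targets" could look like for a single position. It assumes the search label is the normalized visit counts at the MCTS root and that the policy is trained with cross-entropy against it. The function names, the 4-move position, and all numbers are hypothetical and chosen only for illustration; this is not AlphaGo's actual code.

```python
import math


def visit_counts_to_target(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts into a probability distribution over moves.

    Assumption: the improved label is proportional to N(a) ** (1 / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    if total == 0:
        raise ValueError("search produced no visits")
    return [s / total for s in scaled]


def cross_entropy(target, predicted, eps=1e-12):
    """Ordinary supervised loss: cross-entropy between label and prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Hypothetical position with 4 legal moves.
prior = [0.25, 0.25, 0.25, 0.25]   # raw policy output before search (made up)
visits = [10, 60, 25, 5]           # root visit counts after search (made up)

target = visit_counts_to_target(visits)   # the search-improved label
loss = cross_entropy(target, prior)       # dense training signal for this move
```

The point of the sketch is that the loss is defined at every move, without waiting for the game outcome: the label comes from search, so the policy update is the same kind of step as fitting a classifier to labeled data.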