Mental model
AlphaGo never has to solve the zero-reward exploration problem: given an accurate value function, MCTS yields a strictly better action label at each move, so training remains supervised learning on improved targets.
- Who: Eric Jang
- Topic: MCTS as policy improvement
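
The loop described in the mental model can be illustrated with a toy sketch. This is not AlphaGo's pipeline: a depth-limited lookahead stands in for MCTS, the environment is a hypothetical one-dimensional chain, and the names `value`, `search_label`, and `train` are invented for illustration. The value function is assumed to be accurate, which is the premise of the note. The point it tries to show is that the policy update only ever sees a search-derived action label and a cross-entropy loss, never a raw reward.

```python
import math

# Hypothetical toy setting: states 0..N on a chain, goal at state N.
N = 6
ACTIONS = (-1, +1)   # index 0 = left, index 1 = right
GAMMA = 0.9

def step(s, a):
    """Deterministic transition, clipped to the chain."""
    return min(max(s + a, 0), N)

def value(s):
    """Assumed-accurate value function: discounted distance to the goal."""
    return GAMMA ** (N - s)

def search_label(s, depth=2):
    """Depth-limited lookahead (a stand-in for MCTS).

    Scores each action by searching below it with the value function
    at the leaves, and returns the index of the best action.
    """
    def best(state, d):
        if d == 0 or state == N:
            return value(state)
        return GAMMA * max(best(step(state, a), d - 1) for a in ACTIONS)

    scores = [best(step(s, a), depth - 1) for a in ACTIONS]
    return scores.index(max(scores))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(epochs=50, lr=0.5):
    """Supervised learning on search labels with a tabular softmax policy."""
    logits = [[0.0, 0.0] for _ in range(N)]   # one row per non-terminal state
    for _ in range(epochs):
        for s in range(N):
            label = search_label(s)            # improved target from search
            probs = softmax(logits[s])
            for i in range(len(ACTIONS)):
                target = 1.0 if i == label else 0.0
                # Gradient of cross-entropy with respect to the logits.
                logits[s][i] -= lr * (probs[i] - target)
    return logits

if __name__ == "__main__":
    policy_logits = train()
    for s in range(N):
        print(s, [round(p, 3) for p in softmax(policy_logits[s])])
```

In this sketch the policy starts uniform and still receives a useful label at every state on the first pass, because the search leans on the value function rather than on stumbling into a reward. In a real system the value function is itself learned, so the labels are only as good as its current accuracy.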