The Dwarkesh Reference
← Back
Mental model

Soft labels (the MCTS visit distribution, or teacher logits) carry far more bits per sample than one-hot RL rewards, explaining why distillation works and why AlphaGo trains on the full visit-count distribution.

Who
Eric Jang
Topic
Bits per sample