Prediction (Pending)
As RL tasks become longer-horizon, samples per FLOP fall, which makes naive policy-gradient RL increasingly inefficient and poses a structural problem for agentic training.
- Who: Dwarkesh Patel
- Topic: RL efficiency
- How it gets scored: By the end of 2027, do published studies confirm that bits per FLOP decline for policy-gradient RL as task horizon grows?
- Resolves: 2027-12-31
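
The arithmetic behind the claim can be sketched under some loud assumptions that are not part of the prediction itself: each episode ends in a single binary pass/fail reward, that reward carries at most the binary entropy of the pass rate (so at most 1 bit), and rollout cost is roughly 2 FLOPs per parameter per generated token. The function names and the figures used (a 1e9-parameter model, a 0.5 pass rate) are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def reward_bits_per_flop(horizon_tokens: int, pass_rate: float, n_params: float) -> float:
    """Upper bound on reward information per FLOP for one episode.

    Assumptions (illustrative only): one binary outcome reward per episode,
    and about 2 * n_params FLOPs per generated token for the rollout.
    """
    flops_per_episode = 2.0 * n_params * horizon_tokens
    return binary_entropy_bits(pass_rate) / flops_per_episode


if __name__ == "__main__":
    # Hypothetical horizons; the bound shrinks in proportion to 1 / horizon.
    for horizon in (1_000, 10_000, 100_000):
        bound = reward_bits_per_flop(horizon, pass_rate=0.5, n_params=1e9)
        print(f"horizon={horizon:>7} tokens  bits/FLOP <= {bound:.3e}")
```

Under these assumptions the bound falls tenfold for every tenfold increase in horizon, because the reward information per episode stays capped while the compute per episode grows. Whether real training runs behave this way is the empirical question the prediction resolves on.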