Prediction: Pending

As RL tasks become longer-horizon, samples per FLOP fall, making naive policy-gradient RL increasingly inefficient. This is a structural problem for agentic training.
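The arithmetic behind the claim can be sketched as follows. This is a rough illustration, not a measurement. It assumes each episode yields one outcome reward carrying at most `bits_per_episode` bits, and that rollout cost grows linearly with horizon. The function names and the `FLOPS_PER_STEP` value are hypothetical and do not come from the source.

```python
def samples_per_flop(horizon_steps: int, flops_per_step: float) -> float:
    """Episodes (reward samples) obtained per FLOP of rollout compute.

    Assumes every step of an episode costs the same number of FLOPs.
    """
    return 1.0 / (horizon_steps * flops_per_step)


def bits_per_flop(horizon_steps: int, flops_per_step: float,
                  bits_per_episode: float = 1.0) -> float:
    """Upper bound on learning signal per FLOP.

    Assumes a single end-of-episode reward worth at most
    bits_per_episode bits (1 bit for a binary pass/fail outcome).
    """
    return bits_per_episode * samples_per_flop(horizon_steps, flops_per_step)


FLOPS_PER_STEP = 1e12  # hypothetical cost of one policy step

for horizon in (10, 100, 1000):
    print(horizon, bits_per_flop(horizon, FLOPS_PER_STEP))
```

Under these assumptions a 10x longer horizon gives 10x fewer bits per FLOP, because the per-episode signal stays fixed while the per-episode cost grows. Denser rewards or per-step credit assignment would change the numerator, which is why the claim is specific to naive policy-gradient RL.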

Who: Dwarkesh Patel
Topic: RL efficiency
How it gets scored: By the end of 2027, do studies confirm that bits per FLOP decline for policy-gradient RL as task horizon grows?
Resolves: 2027-12-31
Source: Eric Jang — Building AlphaGo from scratch (02:12:02)