The Dwarkesh Reference
Prediction
Pending

As RL tasks become longer-horizon, samples per FLOP fall, which makes naive policy-gradient RL increasingly inefficient. This is a structural problem for agentic training.

Who
Dwarkesh Patel
Topic
RL efficiency
How it gets scored
By the end of 2027, do studies confirm that bits per FLOP decline for policy-gradient RL as task horizon grows?
Resolves
2027-12-31
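
Illustration
The scoring question turns on bits per FLOP, so a back-of-envelope bound may help show why horizon matters. The sketch below is not from the source. It assumes each episode ends in a single binary reward (so it carries at most the Bernoulli entropy of the pass rate, never more than 1 bit), and that compute per step is constant. The function names, the 1e12 FLOPs-per-step figure, and the 0.5 pass rate are hypothetical values chosen only for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bits_per_flop_upper_bound(
    horizon_steps: int, flops_per_step: float, pass_rate: float
) -> float:
    """Upper bound on reward information per FLOP for one episode.

    Assumption: one binary terminal reward per episode, and a fixed
    compute cost per step (both are simplifications, not measurements).
    """
    episode_flops = horizon_steps * flops_per_step
    return binary_entropy(pass_rate) / episode_flops


if __name__ == "__main__":
    # Hypothetical settings: 1e12 FLOPs per step, 50% pass rate.
    for horizon in (10, 100, 1_000, 10_000):
        bound = bits_per_flop_upper_bound(horizon, 1e12, 0.5)
        print(f"horizon={horizon:>6} steps  bits/FLOP <= {bound:.3e}")
```

Under these assumptions the bound shrinks in proportion to 1/horizon: the episode costs more compute while the terminal reward still delivers at most one bit. Denser rewards or a per-step learning signal would change the picture, which is why the claim is scoped to naive policy-gradient RL.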