Prediction Tracker
Every forward-looking, falsifiable claim made on the show, with a resolution date. As each date passes, predictions get scored. This is the thing a chat-first archive structurally cannot do.
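The scoring workflow described above could be sketched roughly as follows. This is a minimal illustration, not how the tracker is actually implemented: the `Prediction` class, its field names, and the `due_for_scoring` helper are all hypothetical, and the two example entries paraphrase items from the list below.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Prediction:
    """One tracked claim (hypothetical schema, for illustration only)."""
    claim: str
    criterion: str
    resolves_on: Optional[date] = None  # None means no crisp resolution date
    outcome: Optional[bool] = None      # None means not yet scored

    def is_due(self, today: date) -> bool:
        # Due once the resolution date has passed and no score is recorded.
        return (
            self.resolves_on is not None
            and self.outcome is None
            and today > self.resolves_on
        )


def due_for_scoring(predictions: List[Prediction], today: date) -> List[Prediction]:
    """Return the predictions whose resolution date has passed unscored."""
    return [p for p in predictions if p.is_due(today)]


tracked = [
    Prediction(
        claim="Rubin ships with a scale-up domain of ~500+ GPUs",
        criterion="Scale-up domain of at least 400 GPUs by end of 2027",
        resolves_on=date(2027, 12, 31),
    ),
    Prediction(
        claim="The agentic-security future is going to happen",
        criterion="No crisp criterion or date",
    ),
]

for p in due_for_scoring(tracked, date(2028, 1, 1)):
    print(p.claim)
```

Entries without a resolution date never come due in this sketch, which matches how the low-falsifiability items are flagged rather than scored.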
Vera Rubin ships this year, Vera Rubin Ultra next year, and Feynman the year after: a new architecture every single year.
Scored on: Are Vera Rubin / Vera Rubin Ultra / Feynman generally available in 2026 / 2027 / 2028?
No supply-chain bottleneck (CoWoS, logic, memory, EUV) lasts longer than two or three years.
Scored on: Is any single component still the binding constraint on Nvidia output in 2029?
Token cost decreases by roughly an order of magnitude every single year.
Scored on: Does Nvidia's cost per token fall ~10x year over year?
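To make the compounding concrete: a ~10x yearly decline means the cost after n years is the starting cost divided by 10^n. The sketch below shows that arithmetic, plus the year-over-year ratio a scorer would compare against ~10x. The function names and the example prices are hypothetical, chosen only for illustration.

```python
def projected_cost(cost_now: float, years: int, factor_per_year: float = 10.0) -> float:
    """Cost after `years` if it falls by `factor_per_year` every year."""
    return cost_now / factor_per_year ** years


def yearly_reduction(cost_prev: float, cost_now: float) -> float:
    """Observed year-over-year reduction factor (10.0 means a 10x drop)."""
    return cost_prev / cost_now


# Hypothetical numbers: a cost of 1.0 per million tokens today.
for n in range(4):
    print(n, projected_cost(1.0, n))

# Hypothetical observation: cost went from 2.0 to 0.25 in one year.
print(yearly_reduction(2.0, 0.25))
```

Under the claim as stated, three years of compounding would cut cost by a factor of 1,000, so even a sustained 5x per year (125x over three years) would fall well short of it.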
TPU and Trainium growth is '100% Anthropic', a unique instance, not a trend.
Scored on: Does material non-Anthropic custom-ASIC demand emerge by 2027?
Logic and EUV capacity can be scaled 2x per year; doing so is easy within two or three years once there is a demand signal.
Scored on: Does TSMC roughly double AI-logic output year over year through 2028?
The agentic-security future, in which one capable AI agent is surrounded by thousands of agents keeping it safe, is surely going to happen.
Scored on: No crisp criterion or date; flagged as low-falsifiability.
Sparse attention will become a more widely adopted architecture at frontier labs, with DeepSeek's published mechanism pointing the direction.
Scored on: Do at least two top-5 frontier providers ship a production model with sparse attention as the primary mechanism by 2029?
Nvidia's Rubin generation will ship with a scale-up domain of ~500+ GPUs, roughly 4x Blackwell's 72, unlocking larger MoE models in one interconnect domain.
Scored on: Do Rubin NVL racks ship with a scale-up domain of at least 400 GPUs by end of 2027?
Frontier context lengths will stay roughly in the 100-200K range because memory bandwidth is the hard wall and HBM is not improving fast enough.
Scored on: Is it still the case in 2028 that no top-5 frontier model offers a context window above 500K tokens at standard pricing?
Optimally trained models will serve roughly as many inference tokens as they saw in pre-training, implying current frontier models are ~100x over-trained relative to Chinchilla-optimal.
Scored on: Does a credible analysis confirm that a ~150T-token frontier model serves at least 10T inference tokens before deprecation?
The one-hour KV-cache pricing tier on frontier APIs likely corresponds to spinning disk; the drain-time math points to it.
Scored on: Does a frontier lab confirm or credibly leak that long-duration KV-cache persistence uses spinning disk by 2028?
Forward search and simulation to estimate value will make a comeback in LLMs, even if not in AlphaGo's exact MCTS form.
Scored on: Does a widely adopted LLM paradigm incorporate explicit multi-step forward tree search (beyond chain-of-thought) and get credited as a breakthrough by end of 2027?
Today's models can't reliably pick the next experiment or do lateral 'return to first principles' thinking; successor models may close this gap.
Scored on: Does a benchmarked agent autonomously pivot away from a dead-end research track without human prompting, confirmed in a peer-reviewed eval by end of 2027?
A verifiable game like Go could be the outer-loop environment for training automated AI researchers, with skills transferring to harder domains.
Scored on: Does a published agent improve a measurable AI metric through self-directed experiment selection using a game as the verification loop by end of 2028?
Many of KataGo's algorithmic compute multipliers will become irrelevant as GPUs improve; any given multiplier's benefit is transitory.
Scored on: Does a replication show at least three of KataGo's tricks are redundant on Blackwell-class hardware?
As RL tasks become longer-horizon, samples per FLOP fall, making naive policy-gradient RL increasingly inefficient: a structural problem for agentic training.
Scored on: Do studies confirm declining bits-per-FLOP for policy-gradient RL as task horizon grows, by end of 2027?
We'll keep finding very fundamental new principles, analogous to Church-Turing or Noether's theorem, rather than exhausting the supply of deep ideas.
Scored on: If no principle of comparable depth is articulated within ~50 years, that weighs against it.
Quantum computers may handle a strictly larger class of interesting computations, and a quantum AGI would be qualitatively different from a classical one.
Scored on: A proof that BQP = BPP would falsify the first part; broad practical quantum advantage would support it.
AI will help with data-intensive, well-specified problems (like protein structure) but won't automatically resolve the deeper bottlenecks needing long hostile verification loops or paradigm shifts.
Scored on: If AI produces multiple Nobel-class genuine paradigm shifts (not pattern-matching) within 10 years, the claim weakens.
The science-and-technology tree is so large that different civilizations explore radically different branches, implying large gains from trade between them indefinitely.
Scored on: Requires observing multiple advanced civilizations; not empirically testable now.
As ancient-DNA samples grow beyond the current ~16,000 individuals, many more positions under selection will be detected; today's findings are only what's visible at this scale.
Scored on: Do larger ancient-DNA datasets reveal substantially more selected positions in follow-up studies?
The model in which Neanderthals are genetically swamped modern humans, sharing a ~300,000-year-old Middle Stone Age origin, will prove more parsimonious than the current sister-lineage consensus.
Scored on: Does integrating modern-human substructure with archaic ancient-DNA data yield consistent timing and displace the sister-lineage model?
Applying the same methodology beyond Europe and the Middle East will reveal comparable or stronger selection signals, making cross-region comparison a major near-term frontier.
Scored on: Do studies in other world regions surface selection signals of similar strength?