Y Combinator: Recursion Is Quietly Rewriting AI’s Rules—Not Just Bigger Transformers

Two new papers show that looping a model’s own computations—recursion—lets small AI models outthink giants, upending the scaling playbook.

If you only read one thing

A 27-million-parameter model, trained from scratch on just 1,000 puzzles, beat state-of-the-art AI at reasoning—by looping its own thoughts.

Recursion, not just stacking more transformer layers, is changing how AI models reason. Francois Shaard and the panel break down two 2025 papers—Hierarchical Reasoning Models (HRM) and Tiny Recursive Models (TRM)—that show how running a model’s own computations in loops at inference time lets small models solve tasks that stump even the biggest LLMs. HRM, inspired by brain wave frequencies, uses three levels of recursion and, with only 27 million parameters and no pre-training, crushed the ARC Prize benchmarks where previous models failed. The real breakthrough is in training: instead of classic backpropagation through time (which doomed RNNs with vanishing gradients), these models use deep equilibrium learning and fixed-point tricks to train deep recursions without losing signal.

The hosts debate whether biological plausibility matters—one says, 'bioplausibility is a scientific starting point, not a strict guide'—and whether recursion at test time is even needed. Constantine’s analysis shows that training with 16 recursion steps and testing with just one keeps nearly all the performance, flipping old assumptions. The upshot: recursion is a scaling law in its own right, and combining it with large models could unlock 'some crazy stuff.' If you’re still betting on bigger transformers alone, you’re missing where the field is actually moving.

Why it lands

Recursion at inference time lets small, efficient models outperform brute-force scaling. For anyone building or investing in AI, this signals a shift: the next breakthroughs may come from clever algorithmic tricks, not just more compute. Understanding these new scaling laws is now table stakes for anyone serious about AI’s future.

Why Transformers Hit a Wall

Large language models (LLMs) process all input tokens in parallel, which avoids the vanishing gradient problem that crippled RNNs. But this design means LLMs can’t perform tasks like sorting when those tasks require more sequential comparison rounds than the model has layers. Without external memory, LLMs are fundamentally limited on such algorithmic tasks.

  • LLMs can’t run O(n log n) algorithms like sorting in one shot: a fixed layer count caps the number of sequential steps.
  • Faster algorithms require external memory (like a Turing tape), which LLMs lack.
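The depth argument can be made concrete with a toy sorting network (an illustration of the principle, not either paper’s architecture; all names here are invented). One parallel round of compare-and-swap plays the role of a layer: a fixed stack of rounds, like a transformer, eventually runs out of depth, while a loop that reuses the same round until nothing changes always finishes.

```python
def oddeven_pass(xs, parity):
    """One parallel round of compare-and-swap (stands in for one layer)."""
    xs = list(xs)
    for i in range(parity, len(xs) - 1, 2):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

def fixed_depth_sort(xs, layers):
    """A fixed stack of rounds, like a transformer's fixed layer count."""
    for t in range(layers):
        xs = oddeven_pass(xs, t % 2)
    return xs

def recursive_sort(xs):
    """A recursive loop: reuse the same round until a fixed point."""
    while True:
        nxt = oddeven_pass(oddeven_pass(xs, 0), 1)
        if nxt == xs:
            return xs
        xs = nxt

data = [5, 4, 3, 2, 1, 0, 9, 8, 7, 6]
print(fixed_depth_sort(data, 3) == sorted(data))  # False: 3 rounds is too shallow
print(recursive_sort(data) == sorted(data))       # True: looping finishes the job
```

For a list of length n, odd-even transposition needs up to n rounds, so any fixed depth fails on long enough inputs; the loop sidesteps the bound by reusing the same “weights.”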

Recursion, Not Depth, Unlocks Reasoning

The HRM model uses three levels of recursion, inspired by brain wave frequencies, to repeatedly refine its answers. With just 27 million parameters and no pre-training, HRM achieved 70% on ARC-AGI-1, where previous models scored zero. Recursion lets the model reuse weights and hidden states, enabling it to solve complex reasoning tasks that stumped larger models.

  • HRM achieved 70% on ARC-AGI-1, where previous models scored zero.
  • Recursion allows models to reuse weights and hidden states, enabling deeper reasoning.
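Structurally, the idea of nested recursion with reused weights can be sketched as a fast inner loop feeding a slow outer loop. This is a toy with made-up dynamics and invented names (`hrm_sketch`, `w_low`, `w_high`), not the paper’s model, but it shows the shape: the same weights are applied again and again, and hidden states carry over between loops instead of being reset.

```python
import math

def update(state, context, w):
    """A single weight-tied step: the same weight `w` is reused every call."""
    return [math.tanh(w * s + c) for s, c in zip(state, context)]

def hrm_sketch(x, n_cycles=3, t_inner=4, w_low=0.5, w_high=0.3):
    """Toy nested recursion: a fast low-level state updated every step and a
    slow high-level state updated once per cycle, with all weights reused."""
    z_low = [0.0] * len(x)    # fast hidden state, never reset between cycles
    z_high = [0.0] * len(x)   # slow hidden state
    for _ in range(n_cycles):                   # slow (high-level) loop
        for _ in range(t_inner):                # fast (low-level) loop
            ctx = [zh + xi for zh, xi in zip(z_high, x)]
            z_low = update(z_low, ctx, w_low)
        z_high = update(z_high, z_low, w_high)  # one slow update per cycle
    return z_high

print(hrm_sketch([0.2, -0.1, 0.7]))
```

The depth at inference time is `n_cycles * t_inner` applications of the same tiny parameter set, which is how a 27M-parameter model buys itself far more effective computation than its layer count suggests.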

Training Tricks: Deep Equilibrium and Fixed Points

Instead of backpropagating through every recursion step—which causes gradients to vanish—HRM and TRM use deep equilibrium learning and fixed-point iteration. This means the model does multiple forward passes on the same batch, updating hidden states without resetting them, and stops gradients at certain points.

This approach stabilizes training and makes deep recursion practical. Backpropagating through the full recursion loop, rather than just truncating, further improves performance.

  • Deep equilibrium learning stops gradients at certain points, stabilizing training.
  • Backpropagating through deep recursion (not just truncated steps) boosts performance.
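A scalar toy makes the fixed-point trick tangible (simplified stand-in, not either paper’s training code; `grad_one_step` and `grad_implicit` are my names). The forward pass just iterates a weight-tied update to equilibrium with no gradient tape; the backward pass either differentiates only the last step (stop-gradient on everything earlier) or uses the implicit function theorem at the fixed point, so no gradient ever flows through the long chain of iterations.

```python
import math

def f(z, w, x):
    """The weight-tied update whose fixed point we want."""
    return math.tanh(w * z + x)

def fixed_point(w, x, iters=50):
    """Forward pass: iterate to equilibrium; no gradient tape is kept."""
    z = 0.0
    for _ in range(iters):
        z = f(z, w, x)
    return z

def grad_one_step(w, x):
    """'1-step' gradient: differentiate only the final update, treating the
    previous iterate as a constant (a stop-gradient)."""
    z = fixed_point(w, x)
    sech2 = 1.0 - math.tanh(w * z + x) ** 2  # d tanh(u)/du at the fixed point
    return sech2 * z

def grad_implicit(w, x):
    """Exact equilibrium gradient via the implicit function theorem:
    dz*/dw = f_w / (1 - f_z), evaluated at the fixed point."""
    z = fixed_point(w, x)
    sech2 = 1.0 - math.tanh(w * z + x) ** 2
    return (sech2 * z) / (1.0 - sech2 * w)
```

In this toy the one-step gradient points the same way as the exact implicit gradient at a fraction of the cost, which is the intuition for why stopping gradients at certain points still trains: the signal at the fixed point is what matters, not the path taken to reach it.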

Recursion Depth: Train Once, Test Light

Constantine’s analysis found that training with deep recursion (16 steps) but testing with just one step retains nearly all the model’s performance. This challenges the assumption that deep recursion is needed at inference time and suggests that most of the benefit comes from how the model is trained, not how it is run at test time.

  • Training with 16 steps and testing with one keeps nearly all performance.
  • Test-time recursion is less important than train-time recursion.

Worth stealing

  • Recursion at inference time lets small models outperform much larger ones on reasoning tasks.
  • Deep equilibrium learning and fixed-point iteration make deep recursion trainable, avoiding RNN pitfalls.
  • Biological inspiration is useful, but computational efficiency usually wins over bioplausibility.
  • Training with deep recursion but testing with shallow recursion is surprisingly effective.

Lines worth repeating

  • The gradient gets noisier and noisier and then it just kind of stops to work.

    Francois Shaard

  • It’s literally impossible for the model to map from unsorted list to sorted list in one shot.

    Francois Shaard

  • There’s exactly three levels of recursion occurring here.

    Francois Shaard

  • I tend to not be bounded by bioplausibility when I think about what machine learning systems we should prioritize.

    Speaker 2