AI scientists for AI research could eventually automate AI self-improvement. But this is still an early-stage concept, and how to approach it reliably remains underexplored. This blog introduces our recent research toward that goal, starting from a question that is as interesting as it is important: can AI discover scaling laws on its own?
Our new work on Scaling Law Discovery (SLD) contributes a benchmark covering a variety of scaling-law studies and an evolution-based agentic framework for automating the process of discovering scaling laws from experimental data. As we show below, and perhaps surprisingly, our SLD agent can already predict scaling behavior better than human-derived laws.
Scaling Law Discovery: What, Why, and the Challenge
A scaling law is an empirical relationship that predicts a model's performance (loss, perplexity, accuracy, etc.) as a function of "scale variables", such as:
- Model size (N)
- Dataset size (D)
- Compute / FLOPs (C)
- Hyperparameters like learning rate and batch size
- Architecture knobs (e.g., number of experts in MoE)
The canonical example is the "Chinchilla-style" pretraining form [1], where loss is modeled as a power-law function of model size and data size:
The Chinchilla-style scaling law
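In symbols, [1] models the pretraining loss as

$$ L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, $$

where E is the irreducible loss and A, B, α, β are coefficients fitted to the observed runs.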
This predictive capability has informed plenty of real decisions, including compute-optimal planning [1], choosing which model to fine-tune [2], and picking hyperparameters that would otherwise require huge sweeps [3].
Challenge: A real example.
In 2023, we ran into the problem of selecting the most appropriate LLM for downstream fine-tuning tasks. Specifically, given a set of pretrained LLMs and only a limited subset of downstream data, how can we predict which model will perform best after full fine-tuning?
Inspired by established scaling laws, we initially attempted to fit a "Chinchilla-style" power law to predict fine-tuning performance. Unfortunately, we observed a consistent phase transition pattern—what we termed a pre-power phase followed by a power phase. Existing power-law formulations simply couldn't capture this behavior accurately.
The two-phase behavior observed in fine-tuning: pre-power phase followed by power phase
We ended up dedicating a full paper to investigating scaling laws in this scenario and proposed a "rectified" law. This new formulation incorporates an additional property—the effective size of pre-learned data—that was absent in previous laws. Empirically, the fit was significantly better: our paper [2] reports an average RMSD of 0.007 for the rectified law versus 0.036 for the vanilla law (using the same number of fitted parameters).
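Schematically, the rectified law shifts the fine-tuning data size by an effective amount of pre-learned data, roughly

$$ L(D) \;\approx\; \frac{B}{(D_l + D)^{\beta}} + E, $$

where $D_l$ stands for the pre-learned data size; see [2] for the exact parameterization. When $D \ll D_l$ the curve is nearly flat (the pre-power phase), and when $D \gg D_l$ it reduces to the familiar power law (the power phase).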
Two years of AI progress later, scaling laws are everywhere:
- Supervised fine-tuning laws
- Vocabulary-size laws
- MoE laws
- Domain-mixture laws
- Parallelism laws
- Learning-rate & batch-size laws
- U-shaped/double-descent patterns
The SLD paper [4] summarizes this clearly: the scope has expanded rapidly, but discoveries are still manual and case-specific, requiring repeated cycles of hypothesis generation (designing the symbolic law expression) and experimentation (fitting the law to observed datapoints). We want to return to this problem, but in a fundamentally different way: we want AI to discover scaling laws automatically.
Introducing Scaling Law Discovery (SLD)
Our work frames scaling law discovery as follows: given observed experimental trials, output
- A symbolic formula (the law form), and
- An optimizer that robustly fits coefficients on seen data, so the resulting parameterized law extrapolates well.
More precisely, the input is a dataset of trials (x, y, c), where:
- x are feature variables (e.g., model size N, dataset size D, lr, bsz)
- y is the target metric (e.g., loss)
- c is a control index (a setting like "which model" or "which corpus"), where all settings share the same law form but have different fitted coefficients.
That "shared form, different coefficients" detail sounds small, but it's exactly what makes real scaling-law work different from many synthetic symbolic regression tasks.
SLDBench: A Benchmark with 5,000+ Real Experiments
To evaluate this problem rigorously, we curated SLDBench, a scaling law discovery testbed built from over 5,000 LLM training experiments collected from existing scaling-law literature.
SLDBench includes tasks spanning a wide range of scaling scenarios:
Table 1: Overview of SLDBench tasks spanning diverse scaling scenarios
Why SLDBench Is a Good "Scientific Discovery" Benchmark
- The agent gets the experimental results; it does not need to run heavy training.
- The score is continuous and objective: extrapolation on held-out "large-scale" settings.
- There is no learned reward model, and the "true best law" is unknown even to human experts.
- The evaluation environment is a sandbox terminal: agents output a law.py, and we compute metrics (NMSE, NMAE, R²) on an extrapolation test split.
This makes SLDBench a natural yardstick for "can agents do long-horizon, research-like iteration?"
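As a rough sketch of how such scoring works (the exact normalizations used in SLDBench may differ), evaluating a law on the held-out extrapolation split amounts to something like:

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    """Toy versions of NMSE / NMAE / R^2 on a held-out extrapolation split.
    The normalizations here (by target variance / mean absolute deviation)
    are assumptions; SLDBench's exact definitions may differ."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    dev = y_true - y_true.mean()
    return {
        "NMSE": np.mean(resid ** 2) / np.var(y_true),
        "NMAE": np.mean(np.abs(resid)) / np.mean(np.abs(dev)),
        "R2": 1.0 - np.sum(resid ** 2) / np.sum(dev ** 2),
    }
```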
SLDAgent: Co-evolving a Law and Its Optimizer
A key lesson from our own experience (and from watching others do this) is: the symbolic form of a law and the procedure used to fit it cannot be designed in isolation; a promising formula fitted by a brittle optimizer is as useless as a poor formula.
So our agent, SLDAgent, evolves two coupled components:
- Expression: the law form f(x; θ)
- Optimization: a robust routine that fits θ
It starts from a baseline (often power-law + BFGS), then iteratively proposes code modifications, executes them, and keeps the best variants while maintaining diversity via a multi-island + MAP-Elites style evolutionary database. (Implementation note: the system is built on the OpenEvolve framework [5].)
SLDAgent scaffolding: an evolutionary pipeline that co-evolves law expressions and fitting procedures
This "co-optimization" framing matches the real research loop: hypothesize → fit → evaluate → refine.
Our Agent Can Beat Human Discoveries
Across SLDBench, SLDAgent achieves state-of-the-art performance and "superhuman" results relative to human baselines and existing agent systems.
SLDAgent outperforms human-derived laws across benchmark tasks
But the more interesting part is how it wins:
1) Better Laws Are Often More "Principled," Not Just More Complex
For example, in the SFT task, SLDAgent discovers a saturated power-law parameterization in which the "pre-learned data size" parameter θ₃ appears only through the dimensionless ratio D/θ₃, so θ₃ retains the natural unit of dataset size, improving interpretability and downstream usability.
Comparison of human-derived law vs. SLDAgent discovered law on the SFT scaling task
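To give a flavor of what that means (an illustrative form consistent with the description above, not the exact expression the agent found), a saturated power law of this kind looks like

$$ L(D) \;=\; \theta_0 + \theta_1\left(1 + \frac{D}{\theta_3}\right)^{-\theta_2}, $$

where D/θ₃ is dimensionless, so θ₃ can be read off directly in units of dataset size.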
2) The Optimizer Matters as Much as the Formula
Many "pretty" formulas become useless if they're numerically brittle, hard to fit, or fail under extrapolation. SLDAgent explicitly searches over fitting strategy changes (multi-start, stability tricks, etc.) as part of the same evolutionary loop. Check our website for comprehensive cases.
Application: Hyperparameter Optimization for Pretraining (lr & batch size)
One of the most practical use cases is: "I'm pretraining a model, and I need good lr + batch size choices for my scale."
Our paper points out a limitation of a common approach: prior work may run thousands of experiments but keep only a tiny set of "best points" to fit a scaling law for optimal hyperparameters. For instance, the original StepLaw [6] ran ~3,000 experiments but used only 17 optimal points to fit lr* and bsz*.
In contrast, SLDAgent discovers a law for the full loss surface L(N, D, lr, bsz), then derives optimal hyperparameters by setting partial derivatives (∂L/∂lr and ∂L/∂bsz) to zero, yielding a closed-form solution as a function of N and D.
Finding the optimal learning rate and batch size: SLDAgent discovers laws for the full loss surface
That's exactly the kind of "turn experiments into a usable law" pipeline that scaling laws were meant for.
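As a toy illustration of that derivation step, here is how setting the partial derivatives to zero yields closed-form optima for a made-up quadratic-in-log loss surface (not the actual discovered law):

```python
import sympy as sp

# Toy loss surface, NOT the discovered law: quadratic in log(lr) and log(bsz)
# around an (N, D)-dependent optimum, just to show the derivation mechanics.
N, D, lr, bsz = sp.symbols("N D lr bsz", positive=True)
a1, a2, c1, c2, c3, c4 = sp.symbols("a1 a2 c1 c2 c3 c4", positive=True)

L = (a1 * (sp.log(lr) - c1 * sp.log(N) - c2 * sp.log(D)) ** 2
     + a2 * (sp.log(bsz) - c3 * sp.log(N) - c4 * sp.log(D)) ** 2)

# Setting dL/d(lr) = 0 and dL/d(bsz) = 0 gives closed-form optima in N and D:
lr_star = sp.solve(sp.diff(L, lr), lr)[0]      # exp(c1*log N + c2*log D) = N**c1 * D**c2
bsz_star = sp.solve(sp.diff(L, bsz), bsz)[0]   # exp(c3*log N + c4*log D) = N**c3 * D**c4
print(lr_star, bsz_star)
```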
Toward an AI Scientist
Most benchmarks focus on math and coding, which are important but cannot evaluate AI's capability to develop itself. SLD is slightly different: it's a benchmark where the science of AI can be rigorously analyzed. SLDBench was built to demand symbolic reasoning, multi-context generalization, robust extrapolation, and long-horizon execution in a constrained environment…and to do so with a clear, unbiased objective function.
Our work could also serve as a diagnostic for general-purpose coding agents. Eventually, we hope the gap between task-specific systems and general agents will shrink as those agents get better at scientific-style iteration.
References
- [1] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556. [arXiv]
- [2] Haowei Lin, Baizhou Huang, Haotian Ye, et al. "Selecting Large Language Model to Fine-tune via Rectified Scaling Law." arXiv preprint arXiv:2402.02314. [arXiv]
- [3] Jared Kaplan, Sam McCandlish, Tom Henighan, et al. "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361. [arXiv]
- [4] Haowei Lin, Haotian Ye, Wenzheng Feng, et al. "Can Language Models Discover Scaling Laws?" arXiv preprint arXiv:2507.21184. [arXiv]
- [5] Asankhaya Sharma. "OpenEvolve: An Open-Source Evolutionary Coding Agent." Software, 2025. [GitHub]
- [6] Houyi Li, Wenzhen Zheng, Qiufeng Wang, et al. "Predictable Scale: Part I, Step Law – Optimal Hyperparameter Scaling Law in Large Language Model Pre-training." arXiv preprint arXiv:2503.04715. [arXiv]