AI scientists for AI research could eventually automate AI self-improvement. But this is still an early-stage concept, and how to approach it reliably remains underexplored. This blog introduces our recent research toward that goal, starting from a question that is as interesting as it is important: can AI discover scaling laws on its own?
Our new work on Scaling Law Discovery (SLD) contributes a benchmark covering a variety of scaling-law studies and an evolution-based agentic framework for automating the process of discovering scaling laws from experimental data. As we show below, and perhaps surprisingly, our SLD agent can already predict scaling behavior better than human-derived laws.
Scaling Law Discovery: What, Why, and the Challenge
A scaling law is an empirical relationship that predicts a model's performance (loss, perplexity, accuracy, etc.) as a function of "scale variables", such as:
- Model size (N)
- Dataset size (D)
- Compute / FLOPs (C)
- Hyperparameters like learning rate and batch size
- Architecture knobs (e.g., number of experts in MoE)
The canonical example is the "Chinchilla-style" pretraining form [1], where loss is modeled as a power-law function of model size and data size:
The Chinchilla-style scaling law
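In symbols, [1] models the pretraining loss as

$$ L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, $$

where E is the irreducible loss and A, B, α, β are coefficients fitted to the observed runs.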
This predictive capability has informed plenty of real decisions, including compute-optimal planning [1], choosing which model to fine-tune [2], and picking hyperparameters that would otherwise require huge sweeps [3].
Challenge: A real example.
In 2023, we ran into the problem of selecting the most appropriate LLM for downstream fine-tuning tasks. Specifically, given a set of pretrained LLMs and only a limited subset of downstream data, how can we predict which model will perform best after full fine-tuning?
Inspired by established scaling laws, we initially attempted to fit a "Chinchilla-style" power law to predict fine-tuning performance. Unfortunately, we observed a consistent phase transition pattern—what we termed a pre-power phase followed by a power phase. Existing power-law formulations simply couldn't capture this behavior accurately.
The two-phase behavior observed in fine-tuning: pre-power phase followed by power phase
We ended up dedicating a full paper to investigating scaling laws in this scenario and proposed a "rectified" law. This new formulation incorporates an additional property—the effective size of pre-learned data—that was absent in previous laws. Empirically, the fit was significantly better: our paper [2] reports an average RMSD of 0.007 for the rectified law versus 0.036 for the vanilla law (using the same number of fitted parameters).
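Schematically, the rectified law shifts the fine-tuning data size by an effective amount of pre-learned data, roughly

$$ L(D) \;\approx\; \frac{B}{(D_l + D)^{\beta}} + E, $$

where $D_l$ stands for the pre-learned data size; see [2] for the exact parameterization. When $D \ll D_l$ the curve is nearly flat (the pre-power phase), and when $D \gg D_l$ it reduces to the familiar power law (the power phase).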
Two years of AI progress later, scaling laws are everywhere:
- Supervised fine-tuning laws
- Vocabulary-size laws
- MoE laws
- Domain-mixture laws
- Parallelism laws
- Learning-rate & batch-size laws
- U-shaped/double-descent patterns
The SLD paper [4] summarizes this clearly: the scope has expanded rapidly, but discoveries are still manual and case-specific, requiring repeated cycles of hypothesis generation (designing the symbolic law expression) and experimentation (fitting the law to observed datapoints). We want to return to this problem, but in a fundamentally different way: we want AI to discover scaling laws automatically.
Introducing Scaling Law Discovery (SLD)
Our work frames scaling law discovery as follows: given observed experimental trials, output
- A symbolic formula (the law form), and
- An optimizer that robustly fits coefficients on seen data, so the resulting parameterized law extrapolates well.
More precisely, the input is a dataset of trials (x, y, c), where:
- x are feature variables (e.g., model size N, dataset size D, lr, bsz)
- y is the target metric (e.g., loss)
- c is a control index (a setting like "which model" or "which corpus"), where all settings share the same law form but have different fitted coefficients.
That "shared form, different coefficients" detail sounds small, but it's exactly what makes real scaling-law work different from many synthetic symbolic regression tasks.
SLDBench: A Benchmark with 5,000+ Real Experiments
To evaluate this problem rigorously, we curated SLDBench, a scaling law discovery testbed built from over 5,000 LLM training experiments collected from existing scaling-law literature.
SLDBench includes tasks spanning a wide range of scaling scenarios:
Table 1: Overview of SLDBench tasks spanning diverse scaling scenarios
Why SLDBench Is a Good "Scientific Discovery" Benchmark
- The agent gets the experimental results; it does not need to run heavy training.
- The score is continuous and objective: extrapolation on held-out "large-scale" settings.
- There is no learned reward model, and the "true best law" is unknown even to human experts.
- The evaluation environment is a sandbox terminal: agents output a law.py, and we compute metrics (NMSE, NMAE, R²) on an extrapolation test split.
This makes SLDBench a natural yardstick for "can agents do long-horizon, research-like iteration?"
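As a rough sketch of how such scoring works (the exact normalizations used in SLDBench may differ), evaluating a law on the held-out extrapolation split amounts to something like:

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    """Toy versions of NMSE / NMAE / R^2 on a held-out extrapolation split.
    The normalizations here (by target variance / mean absolute deviation)
    are assumptions; SLDBench's exact definitions may differ."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    dev = y_true - y_true.mean()
    return {
        "NMSE": np.mean(resid ** 2) / np.var(y_true),
        "NMAE": np.mean(np.abs(resid)) / np.mean(np.abs(dev)),
        "R2": 1.0 - np.sum(resid ** 2) / np.sum(dev ** 2),
    }
```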
SLDAgent: Co-evolving a Law and Its Optimizer
A key lesson from our own experience (and from watching others do this) is: the symbolic form of a law and the procedure used to fit it cannot be designed in isolation; a promising formula fitted by a brittle optimizer is as useless as a poor formula.
So our agent, SLDAgent, evolves two coupled components:
- Expression: the law form f(x; θ)
- Optimization: a robust routine that fits θ
It starts from a baseline (often power-law + BFGS), then iteratively proposes code modifications, executes them, and keeps the best variants while maintaining diversity via a multi-island + MAP-Elites style evolutionary database. (Implementation note: the system is built on the OpenEvolve framework [5].)
SLDAgent scaffolding: an evolutionary pipeline that co-evolves law expressions and fitting procedures
This "co-optimization" framing matches the real research loop: hypothesize → fit → evaluate → refine.
Our Agent Can Beat Human Discoveries
Across SLDBench, SLDAgent achieves state-of-the-art performance and "superhuman" results relative to human baselines and existing agent systems.
SLDAgent outperforms human-derived laws across benchmark tasks
But the more interesting part is how it wins:
1) Better Laws Are Often More "Principled," Not Just More Complex
For example, in the SFT task, SLDAgent discovers a saturated power-law parameterization in which the "pre-learned data size" parameter θ₃ appears only through the dimensionless ratio D/θ₃, so θ₃ retains the natural unit of dataset size, improving interpretability and downstream usability.
Comparison of human-derived law vs. SLDAgent discovered law on the SFT scaling task
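To give a flavor of what that means (an illustrative form consistent with the description above, not the exact expression the agent found), a saturated power law of this kind looks like

$$ L(D) \;=\; \theta_0 + \theta_1\left(1 + \frac{D}{\theta_3}\right)^{-\theta_2}, $$

where D/θ₃ is dimensionless, so θ₃ can be read off directly in units of dataset size.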
2) The Optimizer Matters as Much as the Formula
Many "pretty" formulas become useless if they're numerically brittle, hard to fit, or fail under extrapolation. SLDAgent explicitly searches over fitting strategy changes (multi-start, stability tricks, etc.) as part of the same evolutionary loop. Check our website for comprehensive cases.
Application: Hyperparameter Optimization for Pretraining (lr & batch size)
One of the most practical use cases is: "I'm pretraining a model, and I need good lr + batch size choices for my scale."
Our paper points out a limitation of a common approach: prior work may run thousands of experiments but keep only a tiny set of "best points" to fit a scaling law for optimal hyperparameters. For instance, the original StepLaw [6] ran ~3,000 experiments but used only 17 optimal points to fit lr* and bsz*.
In contrast, SLDAgent discovers a law for the full loss surface L(N, D, lr, bsz), then derives optimal hyperparameters by setting partial derivatives (∂L/∂lr and ∂L/∂bsz) to zero, yielding a closed-form solution as a function of N and D.
Finding the optimal learning rate and batch size: SLDAgent discovers laws for the full loss surface
That's exactly the kind of "turn experiments into a usable law" pipeline that scaling laws were meant for.
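As a toy illustration of that derivation step, here is how setting the partial derivatives to zero yields closed-form optima for a made-up quadratic-in-log loss surface (not the actual discovered law):

```python
import sympy as sp

# Toy loss surface, NOT the discovered law: quadratic in log(lr) and log(bsz)
# around an (N, D)-dependent optimum, just to show the derivation mechanics.
N, D, lr, bsz = sp.symbols("N D lr bsz", positive=True)
a1, a2, c1, c2, c3, c4 = sp.symbols("a1 a2 c1 c2 c3 c4", positive=True)

L = (a1 * (sp.log(lr) - c1 * sp.log(N) - c2 * sp.log(D)) ** 2
     + a2 * (sp.log(bsz) - c3 * sp.log(N) - c4 * sp.log(D)) ** 2)

# Setting dL/d(lr) = 0 and dL/d(bsz) = 0 gives closed-form optima in N and D:
lr_star = sp.solve(sp.diff(L, lr), lr)[0]      # exp(c1*log N + c2*log D) = N**c1 * D**c2
bsz_star = sp.solve(sp.diff(L, bsz), bsz)[0]   # exp(c3*log N + c4*log D) = N**c3 * D**c4
print(lr_star, bsz_star)
```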
Toward an AI Scientist
Most benchmarks focus on math and coding, which are important but cannot evaluate AI's capability to develop itself. SLD is slightly different: it's a benchmark where the science of AI can be rigorously analyzed. SLDBench was built to demand symbolic reasoning, multi-context generalization, robust extrapolation, and long-horizon execution in a constrained environment…and to do so with a clear, unbiased objective function.
Our work could also serve as a diagnostic for general-purpose coding agents. Eventually, we hope the gap between task-specific systems and general agents will shrink as those agents get better at scientific-style iteration.
References
- [1] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556. [arXiv]
- [2] Haowei Lin, Baizhou Huang, Haotian Ye, et al. "Selecting Large Language Model to Fine-tune via Rectified Scaling Law." arXiv preprint arXiv:2402.02314. [arXiv]
- [3] Jared Kaplan, Sam McCandlish, Tom Henighan, et al. "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361. [arXiv]
- [4] Haowei Lin, Haotian Ye, Wenzheng Feng, et al. "Can Language Models Discover Scaling Laws?" arXiv preprint arXiv:2507.21184. [arXiv]
- [5] Asankhaya Sharma. "OpenEvolve: An Open-Source Evolutionary Coding Agent." Software, 2025. [GitHub]
- [6] Houyi Li, Wenzhen Zheng, Qiufeng Wang, et al. "Predictable Scale: Part I, Step Law – Optimal Hyperparameter Scaling Law in Large Language Model Pre-training." arXiv preprint arXiv:2503.04715. [arXiv]