
Data-Constrained Scaling Law

Agent: SLDAgent
Model: Gemini 2.5 Flash
Best R²: 0.925040
Mean R²: 0.889634
Min R²: 0.831876
Runs: 5

All Runs (sorted by R²)

#1 (Best) Run 5 R² = 0.925040
Python
# EVOLVE-BLOCK-START
"""
Scaling law discovery for LLM finetuning scenarios.
This evolved version extends the successful multiplicative power law model by adding a specific term
to address "data-constrained conditions" more directly. The model is:
L = B + C_M * U^E_U * P^E_P * D^E_D + C_ratio * (U/D)^E_ratio.

The main power law term (C_M * U^E_U * P^E_P * D^E_D) captures the general scaling behavior
with unique tokens (U), model parameters (P), and total tokens (D). This part is consistent
with established LLM scaling laws, where increasing U, P, or D generally decreases loss.

The additional term (C_ratio * (U/D)^E_ratio) is introduced to specifically model the impact
of data diversity under data-constrained conditions. A low ratio of unique tokens (U) to
total tokens (D) indicates data repetition or scarcity, which is hypothesized to increase
loss beyond what standard power laws capture. With E_ratio being a negative exponent,
this term increases loss as U/D decreases, providing a direct penalty for data repetition
or lack of diversity. This explicitly addresses the problem's focus on data-constrained scenarios.

This model uses 7 parameters, maximizing flexibility within the allowed parameter budget, and
maintains numerical stability through log-transformations and bounded optimization (L-BFGS-B).
The bounds and initializations for all parameters, especially the new ones, are chosen to keep
values physically meaningful, prevent numerical instability, and help the optimizer converge to
a stable, accurate solution.
"""
import numpy as np
from scipy.optimize import minimize
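
# Note (illustrative, assumed values only): because E_ratio is fitted to be negative,
# the penalty (U/D)^E_ratio grows as data is repeated. For example, with an assumed
# E_ratio = -0.5, a corpus seen four times (U/D = 0.25) gives 0.25 ** -0.5 = 2.0,
# twice the penalty of fully unique data (1.0 ** -0.5 = 1.0); C_ratio then scales
# this penalty into loss units via ratio_term = C_ratio * (U / D) ** E_ratio.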

def scaling_law_func(data_points, params):
    # data_points: (N,3) array with columns [unique_tokens, params, tokens]
    X = np.atleast_2d(np.asarray(data_points))
    
    # Extract features: [unique_tokens, params, tokens]
    U, P, D = X[:, 0], X[:, 1], X[:, 2]

    # Ensure feature values are positive before logging to prevent log(0) or log(negative).
    # A small epsilon (1e-9 or 1e-12) is used for robustness.
    U_safe = np.maximum(U, 1e-9) 
    P_safe = np.maximum(P, 1e-9) 
    D_safe = np.maximum(D, 1e-9) 

    # Calculate the ratio of unique tokens to total tokens for the new term.
    # Ensure the ratio is also positive to prevent log(0) issues.
    UD_ratio_safe = np.maximum(U_safe / D_safe, 1e-12) 

    # Parameters for the combined scaling law: [C_M, E_U, E_P, E_D, B, C_ratio, E_ratio]
    # C_M: Multiplicative coefficient for the main power law term
    # E_U, E_P, E_D: Exponents for unique_tokens, params, tokens respectively
    # B: Irreducible loss (bias term)
    # C_ratio: Coefficient for the unique_tokens/tokens ratio term
    # E_ratio: Exponent for the unique_tokens/tokens ratio term
    C_M, E_U, E_P, E_D, B, C_ratio, E_ratio = params

    # Calculate the main power law term: C_M * U^E_U * P^E_P * D^E_D
    # Computed in log space (sum of logs, then exp) for numerical stability with large feature magnitudes and negative exponents.
    log_main_term_components = (
        np.log(C_M) + 
        E_U * np.log(U_safe) + 
        E_P * np.log(P_safe) + 
        E_D * np.log(D_safe)
    )
    main_power_term = np.exp(log_main_term_components)
    
    # Calculate the ratio term: C_ratio * (U/D)^E_ratio
    # This term is designed to increase loss when U/D is small (data repetition).
    # Since E_ratio is expected to be negative, (U/D)^E_ratio will be larger for smaller U/D.
    log_ratio_term_components = np.log(C_ratio) + E_ratio * np.log(UD_ratio_safe)
    ratio_term = np.exp(log_ratio_term_components)
    
    # Final predicted loss: sum of irreducible loss, main power law term, and ratio term.
    pred = B + main_power_term + ratio_term
    
    # Ensure predictions stay at or above a plausible floor for cross-entropy loss.
    # The 0.5 floor mirrors the lower bound placed on the irreducible-loss term B during fitting.
    pred = np.maximum(pred, 0.5) 

    return pred


def fit_scaling_law(data_points, loss_values):
    X = np.atleast_2d(np.asarray(data_points))
    y = np.asarray(loss_values)
    
    # The new model uses 7 parameters: [C_M, E_U, E_P, E_D, B, C_ratio, E_ratio]
    num_params = 7 

    # --- Improved Initialization ---
    # Initial values for parameters from the previous successful 5-parameter model:
    initial_C_M = 10.0 
    initial_E_U = -0.1
    initial_E_P = -0.1
    initial_E_D = -0.1
    # Estimate irreducible loss from the minimum observed loss, clipped into the
    # (0.5, 2.0) bound used below so the starting point is always feasible for L-BFGS-B.
    initial_B = float(np.clip(np.min(y) * 0.8, 0.5, 2.0))
    
    # Initial values for the new ratio term parameters:
    # C_ratio: Start small to avoid this term dominating the initial prediction,
    # as (U/D)^E_ratio can be very large for small U/D and negative E_ratio.
    initial_C_ratio = 1e-5  
    # E_ratio: Negative exponent to penalize low U/D. Start with a moderate negative value.
    initial_E_ratio = -0.5 
    
    init = np.array([initial_C_M, initial_E_U, initial_E_P, initial_E_D, 
                     initial_B, initial_C_ratio, initial_E_ratio])

    # --- Define Bounds for Parameters ---
    # These bounds help guide the optimizer towards physically meaningful parameters,
    # prevent unrealistic values, and improve numerical stability.
    # C_M: (1e-6, 1e6) - Must be positive. Prevents issues with log(C_M) and excessively large values.
    bounds_cm = (1e-6, 1e6)
    # Exponents (E_U, E_P, E_D): (-1.0, 0.0) - Negative so that loss improves with scale,
    # and rarely steeper than -1.0 in LLM scaling laws (reported values are often around -0.07 to -0.2 for data/model terms).
    bounds_exp = (-1.0, 0.0)
    # B: (0.5, 2.0) - Irreducible loss is positive and often in this range for
    # cross-entropy loss in LLMs, representing a practical lower bound on achievable loss.
    bounds_b = (0.5, 2.0)
    
    # Bounds for the new ratio term parameters:
    # C_ratio: Must be positive. Constrained to a smaller range than general coefficients
    # to prevent the ratio term from becoming excessively dominant given its potential magnitude.
    bounds_c_ratio = (1e-9, 1e-1)   
    # E_ratio: Must be negative. Constrained to ensure it penalizes low U/D,
    # and prevents extremely steep or flat (near zero) behavior.
    bounds_e_ratio = (-1.0, -0.01)   # Ensures it's negative and not too close to zero.

    bounds = [bounds_cm, bounds_exp, bounds_exp, bounds_exp, bounds_b,
              bounds_c_ratio, bounds_e_ratio]

    def objective(params):
        pred = scaling_law_func(X, params)
        mse = np.mean((pred - y) ** 2)
        return mse

    # Use 'L-BFGS-B' for bounded optimization, robust for complex functions.
    # Increased maxiter and tighter tolerances for thorough optimization.
    result = minimize(objective, init, method='L-BFGS-B', bounds=bounds, 
                      options={'maxiter': 2000, 'ftol': 1e-9, 'gtol': 1e-9})

    # Return the optimized parameters if the optimizer reports success; otherwise fall back to the initial guess.
    params_opt = result.x if result.success else init

    return params_opt
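
# --- Illustrative usage sketch (assumed numbers, not from the evaluation data) ---
# A minimal end-to-end check of the two functions above: build a tiny grid of
# hypothetical (unique_tokens, params, tokens) points, generate losses from the
# model itself plus noise, refit, and report R^2. This only demonstrates the
# call signatures; it says nothing about the leaderboard score.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_X = np.array([
        [1e9,  1e8, 1e9],    # fully unique data (U/D = 1)
        [1e9,  1e8, 4e9],    # same unique tokens repeated ~4x (U/D = 0.25)
        [4e9,  4e8, 4e9],
        [1e10, 1e9, 2e10],
        [2e10, 2e9, 8e10],
        [5e9,  5e8, 1e10],
    ])
    assumed_params = [20.0, -0.12, -0.10, -0.08, 1.8, 1e-4, -0.4]  # assumed values, demo only
    demo_y = scaling_law_func(demo_X, assumed_params) + rng.normal(0.0, 0.005, size=len(demo_X))

    fitted = fit_scaling_law(demo_X, demo_y)
    pred = scaling_law_func(demo_X, fitted)
    ss_res = np.sum((demo_y - pred) ** 2)
    ss_tot = np.sum((demo_y - np.mean(demo_y)) ** 2)
    print("fitted params:", np.round(fitted, 4))
    print("demo R^2:", 1.0 - ss_res / ss_tot)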

# EVOLVE-BLOCK-END
#2 Run 2 R² = 0.919204
#3 Run 1 R² = 0.902578
#4 Run 4 R² = 0.869472
#5 Run 3 R² = 0.831876