# EVOLVE-BLOCK-START
"""
Scaling law discovery for LLM training scenarios under data-constrained conditions.
This evolved version refines the scaling law function to use log-transformed power terms
for improved numerical stability and updates the optimization algorithm with an
informed initial guess for coefficients and robust bounds, drawing inspiration from
high-performing models in the evolution history.
"""
import numpy as np
from scipy.optimize import minimize


def scaling_law_func(data_points, params):
"""
Predicts loss based on unique tokens, model parameters, and total tokens.
The scaling law used is of the form:
Loss = L0 + CU * (unique_tokens)^(-alphaU) + CP * (params)^(-alphaP) + CT * (tokens)^(-alphaT)
Parameters:
- data_points: (N,3) array with columns [unique_tokens, params, tokens].
These are typically large positive numbers.
- params: Array of 7 parameters: [L0, CU, alphaU, CP, alphaP, CT, alphaT].
- L0: The irreducible loss floor. Should be positive.
- CU, CP, CT: Positive coefficients for the power law terms.
- alphaU, alphaP, alphaT: Positive exponents for the inverse power law terms.
Returns:
- Predicted loss values (N,) array.
"""
X = np.atleast_2d(np.asarray(data_points))
# Unpack parameters according to the defined structure
# params: [L0, CU, alphaU, CP, alphaP, CT, alphaT]
L0, CU, alphaU, CP, alphaP, CT, alphaT = params
unique_tokens = X[:, 0]
model_params = X[:, 1]
tokens = X[:, 2]
    # Add a small epsilon to each base so np.log stays finite even if a value is
    # zero; the expected data ranges are strictly positive, so this is a safeguard.
    epsilon = 1e-10
    # Compute the inverse power-law terms via an exp/log transform for numerical
    # stability. The exponents are negated so that positive alphaU, alphaP, alphaT
    # yield terms that shrink as unique_tokens, model_params, or tokens grow.
term_U = CU * np.exp(-alphaU * np.log(unique_tokens + epsilon))
term_P = CP * np.exp(-alphaP * np.log(model_params + epsilon))
term_T = CT * np.exp(-alphaT * np.log(tokens + epsilon))
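    # Each term above is mathematically equal to C * (x + epsilon) ** (-alpha);
    # the exp/log form expresses the same power law while keeping intermediate
    # values well behaved for very large token and parameter counts.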
# Sum up the terms to get the predicted loss
pred_loss = L0 + term_U + term_P + term_T
    # Predictions are returned unclipped; the objective in fit_scaling_law
    # penalizes severely negative predictions instead.
    return pred_loss


def fit_scaling_law(data_points, loss_values):
"""
Fits the scaling law function to the provided data using bounded optimization.
Parameters:
    - data_points: (N, 3) array with columns [unique_tokens, model_params, tokens].
- loss_values: (N,) array of corresponding loss values.
Returns:
- Optimized parameters: [L0, CU, alphaU, CP, alphaP, CT, alphaT].
"""
X = np.asarray(data_points)
y = np.asarray(loss_values)
# Define the objective function (Mean Squared Error) for optimization
def objective(params):
predicted_loss = scaling_law_func(X, params)
# Calculate Mean Squared Error
mse = np.mean((predicted_loss - y) ** 2)
        # Penalize non-finite MSE and grossly negative predictions with a large
        # constant. The -100 threshold still lets the optimizer explore mildly
        # negative predictions, which can help it escape local minima.
        if not np.isfinite(mse) or np.any(predicted_loss < -100):
            return 1e12
return mse
# --- Improved Initial Guess for Parameters ---
# Parameters: [L0, CU, alphaU, CP, alphaP, CT, alphaT]
# L0: Irreducible loss floor. Typically positive and below the minimum observed loss.
min_loss_y = np.min(y)
initial_L0 = max(0.01, min_loss_y * 0.8) # Based on high-performing program's heuristic
    # CU, CP, CT: Coefficients. These can be large because x^(-alpha) is very small
    # for large x, so a large starting value gives the optimizer room to explore.
initial_C = 1000.0 # Adopted from high-performing program
# alphaU, alphaP, alphaT: Exponents. Typically small positive values (e.g., 0.05 to 0.2).
initial_alpha = 0.1
initial_params = np.array([
initial_L0,
initial_C, initial_alpha, # CU, alphaU
initial_C, initial_alpha, # CP, alphaP
initial_C, initial_alpha # CT, alphaT
])
# --- Define Bounds for Parameters ---
# Bounds help guide the optimizer to physically meaningful regions and improve stability.
# Parameters: [L0, CU, alphaU, CP, alphaP, CT, alphaT]
bounds = [
(0.001, np.max(y) + 1.0), # L0: Must be positive, up to slightly above max observed loss
(1e-9, None), # CU: Positive, unbounded above to allow for large coefficients
(1e-9, 2.0), # alphaU: Positive, typically < 1, allow up to 2.0
(1e-9, None), # CP: Positive, unbounded above
(1e-9, 2.0), # alphaP: Positive, typically < 1, allow up to 2.0
(1e-9, None), # CT: Positive, unbounded above
(1e-9, 2.0) # alphaT: Positive, typically < 1, allow up to 2.0
]
    # The 1e-9 lower bounds keep the coefficients and exponents strictly positive,
    # avoiding degenerate (zero or negative) power-law terms during optimization.
# --- Optimization using L-BFGS-B ---
    # L-BFGS-B is chosen because it handles box bounds directly, which is needed
    # to enforce the physical constraints on the scaling law parameters.
    # The options (maxiter, ftol) are set for robust convergence.
result = minimize(objective, initial_params, method='L-BFGS-B', bounds=bounds,
options={'disp': False, 'maxiter': 2000, 'ftol': 1e-9})
# Return optimized parameters if successful, otherwise the initial guess as a fallback.
params_opt = result.x if result.success else initial_params
# Final clip to bounds as a safeguard, in case optimizer returns values slightly outside bounds
# (e.g., due to floating point precision or stopping criteria).
params_opt = np.clip(params_opt,
[b[0] if b[0] is not None else -np.inf for b in bounds],
[b[1] if b[1] is not None else np.inf for b in bounds])
return params_opt
# EVOLVE-BLOCK-END
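

# --- Minimal usage sketch (illustrative, not part of the evolved block) ---
# Fits the scaling law to synthetic data generated from hypothetical "true"
# parameters and reports the recovered fit. The parameter values, data ranges,
# and noise level below are assumptions chosen for demonstration only.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_params = np.array([1.7, 12.0, 0.12, 8.0, 0.10, 10.0, 0.11])
    n = 64
    demo_points = np.column_stack([
        10 ** rng.uniform(8, 11, n),   # unique_tokens:    1e8 .. 1e11
        10 ** rng.uniform(7, 10, n),   # model parameters: 1e7 .. 1e10
        10 ** rng.uniform(8, 12, n),   # total tokens:     1e8 .. 1e12
    ])
    # Generate losses from the "true" law plus a little observation noise.
    demo_loss = scaling_law_func(demo_points, true_params)
    demo_loss = demo_loss + rng.normal(0.0, 0.01, n)
    fitted = fit_scaling_law(demo_points, demo_loss)
    pred = scaling_law_func(demo_points, fitted)
    print("fitted params:", np.round(fitted, 4))
    print("MSE on synthetic data:", float(np.mean((pred - demo_loss) ** 2)))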