# EVOLVE-BLOCK-START
"""
Scaling law discovery for LLM finetuning scenarios.
This evolved program introduces a new functional form for the scaling law and
enhances the optimization strategy to improve accuracy and robustness.
Key Improvements:
1. **Scaling Law Functional Form:** The `scaling_law_func` now models each output
domain's loss (L_j) based on the sum of power-law contributions from all input
domain proportions (x_k), where the exponent (e_j) is specific to the *output*
domain j, rather than specific to the input domain k.
New form: `L_j = b_j + sum_{k=1 to F} (c_jk * x_k^e_j)`
- `b_j`: Bias for output domain j.
- `c_jk`: Coefficient for influence of input domain k on output domain j.
- `e_j`: Exponent for output domain j, applied to all input proportions affecting L_j.
   With F = 5 domains, this stays within the 35-parameter limit (F biases + F*F coefficients + F exponents = 5 + 25 + 5 = 35).
This structure allows each output domain to exhibit its own scaling behavior with respect
to the mixture proportions, which might better capture cross-domain generalization effects.
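   For concreteness, an illustrative expansion for a single output domain (j = 1, F = 5):
   `L_1 = b_1 + c_11*x_1^e_1 + c_12*x_2^e_1 + c_13*x_3^e_1 + c_14*x_4^e_1 + c_15*x_5^e_1`,
   i.e. every input proportion shares the single exponent e_1 but has its own coefficient c_1k.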
2. **Optimization Robustness:** The `fit_scaling_law` function now employs:
- **Multiple Random Initializations:** In addition to a deterministic initial guess,
     it performs `N_TRIALS` (increased to 30) optimizations from initial parameters drawn
     uniformly at random within their defined bounds. This explores the non-convex parameter
     space more thoroughly and reduces the risk of converging to a poor local minimum.
- **Refined Exponent Initialization Range:** For the random initializations,
     the exponents are initialized within a narrower, more typical range (0.0 to 2.0)
to guide the search towards plausible scaling behaviors, while the broader
L-BFGS-B bounds (0.0 to 5.0) still allow the optimizer to explore further if beneficial.
"""

import numpy as np
from scipy.optimize import minimize


def scaling_law_func(data_points, params):
"""
Predicts multi-domain loss values based on domain proportions using a generalized power law.
The model for each domain's loss L_j is:
L_j = b_j + sum_{k=1 to F} (c_jk * x_k^e_j)
Where:
- F is the number of domains (5).
- x_k is the proportion of domain k.
- b_j is the bias term for output domain j.
- c_jk is the coefficient representing the influence of input proportion x_k on output loss L_j.
- e_j is the exponent for output domain j, shared across all input proportions affecting L_j.
Total parameters: F (biases) + F*F (coefficients) + F (exponents) = 5 + 25 + 5 = 35.
Args:
data_points (np.ndarray): (N, F) array with domain proportions for F domains.
params (np.ndarray): 1D array of 35 parameters.
Returns:
np.ndarray: Predicted multi-domain loss values (N, F).
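    Example (illustrative only; arbitrary, unfitted parameter values):
        >>> F = 5
        >>> params = np.zeros(F + F*F + F)
        >>> params[:F] = 2.0                      # biases b_j = 2.0, all other params 0
        >>> X = np.full((1, F), 1.0 / F)          # one uniform mixture over 5 domains
        >>> scaling_law_func(X, params).shape
        (1, 5)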
"""
X = np.atleast_2d(np.asarray(data_points)) # (N, F)
N, F = X.shape # F = 5 (number of domains)
# Unpack parameters: [b_1..b_F, c_11..c_FF, e_1..e_F]
# The parameters are ordered as: F biases, F*F coefficients, F exponents
biases = params[:F] # (F,) - bias for each output domain L_j
coeffs_flat = params[F : F + F*F] # (F*F,) - flattened coefficients
coeffs = coeffs_flat.reshape(F, F) # (F, F) - c_jk where c_jk is coeffs[j, k]
# coeffs[j, k] means the influence of input domain k on output domain j
    # Exponents are specific to the output domain j: one exponent e_j is shared across all input proportions in L_j.
exponents = params[F + F*F : F + F*F + F] # (F,) - exponent for each output domain L_j
predicted_losses = np.zeros_like(X, dtype=float) # (N, F)
# Calculate predictions for each output domain L_j
for j in range(F):
# Get the exponent specific to this output domain j
ej = exponents[j]
# Calculate X_k^ej for all input domains k and all data points N
        # np.where maps x_k = 0 to a zero contribution, which is the desired behavior when a
        # proportion of 0 means the domain is absent (it also sidesteps the 0^0 = 1 convention
        # when e_j = 0). The non-negative exponent bounds additionally keep np.power from
        # raising 0 to a negative power, which would produce inf.
X_powered_by_ej = np.where(X > 0, np.power(X, ej), 0.0) # (N, F)
# Multiply by coefficients c_jk for this specific output domain j
# coeffs[j, :] gives the (F,) array of coefficients [c_j1, c_j2, ..., c_jF]
# [None, :] broadcasts it to (1, F) for element-wise multiplication with (N, F)
contributions_for_j = X_powered_by_ej * coeffs[j, :][None, :] # (N, F)
# Sum contributions from all input domains k for each data point
sum_contributions_for_j = np.sum(contributions_for_j, axis=1) # (N,)
# Add bias for this output domain j
predicted_losses[:, j] = biases[j] + sum_contributions_for_j
return predicted_losses


def fit_scaling_law(data_points, loss_values):
"""
Optimizes the parameters for the scaling_law_func using L-BFGS-B with multiple initializations.
Args:
data_points (np.ndarray): (N, F) array with domain proportions for F domains.
loss_values (np.ndarray): Corresponding multi-domain losses (N, F).
Returns:
np.ndarray: Optimized parameters (1D array of 35 parameters).
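    Example (illustrative sketch with synthetic data; not a doctest, since the fit is stochastic):
        X = np.random.dirichlet(np.ones(5), size=32)   # 32 mixtures over 5 domains
        y = 2.0 + np.random.rand(32, 5)                # placeholder losses in a plausible range
        params = fit_scaling_law(X, y)                 # 1D array of 35 parameters
        preds = scaling_law_func(X, params)            # (32, 5) predicted losses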
"""
X = np.atleast_2d(np.asarray(data_points)) # (N, F)
y = np.asarray(loss_values) # (N, F)
N, F = X.shape # F = 5 (number of domains)
# Number of parameters for the multi-output model: F biases + F*F coefficients + F exponents
num_params = F + F*F + F # 5 + 25 + 5 = 35
# Bounds for parameters to ensure numerical stability and reasonable values
    # Biases: Loss values are positive; the range [0, 5] comfortably covers the observed losses (roughly 1.8-4.2).
bias_bounds = [(0.0, 5.0)] * F
# Coefficients: Can be positive or negative, allowing for various interaction types. Range [-10, 10].
coeff_bounds = [(-10.0, 10.0)] * (F*F)
# Exponents: Must be non-negative. Common scaling law exponents are often between 0 and 2.
    # Allowing up to 5 gives the optimizer flexibility, while the random initializations below use a tighter range.
exponent_bounds_optimizer = [(0.0, 5.0)] * F
# Combined bounds for the L-BFGS-B optimizer
bounds = bias_bounds + coeff_bounds + exponent_bounds_optimizer
def objective(params_flat):
"""Calculates the Mean Squared Error between predictions and actual loss values."""
pred = scaling_law_func(X, params_flat) # (N, F)
mse = np.mean((pred - y) ** 2)
return mse
best_mse = np.inf
best_params = None
# --- Deterministic initial guess ---
# Biases: Mean loss for each domain is a good starting point.
init_biases_det = np.mean(y, axis=0) # (F,)
# Coefficients: Initialize to zeros. The optimizer will find the interactions.
init_coeffs_det = np.zeros((F, F)) # (F, F)
# Exponents: Initialize to 1.0 (linear scaling) for all output domains.
init_exponents_det = np.ones(F) * 1.0 # (F,)
# Combine initial parameters into a single 1D array
initial_params_det = np.concatenate([init_biases_det, init_coeffs_det.ravel(), init_exponents_det]) # (num_params,)
# 1. Run optimization with the deterministic initial guess
result_det = minimize(objective, initial_params_det, method='L-BFGS-B', bounds=bounds)
if result_det.success and result_det.fun < best_mse:
best_mse = result_det.fun
best_params = result_det.x
# --- Multiple Random Initializations ---
N_TRIALS = 30 # Number of random initializations to try for better global optimum search
# Tighter random initialization bounds for exponents (more typical range)
exponent_random_init_range = (0.0, 2.0)
for _ in range(N_TRIALS):
# Generate random initial parameters within their respective bounds
random_biases = np.random.uniform(bias_bounds[0][0], bias_bounds[0][1], F)
random_coeffs = np.random.uniform(coeff_bounds[0][0], coeff_bounds[0][1], F*F)
random_exponents = np.random.uniform(exponent_random_init_range[0], exponent_random_init_range[1], F)
random_initial_params = np.concatenate([random_biases, random_coeffs, random_exponents])
result_rand = minimize(objective, random_initial_params, method='L-BFGS-B', bounds=bounds)
if result_rand.success and result_rand.fun < best_mse:
best_mse = result_rand.fun
best_params = result_rand.x
# Fallback: if no successful optimization was found, return the deterministic initial guess.
# This scenario should be rare with good bounds and initial guesses.
if best_params is None:
return initial_params_det
return best_params
# EVOLVE-BLOCK-END
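

# Minimal illustrative smoke test (demonstration values only, assumed rather than taken
# from any experiment): generate synthetic mixtures from a known parameterization of the
# scaling law above, refit it with fit_scaling_law, and report the reconstruction error.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = 5
    true_params = np.concatenate([
        rng.uniform(1.8, 4.2, F),        # biases b_j in the typical loss range
        rng.uniform(-1.0, 1.0, F * F),   # coefficients c_jk
        rng.uniform(0.3, 1.5, F),        # per-output-domain exponents e_j
    ])
    X_demo = rng.dirichlet(np.ones(F), size=64)           # 64 mixture proportions
    y_demo = scaling_law_func(X_demo, true_params)
    y_demo += rng.normal(scale=0.01, size=y_demo.shape)   # small observation noise
    fitted = fit_scaling_law(X_demo, y_demo)
    demo_mse = np.mean((scaling_law_func(X_demo, fitted) - y_demo) ** 2)
    print(f"Demo fit MSE: {demo_mse:.6f}")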