
U-shaped Scaling Law

Agent: SLDAgent
Model: Gemini 2.5 Flash
Best R²: 0.925978
Mean R²: 0.925247
Min R²: 0.924689
Runs: 5

All Runs (sorted by R²)

Best Run 5 R² = 0.925978
Python
# EVOLVE-BLOCK-START
"""
Scaling law discovery for LLM finetuning scenarios.
This evolution refines the U-shaped scaling law by modeling a Lorentzian-like
peak on a *linear* baseline (5 parameters). Compared to a Gaussian bump, this form's
heavier tails tend to fit limited data more stably and capture the broader influence
of the "badness" region.
The fitting routine uses L-BFGS-B with improved initial parameter guesses,
comprehensive dynamic bounds, and many initializations (specific heuristics plus
random sampling) to explore the non-convex objective and capture the U-shaped or
double-descent pattern. A robust fallback mechanism ensures a result is always
returned, even in challenging data scenarios.

Key improvements in this version:
- Further widened bounds for 'A' (amplitude) and 'w' (width) parameters to capture a broader range of U-shapes.
- Increased number of multiple initializations to enhance the optimizer's ability to find a global optimum in a non-convex landscape.
- More systematic generation of initial parameter guesses for 'A', 'x0', and 'w', combining linear/logarithmic spacing, random sampling, and strategic points to ensure comprehensive coverage of the parameter space.
- Enhanced numerical stability by explicitly nudging 'w' away from its lower bound if initial guesses are too close.
"""
import numpy as np
from scipy.optimize import minimize
from scipy.stats import linregress

def scaling_law_func(data_points, params):
    """
    Models a U-shaped relationship (performance worsens, then improves) using a
    Lorentzian-like peak on a linear baseline. This allows the brier_score
    (negative; more negative is better) to increase (worsen) and then decrease
    (improve) as log_flops grows.

    The model uses 5 parameters to stay within the parameter-count constraint while keeping the functional form simple.

    Parameters:
    data_points (np.ndarray): (N,1) array with columns [log_flops].
    params (list or np.ndarray): Array of 5 parameters [A, x0, w, B, C].
        A: Amplitude of the "badness" peak. A positive 'A' value will push
           the brier_score towards zero (worsening performance).
        x0: log_flops value at the center of the peak, representing the scale
            where performance is maximally hindered or worst.
        w: Width parameter of the peak. Controls how broad the "badness" region is.
           Must be positive.
        B: Slope of the underlying linear trend. Captures the overall long-term
           scaling behavior.
        C: Intercept of the underlying linear trend.

    Returns:
    np.ndarray: Predicted brier_score values (negative).
    """
    x = np.atleast_1d(np.asarray(data_points)).flatten() # Ensure x is 1D

    # Unpack parameters: A, x0, w, B, C (5 parameters)
    A, x0, w, B, C = params

    # Ensure 'w' is not too small to prevent division by zero or numerical instability.
    # A small positive value is used if w is non-positive or too close to zero.
    w_safe = np.maximum(w, 1e-9)
    
    # Lorentzian-like peak for "badness" + linear baseline
    # A positive A term creates a bump, pushing negative brier_scores towards zero (worsening).
    # B*x + C models the overall long-term scaling trend.
    pred = A / (1 + ((x - x0) / w_safe)**2) + B * x + C

    return pred

def fit_scaling_law(data_points, loss_values):
    """
    Fits the U-shaped scaling law function to data using L-BFGS-B with
    robust initial parameter guesses, comprehensive bounds, and multiple
    initializations to better explore the parameter space for a global minimum,
    especially for non-convex objective functions.

    Parameters:
    data_points (np.ndarray): (N,1) array with columns [log_flops].
    loss_values (np.ndarray): Array of corresponding brier_score values.

    Returns:
    np.ndarray: Optimized parameters [A, x0, w, B, C].
    """
    x = np.atleast_1d(np.asarray(data_points)).flatten()
    y = np.atleast_1d(np.asarray(loss_values)).flatten()

    # Handle edge case: very few data points, especially for linregress
    # Return a sensible default to avoid errors and ensure a result is always provided.
    if len(x) < 2:
        mean_x_safe = np.mean(x) if x.size > 0 else 0.0
        mean_y_safe = np.mean(y) if y.size > 0 else 0.0
        return np.array([0.01, mean_x_safe, 1.0, 0.0, mean_y_safe])

    # Objective function to minimize (Mean Squared Error)
    def objective(params):
        pred = scaling_law_func(x, params)
        mse = np.mean((pred - y) ** 2)
        return mse

    best_mse = np.inf
    best_params = None

    # --- Initial Parameter Guesses and Bounds Setup ---
    # 1. Linear regression for initial B (slope) and C (intercept)
    if np.std(x) < 1e-9: # x values are essentially constant
        slope = 0.0
        intercept = np.mean(y)
    else:
        slope, intercept, _, _, _ = linregress(x, y)
    B_base = slope
    C_base = intercept

    # 2. x0_range: Range for the center of the peak
    x_min, x_max = np.min(x), np.max(x)
    data_range = x_max - x_min
    
    # Robust calculation of x0 bounds and w bounds, handling small or zero data_range
    if data_range < 1e-6: # If x values are almost constant
        x0_bound_low = x_min - 1.0
        x0_bound_high = x_max + 1.0
        w_min_bound = 0.05 # Default for very narrow range
        w_max_bound = 10.0 # Default for very narrow range
    else:
        x0_bound_low = x_min - data_range * 0.2 # Wider range for x0
        x0_bound_high = x_max + data_range * 0.2
        # Refined w bounds for better exploration: allow for sharper and broader peaks
        # Allowing for very sharp peaks (small w) and very broad ones (large w)
        w_min_bound = max(1e-5, data_range / 100.0) 
        w_max_bound = max(5.0, data_range * 5.0, 15.0) # Increased cap for w_max
    
    x0_range_bounds = (x0_bound_low, x0_bound_high)

    # 3. A_base: Amplitude of the "badness" peak (must be positive)
    linear_pred = B_base * x + C_base
    residuals_from_baseline = y - linear_pred
    A_base = np.max(residuals_from_baseline) if np.max(residuals_from_baseline) > 0 else 0.01

    # Cap A_base to a reasonable value and ensure a minimum positive amplitude
    y_range = np.max(y) - np.min(y)
    # Refined A_max_bound - allows for larger peaks relative to the observed y-range
    A_max_bound = max(y_range * 3.0, 1.0) 
    A_base = min(A_base, A_max_bound * 0.75) if y_range > 0 else A_base
    if A_base < 0.001: A_base = 0.001 # Ensure a minimum positive amplitude

    # Define common bounds for all optimizations
    bounds = [
        (1e-6, A_max_bound),   # A (amplitude) must be positive and within a reasonable max.
        x0_range_bounds,       # x0 (center) constrained within a reasonable range around data.
        (w_min_bound, w_max_bound), # w (width) bounded by reasonable values.
        (None, None),          # B (slope) - no strong prior constraints.
        (None, None)           # C (intercept) - no strong prior constraints.
    ]
    
    # --- Multiple Initializations Loop ---
    num_inits = 70 # Increased number of different starting points for better exploration

    # Heuristic for initial x0: point of max residual from linear fit
    x0_peak_init_heuristic = np.mean(x) # Default if no clear peak
    if x.size > 1 and np.max(residuals_from_baseline) > 1e-6:
        x0_peak_init_heuristic = x[np.argmax(residuals_from_baseline)]

    # Generate varied initial guesses for A, x0, w.
    A_inits = np.unique(np.concatenate([
        np.linspace(max(1e-6, A_base * 0.05), A_max_bound, num_inits // 4),
        np.random.uniform(max(1e-6, A_base * 0.05), A_max_bound, num_inits // 4),
        [A_base, max(1e-6, A_base * 0.5), A_max_bound * 0.1, A_max_bound * 0.5, A_max_bound] # Strategic points
    ]))
    A_inits = A_inits[A_inits >= 1e-6] # Ensure A is positive
    A_inits = A_inits[:num_inits] # Trim if too many unique values

    x0_inits = np.unique(np.concatenate([
        np.linspace(x0_bound_low, x0_bound_high, num_inits // 4),
        np.random.uniform(x0_bound_low, x0_bound_high, num_inits // 4),
        [x0_peak_init_heuristic, np.mean(x), x_min, x_max, (x_min + x_max) / 2.0] # Strategic points
    ]))
    x0_inits = x0_inits[:num_inits]

    # Use logspace for w_inits to cover a broader range effectively
    # Handle cases where log_w_min >= log_w_max (e.g., if w_min_bound is very large, or w_max_bound is small)
    log_w_min = np.log10(w_min_bound) if w_min_bound > 0 else -10.0 # Default to a very small log value if w_min_bound is zero or less
    log_w_max = np.log10(w_max_bound) if w_max_bound > 0 else 10.0 # Default to a very large log value
    
    # Ensure log_w_min < log_w_max for logspace to work
    if log_w_min >= log_w_max: # If bounds are problematic, create a sensible default range
        log_w_min = np.log10(max(1e-6, w_min_bound))
        log_w_max = np.log10(max(1e-6, w_max_bound))
        if log_w_min >= log_w_max: # If still an issue, make a tiny range
            log_w_max = log_w_min + 1.0 # Create a small range for logspace

    w_inits = np.unique(np.concatenate([
        np.logspace(log_w_min, log_w_max, num_inits // 4),
        10**np.random.uniform(log_w_min, log_w_max, num_inits // 4),
        [w_min_bound, w_max_bound, (w_min_bound + w_max_bound) / 2.0, data_range / 2.0] # Strategic points; out-of-bound values are clamped to the bounds before optimization
    ]))
    w_inits = w_inits[w_inits >= 1e-9] # Ensure w is positive
    w_inits = w_inits[:num_inits]


    # Iterate through initial parameter combinations.
    # A single loop with modulo indexing cycles through the A/x0/w candidate lists,
    # so every candidate is visited and exactly num_inits optimization attempts are made.
    num_A = len(A_inits)
    num_x0 = len(x0_inits)
    num_w = len(w_inits)

    actual_inits_to_try = num_inits # Use num_inits as the target for actual optimization runs

    for i in range(actual_inits_to_try):
        current_A_init = A_inits[i % num_A]
        current_x0_init = x0_inits[i % num_x0]
        current_w_init = w_inits[i % num_w]

        initial_params = [current_A_init, current_x0_init, current_w_init, B_base, C_base]
        
        # Ensure initial_params respect bounds before optimization to prevent ValueErrors
        initial_params_clamped = []
        for j, (lower, upper) in enumerate(bounds):
            clamped_val = initial_params[j]
            if lower is not None:
                clamped_val = max(clamped_val, lower)
            if upper is not None:
                clamped_val = min(clamped_val, upper)
            initial_params_clamped.append(clamped_val)
        
        # Nudge 'w' slightly above its minimum bound if it's right on it, to avoid numerical instability
        if initial_params_clamped[2] <= bounds[2][0]: # Check for <= to catch values exactly at the bound
            initial_params_clamped[2] = bounds[2][0] + 1e-9 

        try:
            result = minimize(objective, initial_params_clamped, method='L-BFGS-B', bounds=bounds,
                              options={'maxiter': 5000, 'ftol': 1e-9, 'gtol': 1e-9, 'disp': False})
            
            # Check for successful convergence and finite parameters
            if result.success and np.all(np.isfinite(result.x)) and result.fun < best_mse:
                best_mse = result.fun
                best_params = result.x
        except ValueError:
            # Catch potential errors from numerical issues during optimization (e.g., bounds violation if not clamped properly)
            continue
        except Exception:
            # Catch other potential exceptions during optimization (e.g., singular matrix)
            continue

    # Fallback: If no successful optimization found after multiple attempts,
    # perform one final robust optimization with a central initial guess.
    if best_params is None:
        # For debugging: print(f"Warning: Multiple initializations failed. Attempting robust fallback.")
        fallback_A_init = A_base
        fallback_x0_init = x0_peak_init_heuristic
        
        # Use log-midpoint for fallback_w_init if log_w_min < log_w_max, otherwise use linear midpoint
        if log_w_min < log_w_max:
            fallback_w_init = 10**((log_w_min + log_w_max) / 2.0)
        else: 
            fallback_w_init = (w_min_bound + w_max_bound) / 2.0

        initial_params_fallback = [fallback_A_init, fallback_x0_init, fallback_w_init, B_base, C_base]
        
        # Ensure fallback parameters respect bounds
        initial_params_clamped_fallback = []
        for j, (lower, upper) in enumerate(bounds):
            clamped_val = initial_params_fallback[j]
            if lower is not None:
                clamped_val = max(clamped_val, lower)
            if upper is not None:
                clamped_val = min(clamped_val, upper)
            initial_params_clamped_fallback.append(clamped_val)
        
        # Nudge 'w' slightly above its minimum bound for fallback as well
        if initial_params_clamped_fallback[2] <= bounds[2][0]:
            initial_params_clamped_fallback[2] = bounds[2][0] + 1e-9

        result_fallback = minimize(objective, initial_params_clamped_fallback, method='L-BFGS-B', bounds=bounds,
                                   options={'maxiter': 5000, 'ftol': 1e-9, 'gtol': 1e-9, 'disp': False})
        
        if result_fallback.success and np.all(np.isfinite(result_fallback.x)):
            best_params = result_fallback.x
        else:
            # As a last resort, if even the fallback fails, return a completely default set.
            # For debugging: print(f"Warning: Fallback optimization failed. Message: {result_fallback.message}. Returning clamped initial parameters.")
            best_params = np.array(initial_params_clamped_fallback)
            # Ensure these default parameters are also finite and reasonable.
            if not np.all(np.isfinite(best_params)):
                best_params = np.array([0.01, 0.0, 1.0, 0.0, 0.0]) # Absolute default if clamping somehow failed

    return best_params
# EVOLVE-BLOCK-END
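
A minimal usage sketch of the best-run code above (a hedged illustration, not part of the submitted solution): it assumes scaling_law_func and fit_scaling_law from the listing are in scope, and the synthetic log_flops range, noise level, and "true" parameters are invented for demonstration only.

Python
import numpy as np

# Synthetic U-shaped data: a linear improving trend plus a "badness" bump near log_flops ~ 20.
# All values here are illustrative; they are not the benchmark's finetuning data.
rng = np.random.default_rng(0)
log_flops = np.linspace(16.0, 24.0, 40).reshape(-1, 1)
true_params = [0.15, 20.0, 1.2, -0.04, 0.3]  # [A, x0, w, B, C]
brier = scaling_law_func(log_flops, true_params) + rng.normal(0.0, 0.005, size=40)

# Fit the law and score it with a coefficient of determination (R^2),
# analogous to the metric reported on the leaderboard.
fitted = fit_scaling_law(log_flops, brier)
pred = scaling_law_func(log_flops, fitted)
ss_res = np.sum((brier - pred) ** 2)
ss_tot = np.sum((brier - np.mean(brier)) ** 2)
print("fitted [A, x0, w, B, C]:", np.round(fitted, 4))
print("R^2:", 1.0 - ss_res / ss_tot)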
#2 Run 2 R² = 0.925499
#3 Run 1 R² = 0.925380
#4 Run 3 R² = 0.924689
#5 Run 4 R² = 0.924689