
Data-Constrained Scaling Law

Agent: claude-code
Model: claude-sonnet-4-5
Best R²: 0.920974
Mean R²: 0.915505
Min R²: 0.914127
Runs: 5

All Runs (sorted by R²)

#1 (Best) Run 1 R² = 0.920974
Python
def law(input_data: list[dict[str, float]], group: str) -> list[dict[str, float]]:
    """
    Predicts output variables based on input variables according to a discovered scaling law.

    Args:
        input_data: A list of dictionaries, where each dictionary is a single data
                    point containing input variable names as keys and their
                    corresponding values.
        group: The name of the experimental group for which to make predictions.
                The functional form of the law must be the same for all groups,
                but the constant parameters/coefficients can differ per group.

    Returns:
        A list of dictionaries, corresponding to the input_data list, with each
        dictionary containing the predicted output variable(s).
    """

    # Fitted parameters for each group
    # The scaling law form: L = A/N^α + B/D_eff^β + E
    # where D_eff = U^γ * D^(1-γ) is the effective data considering repetition
    GROUP_PARAMS = {
        'all_data': {
            'A': 8.3711431840e+02,
            'alpha': 0.3742628023,
            'B': 1.9741512532e+03,
            'beta': 0.3464706122,
            'gamma': 0.1898222449,
            'E': 2.0896145867
        },
    }

    # Get parameters for the specified group
    if group not in GROUP_PARAMS:
        raise ValueError(f"Unknown group: {group}. Available groups: {list(GROUP_PARAMS.keys())}")

    params = GROUP_PARAMS[group]
    A = params['A']
    alpha = params['alpha']
    B = params['B']
    beta = params['beta']
    gamma = params['gamma']
    E = params['E']

    # Make predictions for each data point
    results = []
    for data_point in input_data:
        # Extract input variables
        N = data_point['params']  # Model parameters
        D = data_point['tokens']  # Total training tokens
        U = data_point['unique_tokens']  # Unique tokens in dataset

        # Calculate effective data
        # D_eff blends unique tokens and total tokens
        # When γ ≈ 0: D_eff ≈ D (repetition has full benefit)
        # When γ ≈ 1: D_eff ≈ U (repetition has no benefit)
        # Fitted γ ≈ 0.19 indicates repetition has substantial but diminishing benefit
        D_eff = (U ** gamma) * (D ** (1 - gamma))

        # Apply the scaling law: L = A/N^α + B/D_eff^β + E
        # A/N^α:     model-size term (larger models → lower loss)
        # B/D_eff^β: data term (more effective data → lower loss)
        # E:         irreducible loss (theoretical minimum)
        loss = A / (N ** alpha) + B / (D_eff ** beta) + E

        # Return prediction
        results.append({'loss': loss})

    return results
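
For illustration, here is a minimal usage sketch of the function above. The numeric inputs are hypothetical (not points from the benchmark data) and are chosen only to show the expected call signature and the effect of the D_eff blend under token repetition.

Python
# Minimal usage sketch. The two sample points below are hypothetical values,
# not benchmark data: same model size and token budget, differing only in how
# many of the training tokens are unique.
sample = [
    {'params': 1.0e9, 'tokens': 1.0e11, 'unique_tokens': 2.5e10},  # tokens repeated ~4x
    {'params': 1.0e9, 'tokens': 1.0e11, 'unique_tokens': 1.0e11},  # fully unique tokens
]
predictions = law(sample, group='all_data')
for point, pred in zip(sample, predictions):
    print(f"N={point['params']:.1e}  D={point['tokens']:.1e}  "
          f"U={point['unique_tokens']:.1e}  ->  loss={pred['loss']:.4f}")
# With gamma ≈ 0.19, D_eff for the repeated-data point shrinks only by a factor
# of 4**0.19 ≈ 1.3 relative to the fully unique point, so its predicted loss is
# only slightly higher.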
#2 Run 2 R² = 0.914154
#3 Run 3 R² = 0.914136
#4 Run 4 R² = 0.914136
#5 Run 5 R² = 0.914127