
Parameter Golf v2

When your entire trainable model fits in 25 megabytes, every trick matters. These six techniques come from the OpenAI Parameter Golf competition and MicroGPT — the art of making small models punch above their weight.


Why This Matters Here

graph LR
    BB["Frozen Backbone<br/>3-7B params"] --> FUS["Fusion"] --> AH["Action Heads<br/>~6M params"]

    style BB fill:#1565c0,color:#fff
    style AH fill:#e65100,color:#fff

The backbone is frozen. All learning happens in ~6M parameters. The action heads are our small model. Every technique from a competition built around minimizing perplexity per parameter byte applies directly here: making a robot move better with fewer weights.
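
A minimal sketch of that split, with hypothetical backbone and action_heads modules standing in for the real ones: freeze every backbone parameter, and only the head parameters remain trainable.

import torch.nn as nn

# Hypothetical stand-ins; sizes are illustrative, not Neon's real shapes.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8)
action_heads = nn.Sequential(nn.Linear(512, 256), nn.GELU(), nn.Linear(256, 7))

for p in backbone.parameters():
    p.requires_grad = False          # frozen: no gradients ever reach the backbone

trainable = sum(p.numel() for p in action_heads.parameters() if p.requires_grad)
print(f"trainable head parameters: {trainable:,}")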


Six Techniques

1. ReLU² — Smoother Gradients

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLUSquared(nn.Module):
    """max(0, x)² — smoother than GELU, cheaper to compute."""
    def forward(self, x):
        return F.relu(x).square()

The gradient is 2·max(0,x) — continuous at zero, unlike ReLU's cliff. For action heads, this means smoother trajectories. Cheaper than GELU's exp().
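
A quick autograd check (not from the original docs) confirms that gradient and shows it tapering to zero smoothly instead of jumping:

import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 1e-3, 1.0], requires_grad=True)
F.relu(x).square().sum().backward()
print(x.grad)    # ≈ [0, 0, 0.002, 2.0], i.e. 2·max(0, x) at each point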

2. RMSNorm — Half the Cost

class RMSNorm(nn.Module):
    """x / √(mean(x²)) — no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps                               # numerical stability
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel gain
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

LayerNorm subtracts the mean AND divides by std. RMSNorm just divides by the root mean square. Same training stability, fewer operations.
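
A usage sketch (the width of 256 is illustrative): the class above drops in wherever nn.LayerNorm(dim) would appear.

import torch

norm = RMSNorm(dim=256)
y = norm(torch.randn(8, 256))
print(y.shape)               # torch.Size([8, 256])
print(y.pow(2).mean(-1))     # ≈ 1.0 per row: unit RMS, the mean is simply never touched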

3. Learnable Residual Scales

self.residual_scales = nn.ParameterList([nn.Parameter(torch.ones(1)) for _ in range(n)])
# In forward: h = h + α · h_in   (α is learned, starts at 1.0)

Frozen backbone features dominate early in training. The learned α lets the network decide — per layer — how much residual signal to preserve. The network learns its own trust ratio.
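
A minimal sketch of how those scales could be applied across a stack of blocks; the Linear blocks and names here are illustrative, not Neon's actual head layout.

import torch
import torch.nn as nn

class ScaledResidualStack(nn.Module):
    def __init__(self, dim=256, n=3):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n)])
        # One learnable scale per block, initialised to 1.0 so training starts as a plain residual.
        self.residual_scales = nn.ParameterList([nn.Parameter(torch.ones(1)) for _ in range(n)])

    def forward(self, h):
        for block, alpha in zip(self.blocks, self.residual_scales):
            h_in = h
            h = block(h)
            h = h + alpha * h_in     # α decides how much of the incoming signal to preserve
        return h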

4. U-Net Skip Connections

# Layer 0 output saved as 'skip'
# Last hidden layer: h = cat([h, skip], dim=-1)

In 3–4 layer MLPs, the middle layers can become gradient bottlenecks. Concatenating the layer-0 output into the last hidden layer creates a gradient highway: the signal can always flow straight back to the input. The skip only activates when num_layers >= 3 (sketched below).
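
A sketch of that wiring for a 3-layer head; the 2·dim output projection and layer sizes are assumptions for illustration, not the real ActionHead.

import torch
import torch.nn as nn

class SkipMLP(nn.Module):
    def __init__(self, dim=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        # The last projection sees [h, skip] concatenated, so it takes 2*dim inputs.
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):
        h = self.layers[0](x)
        skip = h                          # layer 0 output saved as the skip
        for layer in self.layers[1:]:
            h = layer(h)
        return self.out(torch.cat([h, skip], dim=-1))   # gradient highway back to layer 0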

5. Logit Soft-Capping

def soft_cap(x, cap=1.0):
    """cap · tanh(x / cap) — approaches ±1 asymptotically, never hits it."""
    return cap * torch.tanh(x / cap)

graph LR
    subgraph "Hard Tanh"
        HT["Hits wall at ±1<br/>Gradient dies"]
    end
    subgraph "Soft-Cap"
        SC["Approaches ±1 smoothly<br/>Gradient always flows"]
    end

    style SC fill:#1b5e20,color:#fff
    style HT fill:#b71c1c,color:#fff

Hard Tanh kills gradients at the boundary. When an action prediction is near ±1, the optimizer can't fine-tune it. Soft-capping maintains gradient flow at the extremes, letting the model make precise adjustments where they matter most.
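
A small gradient comparison (a sanity check, not Neon code) makes this concrete: at x = 3 the hard clamp's gradient is exactly zero, while the soft cap still passes a usable signal.

import torch
import torch.nn.functional as F

def soft_cap(x, cap=1.0):
    return cap * torch.tanh(x / cap)

x = torch.tensor(3.0, requires_grad=True)

F.hardtanh(x).backward()          # hard Tanh: output pinned at 1.0
print(x.grad)                     # tensor(0.), the gradient is dead

x.grad = None
soft_cap(x).backward()            # soft cap: output ≈ 0.995
print(x.grad)                     # ≈ 0.0099, the optimizer can still nudge it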

6. Optimizer Tuning

| Parameter | Standard | Parameter Golf | Why |
|---|---|---|---|
| adam_β₁ | 0.9 | 0.85 | Faster adaptation — action distributions shift mid-training |
| adam_β₂ | 0.999 | 0.99 | Less momentum lag |
| max_grad_norm | 1.0 | 0.3 | Small heads diverge easily — clip hard |
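
In a standard PyTorch loop these values land in the AdamW constructor and the gradient clip. A hedged sketch; the learning rate, dummy loss, and placeholder module are illustrative only.

import torch
import torch.nn as nn

action_heads = nn.Linear(512, 7)        # placeholder for the real trainable heads

optimizer = torch.optim.AdamW(
    action_heads.parameters(),
    lr=3e-4,                            # illustrative; the table does not prescribe a learning rate
    betas=(0.85, 0.99),                 # adam_β₁, adam_β₂
)

loss = action_heads(torch.randn(4, 512)).pow(2).mean()   # dummy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(action_heads.parameters(), max_norm=0.3)   # max_grad_norm
optimizer.step()
optimizer.zero_grad()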

Configuration

All v2 features are on by default but individually togglable:

from neon.model.action_heads import ActionHeadConfig

# Full v2 (default — all tricks enabled)
config = ActionHeadConfig(
    use_relu_squared=True,
    use_rmsnorm=True,
    use_residual_scale=True,
    use_skip_connections=True,
    soft_cap_value=1.0,
)

# Legacy mode (disable everything)
config = ActionHeadConfig(
    use_relu_squared=False,      # → GELU
    use_rmsnorm=False,           # → Identity
    use_residual_scale=False,    # → No learned scales
    use_skip_connections=False,  # → No U-Net skip
    soft_cap_value=0.0,          # → Hard Tanh
)

Origin Story

These techniques were discovered by analyzing winning entries in OpenAI's Parameter Golf competition, where contestants minimized perplexity per parameter byte. The pure-Python implementation lives in MicroGPT's engine.py — built from scratch with no PyTorch, no dependencies, just math. We ported the concepts to PyTorch for Neon's action decoders, where every saved byte means a faster robot.