# Parameter Golf v2
When your entire trainable model fits in 25 megabytes, every trick matters. These six techniques come from the OpenAI Parameter Golf competition and MicroGPT — the art of making small models punch above their weight.
## Why This Matters Here
```mermaid
graph LR
    BB["Frozen Backbone<br/>3-7B params"] --> FUS["Fusion"] --> AH["Action Heads<br/>~6M params"]
    style BB fill:#1565c0,color:#fff
    style AH fill:#e65100,color:#fff
```
The backbone is frozen. All learning happens in ~6M parameters. The action heads are our small model. Every technique from a competition to minimize perplexity per parameter byte applies directly to making a robot move better with fewer weights.
## Six Techniques
### 1. ReLU² — Smoother Gradients
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLUSquared(nn.Module):
    """max(0, x)² — smoother than ReLU, cheaper than GELU."""
    def forward(self, x):
        return F.relu(x).square()
```
The gradient is 2·max(0,x) — continuous at zero, unlike ReLU's cliff. For action heads, this means smoother trajectories. Cheaper than GELU's exp().
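A quick autograd check, not from the codebase, confirms the gradient: it is zero at and below zero, and 2·max(0, x) elsewhere.

```python
x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
ReLUSquared()(x).sum().backward()
print(x.grad)  # tensor([0., 0., 4.]), i.e. 2 * relu(x)
```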
### 2. RMSNorm — Half the Cost
```python
class RMSNorm(nn.Module):
    """x / √(mean(x²)) — no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps, self.weight = eps, nn.Parameter(torch.ones(dim))  # gain only, no bias

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight
```
LayerNorm subtracts the mean AND divides by std. RMSNorm just divides by the root mean square. Same training stability, fewer operations.
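Dropping the bias also halves the norm layer's parameter count, which is on theme here. A quick comparison using the RMSNorm above (the hidden size of 512 is just an example):

```python
dim = 512
print(sum(p.numel() for p in nn.LayerNorm(dim).parameters()))  # 1024 (weight + bias)
print(sum(p.numel() for p in RMSNorm(dim).parameters()))       # 512  (weight only)
```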
### 3. Learnable Residual Scales
```python
self.residual_scales = nn.ParameterList([nn.Parameter(torch.ones(1)) for _ in range(n)])
# In forward: h = h + α · h_in  (α is learned, starts at 1.0)
```
Frozen backbone features dominate early in training. The learned α lets the network decide — per layer — how much residual signal to preserve. The network learns its own trust ratio.
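A minimal sketch of how a block could apply the scale; the class name and layer shapes here are illustrative, not Neon's actual decoder:

```python
class ScaledResidualBlock(nn.Module):
    """Illustrative residual block with a learned per-layer scale."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), ReLUSquared(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.ones(1))  # starts at 1.0, learned

    def forward(self, h_in):
        return self.mlp(h_in) + self.alpha * h_in  # h = h + α · h_in
```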
### 4. U-Net Skip Connections
In 3–4 layer MLPs, middle layers can become gradient bottlenecks. Skipping from layer 0 to the output creates a gradient highway — the signal can always flow directly. Only activates for num_layers >= 3.
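A sketch of the idea, assuming the skip simply adds the first layer's activations back in before the output; the real decoder may wire it differently:

```python
class SkipMLP(nn.Module):
    """Illustrative MLP head with a U-Net-style skip from layer 0 to the output."""
    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.act = ReLUSquared()
        self.use_skip = num_layers >= 3  # only activates for deeper heads

    def forward(self, x):
        h0 = self.act(self.layers[0](x))  # earliest features
        h = h0
        for layer in self.layers[1:]:
            h = self.act(layer(h))
        return h + h0 if self.use_skip else h  # gradient highway past the middle layers
```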
### 5. Logit Soft-Capping
```python
def soft_cap(x, cap=1.0):
    """cap · tanh(x / cap) — approaches ±cap asymptotically, never hits it."""
    return cap * torch.tanh(x / cap)
```
```mermaid
graph LR
    subgraph "Hard Tanh"
        HT["Hits wall at ±1<br/>Gradient dies"]
    end
    subgraph "Soft-Cap"
        SC["Approaches ±1 smoothly<br/>Gradient always flows"]
    end
    style SC fill:#1b5e20,color:#fff
    style HT fill:#b71c1c,color:#fff
```
Hard Tanh clamps to ±1, so its gradient is exactly zero past the boundary: when an action prediction saturates, the optimizer can no longer fine-tune it. Soft-capping keeps a nonzero gradient at the extremes, letting the model make precise adjustments where they matter most.
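A small demonstration of the difference, using F.hardtanh as the hard clamp; the numbers are for a single saturated logit:

```python
x = torch.tensor([2.0], requires_grad=True)
F.hardtanh(x).sum().backward()
print(x.grad)  # tensor([0.]): clipped, gradient is dead

x.grad = None
soft_cap(x, cap=1.0).sum().backward()
print(x.grad)  # tensor([0.0707]): small but nonzero, still trainable
```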
### 6. Optimizer Tuning
| Parameter | Standard | Parameter Golf | Why |
|---|---|---|---|
| `adam_β₁` | 0.9 | 0.85 | Faster adaptation — action distributions shift mid-training |
| `adam_β₂` | 0.999 | 0.99 | Less momentum lag |
| `max_grad_norm` | 1.0 | 0.3 | Small heads diverge easily — clip hard |
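In PyTorch terms this maps onto AdamW's betas plus gradient clipping. A sketch where `action_heads` and the learning rate are placeholders, not values from Neon's trainer:

```python
optimizer = torch.optim.AdamW(
    action_heads.parameters(),  # placeholder: the ~6M-parameter heads
    lr=1e-4,                    # placeholder learning rate
    betas=(0.85, 0.99),         # adam_β₁, adam_β₂ from the table
)

# After loss.backward(), clip hard before stepping:
torch.nn.utils.clip_grad_norm_(action_heads.parameters(), max_norm=0.3)
optimizer.step()
```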
## Configuration
All v2 features are on by default but individually togglable:
```python
from neon.model.action_heads import ActionHeadConfig

# Full v2 (default — all tricks enabled)
config = ActionHeadConfig(
    use_relu_squared=True,
    use_rmsnorm=True,
    use_residual_scale=True,
    use_skip_connections=True,
    soft_cap_value=1.0,
)

# Legacy mode (disable everything)
config = ActionHeadConfig(
    use_relu_squared=False,      # → GELU
    use_rmsnorm=False,           # → Identity
    use_residual_scale=False,    # → No learned scales
    use_skip_connections=False,  # → No U-Net skip
    soft_cap_value=0.0,          # → Hard Tanh
)
```
## Origin Story
These techniques were discovered by analyzing winning entries in OpenAI's Parameter Golf competition, where contestants minimized perplexity per parameter byte. The pure-Python implementation lives in MicroGPT's engine.py — built from scratch with no PyTorch, no dependencies, just math. We ported the concepts to PyTorch for Neon's action decoders, where every saved byte means a faster robot.