Action Chunking¶
A robot that predicts one action at a time is a robot that stutters. Neon predicts 16 future actions simultaneously — and this single decision changes everything about control quality.
The Problem¶
A naive policy: observe → predict 1 action → execute → observe → predict → ...
Three things go wrong:
- Compounding errors — Each prediction uses the previous noisy prediction as context. Small errors snowball into catastrophic drift.
- Jerky motion — Independent per-step predictions have no temporal consistency. The robot vibrates.
- Latency bottleneck — If inference takes 80ms, you're stuck at 12 Hz. The robot is always reacting to the past.
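For reference, here is the naive loop sketched in code (a minimal sketch; `model`, `robot`, and `task_done` are illustrative stand-ins, not Neon APIs):

```python
# Naive single-step control: one full inference per executed action.
# `model`, `robot`, and `task_done` are hypothetical stand-ins.
while not task_done():
    obs = robot.observe()          # latest camera frame
    action = model.predict(obs)    # one forward pass (~80ms)
    robot.execute(action)          # one 20ms control step, then repeat
```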
The Solution¶
Predict a chunk of the future at once:
Observe → Predict 16 actions → Execute action[0] → Execute action[1] → ... → Re-observe → Predict 16 more
```mermaid
sequenceDiagram
    participant Camera
    participant Model
    participant Robot
    Camera->>Model: Frame t
    Model->>Robot: Actions [t, t+1, ..., t+15]
    Robot->>Robot: Execute action[t]
    Robot->>Robot: Execute action[t+1]
    Robot->>Robot: Execute action[t+2]
    Note over Robot: Continue executing chunk...
    Camera->>Model: Frame t+K (re-observe)
    Model->>Robot: New chunk [t+K, ..., t+K+15]
```
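The same control loop with chunking (again a sketch; `model.predict_chunk`, `robot`, and `task_done` are hypothetical stand-ins):

```python
# Chunked control: one inference buys CHUNK executed actions.
CHUNK = 16
while not task_done():
    obs = robot.observe()
    actions = model.predict_chunk(obs)  # shape (CHUNK, action_dim)
    for action in actions:              # execute the whole chunk open-loop
        robot.execute(action)           # 20ms per step at 50 Hz
```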
Why It Works¶
Temporal Consistency¶
The model is forced to predict a coherent trajectory, not isolated snapshots. The loss penalizes the entire sequence. The model learns that step 3 must follow smoothly from step 2.
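A sequence-level objective makes this concrete. Below is a minimal sketch using a plain MSE over the whole chunk; Neon's actual loss may differ:

```python
import torch.nn.functional as F

# pred:   (batch, 16, action_dim), the predicted action chunk
# target: (batch, 16, action_dim), the next 16 demonstrated actions
loss = F.mse_loss(pred, target)  # all 16 steps are supervised jointly
```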
Fewer Compounding Steps¶
With chunks of 16 at 50 Hz, you re-predict every 320ms instead of every 20ms. That's 16× fewer points where errors can compound.
Amortized Inference¶
If your model takes 50ms to run, a single-step policy runs at 20 Hz. With chunking, you get 16 actions for that 50ms. The robot executes them at full 50 Hz while the next chunk computes in the background.
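One way to overlap execution with inference is double buffering on a background thread (an illustrative sketch, not Neon's server code; `model` and `robot` are stand-ins):

```python
import threading

def control_loop(model, robot, task_done):
    """Execute the current chunk while the next one computes in the background."""
    chunk = model.predict_chunk(robot.observe())
    while not task_done():
        result = {}

        def infer():
            result["chunk"] = model.predict_chunk(robot.observe())

        worker = threading.Thread(target=infer)
        worker.start()              # ~50ms of inference, off the control path
        for action in chunk:        # 16 actions x 20ms = 320ms of execution
            robot.execute(action)
        worker.join()               # inference finished well before the chunk did
        chunk = result["chunk"]
```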
How Neon Does It¶
The ActionChunkingHead uses learnable temporal step embeddings — the model learns what "now" means versus "300ms from now":
```python
import torch
import torch.nn as nn

class ActionChunkingHead(nn.Module):
    def __init__(self, input_dim, action_dim, num_steps=16, hidden_dim=512):
        super().__init__()
        self.num_steps = num_steps
        self.step_embed = nn.Embedding(num_steps, hidden_dim)  # Per-step identity
        self.feature_proj = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Sequential(...)  # ReLU², RMSNorm, soft-cap

    def forward(self, features):
        feat = self.feature_proj(features)                       # (batch, hidden_dim)
        steps = torch.arange(self.num_steps, device=feat.device)
        step_emb = self.step_embed(steps)                        # (16, hidden_dim)
        feat = feat.unsqueeze(1).expand(-1, self.num_steps, -1)  # Broadcast to all steps
        step_emb = step_emb.unsqueeze(0).expand(feat.size(0), -1, -1)
        combined = torch.cat([feat, step_emb], dim=-1)           # (batch, 16, hidden_dim*2)
        return self.decoder(combined)                            # (batch, 16, action_dim)
```
The same visual features are decoded with different step embeddings for each future timestep. Step 0 learns "what to do now". Step 15 learns "what to do 300ms from now". Same observation, different temporal perspectives.
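A quick shape check (assuming the elided decoder is filled in with layers mapping `hidden_dim * 2` down to `action_dim`; the sizes here are illustrative):

```python
head = ActionChunkingHead(input_dim=1024, action_dim=7, num_steps=16)
features = torch.randn(2, 1024)  # a batch of 2 fused observation features
actions = head(features)
print(actions.shape)             # torch.Size([2, 16, 7]): 16 future actions per sample
```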
Choosing Chunk Size¶
| Chunk | Control Freq | Re-predict Every | Character |
|---|---|---|---|
| 1 | 50 Hz | 20ms | No chunking. The baseline. |
| 4 | 50 Hz | 80ms | Minimal smoothing, fast reactions |
| 8 | 50 Hz | 160ms | Good balance — reactive + smooth |
| 16 | 50 Hz | 320ms | Neon's default — smooth manipulation |
| 32 | 50 Hz | 640ms | Very smooth but slow to react |
Match chunk size to the task
Pouring water needs smooth trajectories → chunk=16. Catching a thrown ball needs fast reactions → chunk=4-8.
Temporal Smoothing at Inference¶
Neon also applies exponential moving average smoothing to blend successive chunks:
```python
# In NeonInferenceServer — alpha=0.7 by default
new_actions = 0.7 * predicted + 0.3 * previous_actions
```
This prevents discontinuities at chunk boundaries. Higher alpha = more responsive. Lower alpha = smoother. Call server.reset() between episodes to clear the state.
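Unrolled into a small stateful helper, the blending looks like this (a sketch of the idea; the real state lives inside `NeonInferenceServer`):

```python
class ChunkSmoother:
    """Exponential moving average across successive chunks (sketch)."""

    def __init__(self, alpha=0.7):
        self.alpha = alpha
        self.previous = None

    def smooth(self, predicted):
        if self.previous is None:  # first chunk of the episode
            self.previous = predicted
            return predicted
        blended = self.alpha * predicted + (1 - self.alpha) * self.previous
        self.previous = blended
        return blended

    def reset(self):  # call between episodes, like server.reset()
        self.previous = None
```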
References¶
- ACT (Zhao et al., 2023) — Introduced action chunking for imitation learning
- Diffusion Policy (Chi et al., 2023) — Diffusion-based action sequence prediction
- GR00T N1 (NVIDIA, 2025) — VQ-BeT action tokenization with temporal chunking
→ Next: Real-Time Chunking — what happens when inference is slower than the robot