Action Chunking

A robot that predicts one action at a time is a robot that stutters. Neon predicts 16 future actions simultaneously, and this single design decision yields smoother motion, less error accumulation, and a higher effective control rate.


The Problem

A naive policy: observe → predict 1 action → execute → observe → predict → ...

Three things go wrong:

  1. Compounding errors — Each prediction uses the previous noisy prediction as context. Small errors snowball into catastrophic drift.
  2. Jerky motion — Independent per-step predictions have no temporal consistency. The robot vibrates.
  3. Latency bottleneck — If inference takes 80ms, you're stuck at 12.5 Hz. The robot is always reacting to the past.

The Solution

Predict a chunk of the future at once:

Observe → Predict 16 actions → Execute action[0] → Execute action[1] → ... → Re-observe → Predict 16 more

sequenceDiagram
    participant Camera
    participant Model
    participant Robot

    Camera->>Model: Frame t
    Model->>Robot: Actions [t, t+1, ..., t+15]
    Robot->>Robot: Execute action[t]
    Robot->>Robot: Execute action[t+1]
    Robot->>Robot: Execute action[t+2]
    Note over Robot: Continue executing chunk...
    Camera->>Model: Frame t+K (re-observe)
    Model->>Robot: New chunk [t+K, ..., t+K+15]
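
In code, the loop above looks something like this minimal sketch, where camera, model, and robot stand in for Neon's actual interfaces and predict is assumed to return one 16-step chunk:

import time

CONTROL_HZ = 50  # Robot control rate

def control_loop(camera, model, robot):
    while True:
        frame = camera.read()               # Observe once per chunk
        chunk = model.predict(frame)        # (16, action_dim) array of actions
        for action in chunk:                # Execute the whole chunk open-loop
            robot.execute(action)
            time.sleep(1 / CONTROL_HZ)      # Hold the 50 Hz step rate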

Why It Works

Temporal Consistency

The model is forced to predict a coherent trajectory, not isolated snapshots. The loss penalizes the entire sequence. The model learns that step 3 must follow smoothly from step 2:

loss = MSE(predicted_chunk, target_chunk)  # Shape: (batch, 16, action_dim)
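
For concreteness, here is a minimal sketch of that loss on dummy tensors; the shapes are what matter, and none of this is Neon's actual training code:

import torch
import torch.nn.functional as F

batch, num_steps, action_dim = 32, 16, 7        # e.g. a 7-DoF arm (assumed)
predicted_chunk = torch.randn(batch, num_steps, action_dim)
target_chunk = torch.randn(batch, num_steps, action_dim)

# One scalar loss over the whole trajectory: every step is supervised jointly,
# so step 3 cannot be optimized in isolation from step 2.
loss = F.mse_loss(predicted_chunk, target_chunk)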

Fewer Compounding Steps

With chunks of 16 at 50 Hz, you re-predict every 320ms instead of every 20ms. That's 16× fewer points where errors can compound.
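
Spelled out in code:

chunk_size = 16
control_hz = 50
step_ms = 1000 / control_hz          # 20 ms per executed action
repredict_ms = chunk_size * step_ms  # 320 ms between model predictions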

Amortized Inference

If your model takes 50ms to run, a single-step policy runs at 20 Hz. With chunking, you get 16 actions for that 50ms. The robot executes them at full 50 Hz while the next chunk computes in the background.
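
One way to realize this, sketched here with hypothetical names rather than Neon's actual server internals, is to run inference on a background thread while the control thread drains the current chunk:

import queue
import threading
import time

def inference_worker(camera, model, chunks):
    while True:
        frame = camera.read()
        chunks.put(model.predict(frame))     # ~50 ms of compute, off the control path

def control_worker(robot, chunks, control_hz=50):
    while True:
        chunk = chunks.get()                 # Next 16 actions, already computed
        for action in chunk:
            robot.execute(action)            # Full 50 Hz while the next chunk computes
            time.sleep(1 / control_hz)

# Wiring (given a camera, model, and robot):
# chunks = queue.Queue(maxsize=1)            # Worker stays at most one chunk ahead
# threading.Thread(target=inference_worker, args=(camera, model, chunks), daemon=True).start()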


How Neon Does It

The ActionChunkingHead uses learnable temporal step embeddings — the model learns what "now" means versus "300ms from now":

import torch
import torch.nn as nn

class ActionChunkingHead(nn.Module):
    def __init__(self, input_dim, action_dim, num_steps=16, hidden_dim=512):
        super().__init__()
        self.num_steps = num_steps
        self.step_embed = nn.Embedding(num_steps, hidden_dim)   # Per-step identity
        self.feature_proj = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Sequential(...)                       # ReLU², RMSNorm, soft-cap; maps hidden_dim*2 → action_dim

    def forward(self, features):
        feat = self.feature_proj(features)                      # (batch, hidden_dim)
        step_ids = torch.arange(self.num_steps, device=feat.device)
        step_emb = self.step_embed(step_ids)                    # (num_steps, hidden_dim)

        feat = feat.unsqueeze(1).expand(-1, self.num_steps, -1)        # Broadcast features to all steps
        step_emb = step_emb.unsqueeze(0).expand(feat.size(0), -1, -1)  # Broadcast embeddings over batch
        combined = torch.cat([feat, step_emb], dim=-1)          # (batch, num_steps, hidden_dim*2)
        return self.decoder(combined)                           # (batch, num_steps, action_dim)

The same visual features are decoded with different step embeddings for each future timestep. Step 0 learns "what to do now". Step 15 learns "what to do 300ms from now". Same observation, different temporal perspectives.
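
Once the elided decoder is filled in, usage is a one-liner; the dimensions below are illustrative, not Neon's defaults:

head = ActionChunkingHead(input_dim=768, action_dim=7)   # e.g. ViT features, 7-DoF arm
features = torch.randn(32, 768)                          # One observation per batch element
actions = head(features)                                 # (32, 16, 7): one full chunk per observation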


Choosing Chunk Size

Chunk | Control Freq | Re-predict Every | Character
1     | 50 Hz        | 20 ms            | No chunking; the baseline
4     | 50 Hz        | 80 ms            | Minimal smoothing, fast reactions
8     | 50 Hz        | 160 ms           | Good balance: reactive + smooth
16    | 50 Hz        | 320 ms           | Neon's default: smooth manipulation
32    | 50 Hz        | 640 ms           | Very smooth but slow to react

Match chunk size to the task

Pouring water needs smooth trajectories → chunk=16. Catching a thrown ball needs fast reactions → chunk=4-8.


Temporal Smoothing at Inference

Neon also applies exponential moving average smoothing to blend successive chunks:

# In NeonInferenceServer — alpha=0.7 by default
new_actions = 0.7 * predicted + 0.3 * previous_actions

This prevents discontinuities at chunk boundaries. Higher alpha = more responsive. Lower alpha = smoother. Call server.reset() between episodes to clear the state.
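
A minimal sketch of this blend as a standalone class; the name ChunkSmoother is illustrative rather than Neon's API, but the formula and the reset semantics mirror the ones described above:

class ChunkSmoother:
    def __init__(self, alpha=0.7):
        self.alpha = alpha       # Higher = more responsive, lower = smoother
        self.prev = None

    def smooth(self, predicted_chunk):
        if self.prev is None:    # First chunk of an episode passes through unchanged
            blended = predicted_chunk
        else:
            blended = self.alpha * predicted_chunk + (1 - self.alpha) * self.prev
        self.prev = blended
        return blended

    def reset(self):
        self.prev = None         # Clear state between episodes, like server.reset()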


References

  • ACT (Zhao et al., 2023) — Introduced action chunking for imitation learning
  • Diffusion Policy (Chi et al., 2023) — Diffusion-based action sequence prediction
  • GR00T N1 (NVIDIA, 2025) — Flow-matching action head generating temporal chunks

Next: Real-Time Chunking — what happens when inference is slower than the robot