Action Chunking

A robot that predicts one action at a time is a robot that stutters. Neon predicts 16 future actions simultaneously, and this single design decision yields smoother motion, less error accumulation, and a higher effective control rate.


The Problem

A naive policy: observe → predict 1 action → execute → observe → predict → ...

Three things go wrong:

  1. Compounding errors — Each prediction uses the previous noisy prediction as context. Small errors snowball into catastrophic drift.
  2. Jerky motion — Independent per-step predictions have no temporal consistency. The robot vibrates.
  3. Latency bottleneck — If inference takes 80ms, you're stuck at 12.5 Hz. The robot is always reacting to the past.

The Solution

Predict a chunk of the future at once:

Observe → Predict 16 actions → Execute action[0] → Execute action[1] → ... → Re-observe → Predict 16 more

sequenceDiagram
    participant Camera
    participant Model
    participant Robot

    Camera->>Model: Frame t
    Model->>Robot: Actions [t, t+1, ..., t+15]
    Robot->>Robot: Execute action[t]
    Robot->>Robot: Execute action[t+1]
    Robot->>Robot: Execute action[t+2]
    Note over Robot: Continue executing chunk...
    Camera->>Model: Frame t+K (re-observe)
    Model->>Robot: New chunk [t+K, ..., t+K+15]
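
In code, the loop above looks something like this minimal sketch, where camera, model, and robot stand in for Neon's actual interfaces and predict is assumed to return one 16-step chunk:

import time

CONTROL_HZ = 50  # Robot control rate

def control_loop(camera, model, robot):
    while True:
        frame = camera.read()               # Observe once per chunk
        chunk = model.predict(frame)        # (16, action_dim) array of actions
        for action in chunk:                # Execute the whole chunk open-loop
            robot.execute(action)
            time.sleep(1 / CONTROL_HZ)      # Hold the 50 Hz step rate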

Why It Works

Temporal Consistency

The model is forced to predict a coherent trajectory, not isolated snapshots. The loss penalizes the entire sequence. The model learns that step 3 must follow smoothly from step 2:

loss = MSE(predicted_chunk, target_chunk)  # Shape: (batch, 16, action_dim)
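
For concreteness, here is a minimal sketch of that loss on dummy tensors; the shapes are what matter, and none of this is Neon's actual training code:

import torch
import torch.nn.functional as F

batch, num_steps, action_dim = 32, 16, 7        # e.g. a 7-DoF arm (assumed)
predicted_chunk = torch.randn(batch, num_steps, action_dim)
target_chunk = torch.randn(batch, num_steps, action_dim)

# One scalar loss over the whole trajectory: every step is supervised jointly,
# so step 3 cannot be optimized in isolation from step 2.
loss = F.mse_loss(predicted_chunk, target_chunk)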

Fewer Compounding Steps

With chunks of 16 at 50 Hz, you re-predict every 320ms instead of every 20ms. That's 16× fewer points where errors can compound.
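
Spelled out in code:

chunk_size = 16
control_hz = 50
step_ms = 1000 / control_hz          # 20 ms per executed action
repredict_ms = chunk_size * step_ms  # 320 ms between model predictions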

Amortized Inference

If your model takes 50ms to run, a single-step policy runs at 20 Hz. With chunking, you get 16 actions for that 50ms. The robot executes them at full 50 Hz while the next chunk computes in the background.
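
One way to realize this, sketched here with hypothetical names rather than Neon's actual server internals, is to run inference on a background thread while the control thread drains the current chunk:

import queue
import threading
import time

def inference_worker(camera, model, chunks):
    while True:
        frame = camera.read()
        chunks.put(model.predict(frame))     # ~50 ms of compute, off the control path

def control_worker(robot, chunks, control_hz=50):
    while True:
        chunk = chunks.get()                 # Next 16 actions, already computed
        for action in chunk:
            robot.execute(action)            # Full 50 Hz while the next chunk computes
            time.sleep(1 / control_hz)

# Wiring (given a camera, model, and robot):
# chunks = queue.Queue(maxsize=1)            # Worker stays at most one chunk ahead
# threading.Thread(target=inference_worker, args=(camera, model, chunks), daemon=True).start()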


How Neon Does It

The ActionChunkingHead uses learnable temporal step embeddings — the model learns what "now" means versus "300ms from now":

import torch
import torch.nn as nn

class ActionChunkingHead(nn.Module):
    def __init__(self, input_dim, action_dim, num_steps=16, hidden_dim=512):
        super().__init__()
        self.num_steps = num_steps
        self.step_embed = nn.Embedding(num_steps, hidden_dim)   # Per-step identity
        self.feature_proj = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Sequential(...)                       # ReLU², RMSNorm, soft-cap; maps hidden_dim*2 → action_dim

    def forward(self, features):
        feat = self.feature_proj(features)                      # (batch, hidden_dim)
        step_ids = torch.arange(self.num_steps, device=feat.device)
        step_emb = self.step_embed(step_ids)                    # (num_steps, hidden_dim)

        feat = feat.unsqueeze(1).expand(-1, self.num_steps, -1)        # Broadcast features to all steps
        step_emb = step_emb.unsqueeze(0).expand(feat.size(0), -1, -1)  # Broadcast embeddings over batch
        combined = torch.cat([feat, step_emb], dim=-1)          # (batch, num_steps, hidden_dim*2)
        return self.decoder(combined)                           # (batch, num_steps, action_dim)

The same visual features are decoded with different step embeddings for each future timestep. Step 0 learns "what to do now". Step 15 learns "what to do 300ms from now". Same observation, different temporal perspectives.
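
Once the elided decoder is filled in, usage is a one-liner; the dimensions below are illustrative, not Neon's defaults:

head = ActionChunkingHead(input_dim=768, action_dim=7)   # e.g. ViT features, 7-DoF arm
features = torch.randn(32, 768)                          # One observation per batch element
actions = head(features)                                 # (32, 16, 7): one full chunk per observation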


Choosing Chunk Size

Chunk | Control Freq | Re-predict Every | Character
1     | 50 Hz        | 20 ms            | No chunking; the baseline
4     | 50 Hz        | 80 ms            | Minimal smoothing, fast reactions
8     | 50 Hz        | 160 ms           | Good balance: reactive + smooth
16    | 50 Hz        | 320 ms           | Neon's default: smooth manipulation
32    | 50 Hz        | 640 ms           | Very smooth but slow to react

Match chunk size to the task

Pouring water needs smooth trajectories → chunk=16. Catching a thrown ball needs fast reactions → chunk=4-8.


Temporal Smoothing at Inference

Neon also applies exponential moving average smoothing to blend successive chunks:

# In NeonInferenceServer — alpha=0.7 by default
new_actions = 0.7 * predicted + 0.3 * previous_actions

This prevents discontinuities at chunk boundaries. Higher alpha = more responsive. Lower alpha = smoother. Call server.reset() between episodes to clear the state.
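
A minimal sketch of this blend as a standalone class; the name ChunkSmoother is illustrative rather than Neon's API, but the formula and the reset semantics mirror the ones described above:

class ChunkSmoother:
    def __init__(self, alpha=0.7):
        self.alpha = alpha       # Higher = more responsive, lower = smoother
        self.prev = None

    def smooth(self, predicted_chunk):
        if self.prev is None:    # First chunk of an episode passes through unchanged
            blended = predicted_chunk
        else:
            blended = self.alpha * predicted_chunk + (1 - self.alpha) * self.prev
        self.prev = blended
        return blended

    def reset(self):
        self.prev = None         # Clear state between episodes, like server.reset()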


References

  • ACT (Zhao et al., 2023) — Introduced action chunking for imitation learning
  • Diffusion Policy (Chi et al., 2023) — Diffusion-based action sequence prediction
  • GR00T N1 (NVIDIA, 2025) — Flow-matching action head generating temporal chunks

Next: Real-Time Chunking — what happens when inference is slower than the robot