
Real-Time Chunking (RTC)

Why predicted actions go stale during inference, and how to fix it with delay-aware blending.


The Problem

Action chunking predicts 16 future actions at once. But inference takes time — ~50ms on Jetson Orin. During those 50ms, the robot keeps moving. By the time the new prediction arrives, the first few actions are stale.

Time ──────────────────────────────────────────────►

Chunk 1: [a₁ a₂ a₃ a₄ a₅ a₆ ... a₁₆]
              │ Inference starts (50ms @ 50Hz = ~3 steps)
              │ Robot executes a₃, a₄, a₅ while waiting
Chunk 2: [b₁ b₂ b₃ b₄ b₅ b₆ ... b₁₆]
          └── b₁, b₂, b₃ are stale — robot already passed these!

Naive approach: execute b₁ anyway. The robot jerks backward, then catches up. This is the chunk boundary problem — the most common source of discontinuous motion in VLA policies.


Two Fixes (No Denoiser Required)

Real-Time Chunking was introduced by Physical Intelligence for flow-matching policies like π₀. Their full approach uses gradient-based guidance during iterative denoising.

Neon uses a single-pass MLP — no iterative denoising. But two of RTC's three ideas are model-agnostic and apply directly:

1. Skip Stale Actions

Measure actual inference latency. Convert to timesteps. Skip them.

delay_steps = ⌈latency × control_freq⌉
            = ⌈0.050 × 50⌉
            = 3

Chunk 2: [b₁ b₂ b₃ | b₄ b₅ b₆ ... b₁₆]
          ─────────   ───────────────────
            skip 3     execute from here

The robot never receives stale commands. No jerk.
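A minimal sketch of the skip computation; the function name and the list-of-strings chunk are illustrative, not Neon's internals:

    import math

    def delay_steps(latency_s: float, control_freq_hz: float) -> int:
        """Steps of the incoming chunk that are already stale on arrival."""
        return math.ceil(latency_s * control_freq_hz)

    # 50 ms inference at 50 Hz control -> skip 3 steps
    chunk = [f"b{i}" for i in range(1, 17)]   # b1 .. b16
    skip = delay_steps(0.050, 50.0)           # ceil(2.5) = 3
    executable = chunk[skip:]                 # execution starts at b4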

2. Blend the Overlap

When the new chunk arrives, leftover actions from the old chunk are still valid predictions for the near future. Instead of throwing them away, blend old and new with decaying weights:

Old leftover:  [a₆  a₇  a₈  a₉  a₁₀  a₁₁  ...  ]
New chunk:     [b₄  b₅  b₆  b₇  b₈   b₉   ...  b₁₆]
Weights:       [0.9 0.7 0.5 0.3 0.1  0.0   ...  0.0 ]
                ─── high trust ───  ── transition ──  ── new only ──

Near the overlap start: trust the old prediction (the robot committed to it). Further out: trust the new prediction (it has fresher observations).
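A minimal sketch of the blend itself, assuming NumPy arrays of shape (steps, action_dim); the function name and shapes are illustrative rather than the ActionQueue's real signature:

    import numpy as np

    def blend_overlap(old_leftover: np.ndarray, new_chunk: np.ndarray,
                      weights: np.ndarray) -> np.ndarray:
        """Weighted mix over the overlap; weights are the trust in the OLD chunk."""
        blended = new_chunk.copy()
        n = min(len(old_leftover), len(weights), len(new_chunk))
        blended[:n] = (weights[:n, None] * old_leftover[:n]
                       + (1.0 - weights[:n, None]) * new_chunk[:n])
        return blended

    # Hypothetical 7-DoF actions, weights as in the diagram above
    old = np.random.randn(6, 7)          # a6 .. a11 leftover
    new = np.random.randn(13, 7)         # b4 .. b16
    w = np.array([0.9, 0.7, 0.5, 0.3, 0.1, 0.0])
    actions_to_execute = blend_overlap(old, new, w)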


Blend Schedules

Neon supports four blending strategies:

linear: Weights decay linearly from 1 → 0 across the overlap region. Good default for most tasks.

w(t) = 1 - t/horizon

exp: Faster initial decay, holds longer at the boundary. Smoother for fine manipulation.

w(t) = linear(t) × expm1(linear(t)) / (e - 1)

ema: Uniform weight across all overlapping steps. This is what Neon v1 did: a flat α/β split.

w = 0.3 everywhere in overlap

latest: No blending. Always use the newest prediction. Most responsive, but potential discontinuities.

w = 0 everywhere
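The four formulas, collected into one illustrative helper (not the library's actual code), with the 0.3 weight taken from the flat split described above:

    import numpy as np

    def blend_weights(schedule: str, horizon: int, ema_alpha: float = 0.3) -> np.ndarray:
        """Per-step trust in the OLD chunk over the overlap (illustrative only)."""
        t = np.arange(horizon)
        linear = 1.0 - t / horizon
        if schedule == "linear":
            return linear
        if schedule == "exp":
            return linear * np.expm1(linear) / (np.e - 1.0)
        if schedule == "ema":
            return np.full(horizon, ema_alpha)
        if schedule == "latest":
            return np.zeros(horizon)
        raise ValueError(f"unknown schedule: {schedule}")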

Visual Comparison

      OLD chunk          NEW chunk
      ──────────         ──────────

EMA:  ░░░░░░░░░░  →  ████████████
      uniform 30%       uniform 70%

Linear: █▓▓▒▒░░░  →  ░░▒▒▓▓██████
        1.0 → 0.0       0.0 → 1.0

Latest: ░░░░░░░░░░  →  ████████████
        ignore old       100% new

Usage

Default (RTC enabled)

import time

from neon.inference.server import NeonInferenceServer

server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",
    blend_schedule="linear",    # "linear" | "exp" | "ema" | "latest"
    execution_horizon=10,       # blend across 10 steps
    control_freq=50.0,          # robot control rate
)

# Control loop
while running:
    action = server.get_action()

    if action is None:
        # Queue empty — time for a new prediction
        server.predict(
            image=camera.read(),
            instruction="Pick up the red cup",
            proprioception=robot.get_joints(),
            rtc=True,   # ← delay-aware queue
        )
        action = server.get_action()

    robot.send(action)
    time.sleep(1/50)  # 50 Hz control

Legacy Mode (Neon v1)

output = server.predict(
    image=frame,
    instruction="Pick up the cup",
    proprioception=joints,
    smooth=True,  # ← EMA only, no queue
    rtc=False,
)

HTTP API

# Start with RTC
python -m neon.inference.server \
    --model cagataydev/neon-g1-v1 \
    --blend linear \
    --horizon 10 \
    --freq 50

# Predict + queue actions
curl -X POST http://localhost:8300/predict \
    -d '{"instruction": "wave", "rtc": true}'

# Pop next action from queue
curl -X POST http://localhost:8300/action
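A minimal Python client for the same two endpoints, assuming the default port and JSON responses; the exact response fields are whatever the server returns and are not specified here:

    import requests

    BASE = "http://localhost:8300"

    # Queue a prediction with the delay-aware (RTC) queue enabled
    requests.post(f"{BASE}/predict", json={"instruction": "wave", "rtc": True})

    # Pop the next action from the queue inside the control loop
    resp = requests.post(f"{BASE}/action")
    print(resp.json())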

Tuning Guide

Parameter           Default   Effect of ↑                               Effect of ↓
execution_horizon   10        Smoother transitions, slower to react     More responsive, potential jumps
control_freq        50        More granular delay compensation          Fewer steps skipped
blend_schedule      linear    see Blend Schedules above

Rules of thumb (expressed as presets in the sketch after this list):

  • Fine manipulation (pouring, insertion): exp schedule, horizon 12-16
  • Fast reaching (pick up objects): linear schedule, horizon 6-8
  • Locomotion (walking): linear schedule, horizon 10
  • Debugging: latest schedule (see raw predictions, no blending)
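The same rules of thumb, written as illustrative keyword-argument presets for NeonInferenceServer; the preset names and the midpoint horizon values are a convenience for this page, not a built-in registry:

    from neon.inference.server import NeonInferenceServer

    # Illustrative presets derived from the rules of thumb above
    RTC_PRESETS = {
        "fine_manipulation": dict(blend_schedule="exp", execution_horizon=14),
        "fast_reaching":     dict(blend_schedule="linear", execution_horizon=8),
        "locomotion":        dict(blend_schedule="linear", execution_horizon=10),
        "debugging":         dict(blend_schedule="latest", execution_horizon=10),
    }

    server = NeonInferenceServer(
        model_path="cagataydev/neon-g1-v1",
        control_freq=50.0,
        **RTC_PRESETS["fine_manipulation"],
    )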

What We Didn't Port (and Why)

RTC's full algorithm includes gradient-based prefix guidance — during each denoising step, it computes ∂x₁/∂xₜ and applies a correction that steers the prediction toward the previous chunk's trajectory.

This requires:

  1. Iterative denoising — a loop of N steps refining noise → actions
  2. Differentiable denoiser: torch.autograd.grad() through the model
  3. Time parameter τ — normalized denoising progress for guidance weight

Neon's action decoder is a single-pass MLP: one forward pass, done. No iteration, no τ, no gradient to steer. The prefix guidance is elegant but fundamentally tied to flow-matching architectures.
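For intuition only, a rough conceptual sketch of what such a guided denoising loop involves. This is not Neon code: the update rule is simplified Euler integration, and every name here is hypothetical.

    import torch

    def denoise_with_prefix_guidance(model, x, prev_prefix, steps=10, scale=1.0):
        # x: noisy action chunk; prev_prefix: overlapping actions from the old chunk
        dt = 1.0 / steps
        for i in range(steps):
            tau = i / steps                            # denoising progress τ
            x = x.detach().requires_grad_(True)
            v = model(x, tau)                          # predicted velocity field
            x1_hat = x + (1.0 - tau) * v               # current estimate of the clean chunk
            err = ((x1_hat[: len(prev_prefix)] - prev_prefix) ** 2).sum()
            grad = torch.autograd.grad(err, x)[0]      # gradient of prefix error w.r.t. xₜ
            with torch.no_grad():
                x = x + dt * v - scale * grad          # Euler step plus guidance correction
        return x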

The two things we did port — delay skipping and prefix blending — give 80% of the benefit with 0% of the complexity.

Future: if Neon adds a diffusion/flow head

If we add a diffusion-based action head (see Action Heads), full RTC guidance becomes possible. The ActionQueue is already designed to support it — the _blend_with_prefix method would be replaced by gradient-guided denoising.


Next: Data Soup — mixing data sources for robust training