Inference & Hardware Control

From prediction to physical movement. The inference server, the G1 controller, and the closed loop that ties them together.


Inference Server

Load a trained model, predict actions, serve them fast:

from neon.inference.server import NeonInferenceServer

server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",  # HuggingFace ID or local path
    device="auto",                         # auto: CUDA > MPS > CPU
)

output = server.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
    smooth=True,                           # Temporal EMA smoothing
)

Temporal Smoothing

The server supports two modes for handling chunk transitions:

RTC mode (default, recommended): Delay-aware action queue with prefix blending. See Real-Time Chunking for the full explanation.

server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",
    blend_schedule="linear",    # or "exp", "ema", "latest"
    execution_horizon=10,
)

# RTC control loop
output = server.predict(image=frame, instruction=task, proprioception=joints, rtc=True)
action = server.get_action()  # delay-aware, blended

Legacy EMA mode: Simple exponential moving average (Neon v1 behavior):

output = server.predict(..., smooth=True, rtc=False)
# α = 0.7 by default
# new_actions = 0.7 * predicted + 0.3 * previous_actions
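
A minimal sketch of that smoothing step, assuming chunks are NumPy arrays and using the documented default α = 0.7 (illustrative, not the server's internals):

import numpy as np

def ema_smooth(predicted, previous, alpha=0.7):
    """Blend the newly predicted chunk with the previous one (both (T, action_dim) arrays)."""
    if previous is None:
        return predicted                      # first chunk of an episode: nothing to blend against
    return alpha * predicted + (1 - alpha) * previous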

HTTP Server

Run inference as an HTTP endpoint for remote control:

python -m neon.inference.server --model cagataydev/neon-g1-v1 --port 8300 --blend linear

| Method | Path | What |
| ------ | ---- | ---- |
| POST | /predict | Predict actions from image + instruction (queues with RTC) |
| POST | /action | Pop next action from the RTC queue |
| POST | /reset | Clear queue and smoothing state (new episode) |
| GET | /health | Server status, model info, queue state |
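
A minimal client sketch against these endpoints, using the request fields documented in the HTTP API Reference below; the exact response shapes are assumptions:

import base64
import numpy as np
import requests

SERVER = "http://localhost:8300"

frame_jpeg = open("frame.jpg", "rb").read()      # any JPEG-encoded camera frame
joint_states = np.zeros(17, dtype=np.float32)    # placeholder proprioception

payload = {
    "image_base64": base64.b64encode(frame_jpeg).decode("ascii"),
    "instruction": "Pick up the red cup",
    "proprioception": joint_states.tolist(),
    "rtc": True,
}

# Queue a prediction, then pop the next delay-aware action from the RTC queue.
requests.post(f"{SERVER}/predict", json=payload, timeout=5.0)
action = requests.post(f"{SERVER}/action", timeout=1.0).json()

# New episode: clear the queue and smoothing state.
requests.post(f"{SERVER}/reset", timeout=1.0)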

G1 Hardware Controller

The bridge between neural network predictions and a 35kg humanoid:

from neon.inference.g1_controller import G1Controller, G1Config

config = G1Config(
    robot_ip="192.168.123.10",
    control_frequency=50.0,       # 50 Hz — standard for humanoid control
    mode="arms_only",
    max_joint_velocity=2.0,       # Safety: rad/s
    max_linear_velocity=0.5,      # Safety: m/s
    max_angular_velocity=0.5,     # Safety: rad/s
    enable_safety_checks=True,
)

controller = G1Controller(config)
controller.connect()

Safety Limits

Every action passes through safety checks before reaching the robot:

  1. Joint limits — Clipped to each joint's URDF range
  2. Velocity limits — Maximum rad/s to prevent violent motions
  3. Locomotion limits — Maximum m/s and rad/s for walking
  4. Emergency stop — controller.disconnect() sends zero velocity immediately

Never disable safety checks

The G1 is 1.2 meters tall and weighs 35 kilograms. Uncontrolled joint movements can damage the robot or harm people. enable_safety_checks=True is the only acceptable production setting.
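
For intuition, a minimal sketch of what the joint and velocity checks amount to; the limit values, shapes, and function name are illustrative, not the G1Controller internals:

import numpy as np

def apply_safety_limits(target, current, joint_low, joint_high, max_vel=2.0, dt=0.02):
    """Clip a target joint command to position and velocity limits.

    target/current: joint positions in rad; max_vel in rad/s; dt = 20 ms at 50 Hz.
    The real controller additionally limits locomotion velocities (m/s, rad/s).
    """
    target = np.clip(target, joint_low, joint_high)              # 1. joint limits (URDF range)
    max_step = max_vel * dt                                      # 2. velocity limit per control step
    return current + np.clip(target - current, -max_step, max_step)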


Closed-Loop Control

The full pipeline: observe → predict → act → repeat at 50 Hz:

controller.run_control_loop(
    model=server,
    instruction="Pick up the red cup",
    max_steps=200,                # ~4 seconds at 50 Hz
    action_chunk_index=0,         # Execute first step of each predicted chunk
)

The Loop

sequenceDiagram
    participant Cam as Camera
    participant Ctrl as Controller
    participant Model as Neon Model
    participant Robot as G1 Robot

    loop Every 20ms (50 Hz)
        Cam->>Ctrl: Camera frame
        Ctrl->>Ctrl: Read joint states
        Ctrl->>Model: predict(image, instruction, joints)
        Model->>Ctrl: Action chunk (16 steps)
        Ctrl->>Ctrl: Apply safety limits
        Ctrl->>Robot: Send action[0]
    end

Real-World Timing

  • Model inference: ~50ms on Jetson Orin (3B, 4-bit)
  • Camera capture: ~5ms
  • Communication: ~2ms
  • Total: ~57ms → actual frequency ~17 Hz

With action chunking, the robot has 16 predicted steps to execute while the next prediction computes. Inference latency and control frequency are decoupled. The robot moves smoothly at 50 Hz even though the model runs at 17 Hz.

With Real-Time Chunking, the server automatically skips the actions that went stale during inference (~57ms is about 3 control steps at 50 Hz) and blends the overlap with the previous chunk, eliminating the jerk at chunk boundaries.
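
A simplified sketch of this decoupling: a 50 Hz loop pops one action per tick from a queue while a background thread refills it. predict_chunk and send_to_robot are hypothetical stand-ins for the server and controller calls, and the refill threshold is illustrative:

import threading
import time
from collections import deque

action_queue = deque()
lock = threading.Lock()

def prediction_worker():
    """Background inference at ~17 Hz; each call returns a 16-step chunk."""
    while True:
        with lock:
            need_more = len(action_queue) < 8        # illustrative refill threshold
        if need_more:
            chunk = predict_chunk()                  # hypothetical: wraps server.predict()
            with lock:
                action_queue.extend(chunk)
        else:
            time.sleep(0.005)

threading.Thread(target=prediction_worker, daemon=True).start()

while True:                                          # 50 Hz control loop
    with lock:
        action = action_queue.popleft() if action_queue else None
    if action is not None:
        send_to_robot(action)                        # hypothetical controller call
    time.sleep(0.02)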


Deployment Targets

| Device | GPU | Backbone | Latency | Role |
| ------ | --- | -------- | ------- | ---- |
| Jetson Orin (Thor) | 32GB unified | 3B, 4-bit | ~50ms | On-robot, always-on |
| RTX 4090 | 24GB | 7B, 4-bit | ~30ms | Desktop workstation |
| A100 (EC2) | 40GB | 7B, 4-bit | ~20ms | Cloud, lowest latency |
| MacBook M3 | MPS | 3B, 4-bit | ~200ms | Development and testing only |

Run the model on the robot

For lowest latency, run the inference server directly on the Jetson Orin mounted on the G1. The controller talks to localhost. No network hop. No serialization overhead.


strands-robots Policy Integration

Neon ships as a first-class strands-robots policy. On pip install neon-vla, it auto-registers:

from strands_robots.policies import create_policy

# Auto-discovered — no extra configuration
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,   # (H, W, 3) uint8
    "observation.state": joint_positions,        # (17,) float32
    "observation.audio": voice_waveform,         # (16000,) float32
    "observation.lidar": point_cloud,            # (4096, 4) float32
    "observation.eef_state": ee_state,           # (14,) float32
}
actions = policy.get_actions_sync(obs, "pick up the red cup")

The NeonPolicy bridges VLA inference frequency (~5-10 Hz) to robot control frequency (50 Hz) via an RTC action queue with three blend schedules:

| Schedule | How | Best For |
| -------- | --- | -------- |
| linear | Linearly interpolate from old → new chunk | Smooth motion (default) |
| step | Hard switch at midpoint | Fast response |
| exponential | Exponential decay blend | Soft transitions |
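
A sketch of how the overlapping steps of an old and a new chunk could be blended under each schedule; the weight curves and overlap handling are illustrative, not the NeonPolicy implementation:

import numpy as np

def blend_chunks(old, new, schedule="linear"):
    """Blend two (T, action_dim) chunks covering the same T overlap steps."""
    T = len(new)
    t = np.arange(T)
    if schedule == "linear":
        w = t / max(T - 1, 1)                    # weight on `new`: 0 -> 1 linearly
    elif schedule == "step":
        w = (t >= T // 2).astype(float)          # hard switch at the midpoint
    elif schedule == "exponential":
        w = 1.0 - np.exp(-4.0 * t / T)           # weight on `old` decays exponentially
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return (1.0 - w[:, None]) * old + w[:, None] * new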

Auto-discover server capabilities:

health = await policy.health_check()
print(health["modalities"])
# {"camera": true, "audio": true, "lidar": false, "eef_state": false, "speech_output": true}

HTTP API Reference

The inference server accepts all 6 modalities via /predict:

| Field | Type | Description |
| ----- | ---- | ----------- |
| image_base64 | string | Base64-encoded PNG/JPEG |
| video_frames_base64 | string[] | List of base64-encoded frames |
| instruction | string | Natural language command |
| proprioception | number[] | Joint positions |
| audio | number[] | 16kHz float waveform |
| audio_base64 | string | Base64 PCM-16 little-endian |
| lidar | number[][] | Point cloud (N×4: x, y, z, intensity) |
| eef_state | number[] | Bimanual EE pos+quat (14-DOF) |
| speak | boolean | Trigger PersonaPlex speech response |
| rtc | boolean | Use RTC action queue (default: true) |

The /health endpoint reports which modalities the loaded model supports.
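
For example, a sketch of packing the audio field as the table specifies: 16kHz float samples converted to PCM-16 little-endian and base64-encoded (the waveform here is a placeholder):

import base64
import numpy as np

waveform = np.zeros(16000, dtype=np.float32)              # 1 s of 16kHz audio, placeholder
pcm16 = (np.clip(waveform, -1.0, 1.0) * 32767).astype("<i2")

payload = {
    "instruction": "what do you see?",
    "audio_base64": base64.b64encode(pcm16.tobytes()).decode("ascii"),
    "speak": True,                                        # request a PersonaPlex speech response
}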


Next: Strands Agent Tool — when an LLM meets a robot body