# Inference & Hardware Control

From prediction to physical movement: the inference server, the G1 controller, and the closed loop that ties them together.
## Inference Server
Load a trained model, predict actions, serve them fast:
```python
from neon.inference.server import NeonInferenceServer

server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",  # HuggingFace ID or local path
    device="auto",                       # auto: CUDA > MPS > CPU
)

output = server.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
    smooth=True,  # Temporal EMA smoothing
)
```
### Temporal Smoothing
The server supports two modes for handling chunk transitions:
**RTC mode** (default, recommended): a delay-aware action queue with prefix blending. See Real-Time Chunking for the full explanation.
```python
server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",
    blend_schedule="linear",  # or "exp", "ema", "latest"
    execution_horizon=10,
)

# RTC control loop
output = server.predict(image=frame, instruction=task, proprioception=joints, rtc=True)
action = server.get_action()  # delay-aware, blended
```
**Legacy EMA mode**: a simple exponential moving average (Neon v1 behavior):
```python
output = server.predict(..., smooth=True, rtc=False)
# α = 0.7 by default:
# new_actions = 0.7 * predicted + 0.3 * previous_actions
```
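What the legacy mode computes, as a standalone sketch (the server applies this internally; the function here is ours):

```python
import numpy as np

def ema_smooth(predicted: np.ndarray, previous: np.ndarray | None,
               alpha: float = 0.7) -> np.ndarray:
    """Exponential moving average over successive action chunks (Neon v1 behavior)."""
    if previous is None:
        return predicted  # first chunk of an episode: nothing to blend with
    return alpha * predicted + (1.0 - alpha) * previous
```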
### HTTP Server
Run inference as an HTTP endpoint for remote control:
| Method | Path | What |
|---|---|---|
| POST | `/predict` | Predict actions from image + instruction (queues with RTC) |
| POST | `/action` | Pop the next action from the RTC queue |
| POST | `/reset` | Clear queue and smoothing state (new episode) |
| GET | `/health` | Server status, model info, queue state |
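A minimal client sketch against these endpoints, assuming the server listens on port 8300 (the port used in the strands-robots example below) and the `/predict` JSON fields from the HTTP API Reference:

```python
import base64
import requests

BASE = "http://192.168.123.10:8300"  # assumed host and port; adjust to your deployment

with open("frame.png", "rb") as f:
    payload = {
        "image_base64": base64.b64encode(f.read()).decode(),
        "instruction": "Pick up the red cup",
        "proprioception": [0.0] * 17,  # placeholder joint positions
        "rtc": True,
    }

requests.post(f"{BASE}/predict", json=payload).raise_for_status()  # queue a new chunk
action = requests.post(f"{BASE}/action").json()                    # pop the next blended action
print(requests.get(f"{BASE}/health").json())                       # status, model info, queue state
```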
## G1 Hardware Controller

The bridge between neural network predictions and a 35 kg humanoid:
```python
from neon.inference.g1_controller import G1Controller, G1Config

config = G1Config(
    robot_ip="192.168.123.10",
    control_frequency=50.0,   # 50 Hz — standard for humanoid control
    mode="arms_only",
    max_joint_velocity=2.0,   # Safety: rad/s
    max_linear_velocity=0.5,  # Safety: m/s
    max_angular_velocity=0.5, # Safety: rad/s
    enable_safety_checks=True,
)

controller = G1Controller(config)
controller.connect()
```
### Safety Limits
Every action passes through safety checks before reaching the robot:
- **Joint limits** — clipped to each joint's URDF range
- **Velocity limits** — maximum rad/s to prevent violent motions
- **Locomotion limits** — maximum m/s and rad/s for walking
- **Emergency stop** — `controller.disconnect()` sends zero velocity immediately
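Conceptually, the position and velocity checks reduce to clipping. A sketch, with the arguments standing in for the URDF ranges and `G1Config` limits the controller actually uses:

```python
import numpy as np

def apply_safety_limits(target, current, joint_low, joint_high,
                        max_joint_velocity=2.0, dt=0.02):
    """Clamp a joint command to position (URDF) and velocity limits."""
    target = np.clip(target, joint_low, joint_high)  # joint limits
    max_step = max_joint_velocity * dt               # max rad per 20 ms tick
    step = np.clip(target - current, -max_step, max_step)
    return current + step                            # velocity-limited command
```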
> **Never disable safety checks**
>
> The G1 is 1.2 meters tall and weighs 35 kilograms. Uncontrolled joint movements can damage the robot or harm people. `enable_safety_checks=True` is the only acceptable production setting.
## Closed-Loop Control
The full pipeline: observe → predict → act → repeat at 50 Hz:
```python
controller.run_control_loop(
    model=server,
    instruction="Pick up the red cup",
    max_steps=200,         # ~4 seconds at 50 Hz
    action_chunk_index=0,  # Execute the first step of each predicted chunk
)
```
### The Loop
```mermaid
sequenceDiagram
    participant Cam as Camera
    participant Ctrl as Controller
    participant Model as Neon Model
    participant Robot as G1 Robot
    loop Every 20ms (50 Hz)
        Cam->>Ctrl: Camera frame
        Ctrl->>Ctrl: Read joint states
        Ctrl->>Model: predict(image, instruction, joints)
        Model->>Ctrl: Action chunk (16 steps)
        Ctrl->>Ctrl: Apply safety limits
        Ctrl->>Robot: Send action[0]
    end
```
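For reference, a hand-rolled equivalent of one pass through this diagram. The `camera` object and the controller accessors (`get_joint_states`, `send_action`) are illustrative names, not confirmed `G1Controller` API:

```python
import time

DT = 1.0 / 50.0  # 20 ms control period

for step in range(200):
    t0 = time.monotonic()
    frame = camera.read()                   # camera frame (illustrative camera object)
    joints = controller.get_joint_states()  # illustrative accessor
    server.predict(image=frame, instruction="Pick up the red cup",
                   proprioception=joints, rtc=True)
    action = server.get_action()            # delay-aware, blended
    controller.send_action(action)          # safety limits applied before the robot
    time.sleep(max(0.0, DT - (time.monotonic() - t0)))  # hold the 50 Hz tick
```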
### Real-World Timing
- **Model inference**: ~50 ms on Jetson Orin (3B, 4-bit)
- **Camera capture**: ~5 ms
- **Communication**: ~2 ms
- **Total**: ~57 ms → actual prediction frequency ~17 Hz
With action chunking, the robot has 16 predicted steps to execute while the next prediction computes. Inference latency and control frequency are decoupled. The robot moves smoothly at 50 Hz even though the model runs at 17 Hz.
With Real-Time Chunking, the server automatically skips the ~3 actions that went stale while the new chunk was computing and blends the overlap with the previous chunk, eliminating the jerk at chunk boundaries.
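To make the decoupling concrete, here is a minimal sketch of the pattern: a background thread refills an action queue while the control loop drains it at 50 Hz. `model_predict`, `latest_observation`, and `send_to_robot` are placeholders, and the prefix blending that RTC performs is omitted for brevity:

```python
import threading
import time
from collections import deque

queue: deque = deque()
lock = threading.Lock()

def inference_worker():
    while True:
        chunk = model_predict(latest_observation())  # ~57 ms per call (placeholder)
        stale = 3  # ~3 control ticks elapse during inference at 50 Hz
        with lock:
            queue.clear()
            queue.extend(chunk[stale:])  # keep only the still-valid steps

threading.Thread(target=inference_worker, daemon=True).start()

while True:  # 50 Hz control loop
    with lock:
        action = queue.popleft() if queue else None
    if action is not None:
        send_to_robot(action)  # placeholder
    time.sleep(0.02)
```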
## Deployment Targets
| Device | GPU | Backbone | Latency | Role |
|---|---|---|---|---|
| Jetson Orin (Thor) | 32GB unified | 3B, 4-bit | ~50ms | On-robot, always-on |
| RTX 4090 | 24GB | 7B, 4-bit | ~30ms | Desktop workstation |
| A100 (EC2) | 40GB | 7B, 4-bit | ~20ms | Cloud, lowest latency |
| MacBook M3 | MPS | 3B, 4-bit | ~200ms | Development and testing only |
> **Run the model on the robot**
>
> For the lowest latency, run the inference server directly on the Jetson Orin mounted on the G1. The controller talks to localhost: no network hop, no serialization overhead.
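In that setup the client side simply targets loopback; for example, with the strands-robots policy from the next section (port 8300 is an assumption):

```python
from strands_robots.policies import create_policy

# Same call as below, but the inference server runs on the robot itself
policy = create_policy("neon", host="127.0.0.1", port=8300)
```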
## strands-robots Policy Integration

Neon ships as a first-class strands-robots policy. On `pip install neon-vla`, it auto-registers:
```python
from strands_robots.policies import create_policy

# Auto-discovered — no extra configuration
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,  # (H, W, 3) uint8
    "observation.state": joint_positions,      # (17,) float32
    "observation.audio": voice_waveform,       # (16000,) float32
    "observation.lidar": point_cloud,          # (4096, 4) float32
    "observation.eef_state": ee_state,         # (14,) float32
}

actions = policy.get_actions_sync(obs, "pick up the red cup")
```
The `NeonPolicy` bridges the VLA inference frequency (~5-10 Hz) to the robot control frequency (50 Hz) via an RTC action queue with three blend schedules:
| Schedule | How | Best for |
|---|---|---|
| `linear` | Linearly interpolate from old → new chunk | Smooth motion (default) |
| `step` | Hard switch at midpoint | Fast response |
| `exponential` | Exponential decay blend | Soft transitions |
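For instance, the linear schedule amounts to a weighted crossfade over the overlap between the old and new chunks. A sketch, assuming chunks are NumPy arrays of shape `(horizon, action_dim)`:

```python
import numpy as np

def blend_linear(old_tail: np.ndarray, new_head: np.ndarray) -> np.ndarray:
    """Crossfade from the old chunk to the new one over their overlap."""
    n = len(old_tail)
    w = np.linspace(0.0, 1.0, n)[:, None]  # weight 0 → all old, 1 → all new
    return (1.0 - w) * old_tail + w * new_head
```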
Auto-discover server capabilities:
```python
health = await policy.health_check()
print(health["modalities"])
# {"camera": true, "audio": true, "lidar": false, "eef_state": false, "speech_output": true}
```
## HTTP API Reference

The inference server accepts all six modalities via `/predict`:
| Field | Type | Description |
|---|---|---|
| `image_base64` | string | Base64-encoded PNG/JPEG |
| `video_frames_base64` | string[] | List of base64-encoded frames |
| `instruction` | string | Natural language command |
| `proprioception` | number[] | Joint positions |
| `audio` | number[] | 16 kHz float waveform |
| `audio_base64` | string | Base64 PCM-16 little-endian |
| `lidar` | number[][] | Point cloud (N×4: x, y, z, intensity) |
| `eef_state` | number[] | Bimanual EE position + quaternion (14-DoF) |
| `speak` | boolean | Trigger a PersonaPlex speech response |
| `rtc` | boolean | Use the RTC action queue (default: `true`) |
The `/health` endpoint reports which modalities the loaded model supports.
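As an illustration, building a multimodal payload with the binary fields encoded. The `encode_audio_pcm16` helper is ours, assuming a float32 waveform in [-1, 1] at 16 kHz:

```python
import base64
import numpy as np

def encode_audio_pcm16(waveform: np.ndarray) -> str:
    """float32 waveform in [-1, 1] → base64 PCM-16 little-endian."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype("<i2")
    return base64.b64encode(pcm.tobytes()).decode()

payload = {
    "image_base64": base64.b64encode(open("frame.png", "rb").read()).decode(),
    "instruction": "pick up the red cup",
    "proprioception": [0.0] * 17,                 # placeholder joint positions
    "audio_base64": encode_audio_pcm16(np.zeros(16000, dtype=np.float32)),
    "lidar": [[0.0, 0.0, 0.0, 0.0]],              # N×4: x, y, z, intensity
    "speak": True,
    "rtc": True,
}
```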
→ Next: Strands Agent Tool — when an LLM meets a robot body