# Inference & Hardware Control

From prediction to physical movement: the inference server, the G1 controller, and the closed loop that ties them together.
## Inference Server
Load a trained model, predict actions, serve them fast:
```python
from neon.inference.server import NeonInferenceServer

server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",  # HuggingFace ID or local path
    device="auto",                       # auto: CUDA > MPS > CPU
)

output = server.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
    smooth=True,  # Temporal EMA smoothing
)
```
### Temporal Smoothing
The server supports two modes for handling chunk transitions:
**RTC mode** (default, recommended): a delay-aware action queue with prefix blending. See Real-Time Chunking for the full explanation.
```python
server = NeonInferenceServer(
    model_path="cagataydev/neon-g1-v1",
    blend_schedule="linear",  # or "exp", "ema", "latest"
    execution_horizon=10,
)

# RTC control loop
output = server.predict(image=frame, instruction=task, proprioception=joints, rtc=True)
action = server.get_action()  # delay-aware, blended
```
**Legacy EMA mode**: a simple exponential moving average (Neon v1 behavior):
```python
output = server.predict(..., smooth=True, rtc=False)
# α = 0.7 by default:
# new_actions = 0.7 * predicted + 0.3 * previous_actions
```
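What the legacy mode computes, as a standalone sketch (the server applies this internally; the function here is ours):

```python
import numpy as np

def ema_smooth(predicted: np.ndarray, previous: np.ndarray | None,
               alpha: float = 0.7) -> np.ndarray:
    """Exponential moving average over successive action chunks (Neon v1 behavior)."""
    if previous is None:
        return predicted  # first chunk of an episode: nothing to blend with
    return alpha * predicted + (1.0 - alpha) * previous
```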
### HTTP Server
Run inference as an HTTP endpoint for remote control:
| Method | Path | What |
|---|---|---|
| POST | `/predict` | Predict actions from image + instruction (queues with RTC) |
| POST | `/action` | Pop the next action from the RTC queue |
| POST | `/reset` | Clear queue and smoothing state (new episode) |
| GET | `/health` | Server status, model info, queue state |
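A minimal client sketch against these endpoints, assuming the server listens on port 8300 (the port used in the strands-robots example below) and the `/predict` JSON fields from the HTTP API Reference:

```python
import base64
import requests

BASE = "http://192.168.123.10:8300"  # assumed host and port; adjust to your deployment

with open("frame.png", "rb") as f:
    payload = {
        "image_base64": base64.b64encode(f.read()).decode(),
        "instruction": "Pick up the red cup",
        "proprioception": [0.0] * 17,  # placeholder joint positions
        "rtc": True,
    }

requests.post(f"{BASE}/predict", json=payload).raise_for_status()  # queue a new chunk
action = requests.post(f"{BASE}/action").json()                    # pop the next blended action
print(requests.get(f"{BASE}/health").json())                       # status, model info, queue state
```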
## G1 Hardware Controller

The bridge between neural network predictions and a 35 kg humanoid:
```python
from neon.inference.g1_controller import G1Controller, G1Config

config = G1Config(
    robot_ip="192.168.123.10",
    control_frequency=50.0,   # 50 Hz — standard for humanoid control
    mode="arms_only",
    max_joint_velocity=2.0,   # Safety: rad/s
    max_linear_velocity=0.5,  # Safety: m/s
    max_angular_velocity=0.5, # Safety: rad/s
    enable_safety_checks=True,
)

controller = G1Controller(config)
controller.connect()
```
### Safety Limits
Every action passes through safety checks before reaching the robot:
- **Joint limits** — clipped to each joint's URDF range
- **Velocity limits** — maximum rad/s to prevent violent motions
- **Locomotion limits** — maximum m/s and rad/s for walking
- **Emergency stop** — `controller.disconnect()` sends zero velocity immediately
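Conceptually, the position and velocity checks reduce to clipping. A sketch, with the arguments standing in for the URDF ranges and `G1Config` limits the controller actually uses:

```python
import numpy as np

def apply_safety_limits(target, current, joint_low, joint_high,
                        max_joint_velocity=2.0, dt=0.02):
    """Clamp a joint command to position (URDF) and velocity limits."""
    target = np.clip(target, joint_low, joint_high)  # joint limits
    max_step = max_joint_velocity * dt               # max rad per 20 ms tick
    step = np.clip(target - current, -max_step, max_step)
    return current + step                            # velocity-limited command
```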
> **Never disable safety checks**
>
> The G1 is 1.2 meters tall and weighs 35 kilograms. Uncontrolled joint movements can damage the robot or harm people. `enable_safety_checks=True` is the only acceptable production setting.
## Closed-Loop Control
The full pipeline: observe → predict → act → repeat at 50 Hz:
```python
controller.run_control_loop(
    model=server,
    instruction="Pick up the red cup",
    max_steps=200,         # ~4 seconds at 50 Hz
    action_chunk_index=0,  # Execute the first step of each predicted chunk
)
```
### The Loop
```mermaid
sequenceDiagram
    participant Cam as Camera
    participant Ctrl as Controller
    participant Model as Neon Model
    participant Robot as G1 Robot
    loop Every 20ms (50 Hz)
        Cam->>Ctrl: Camera frame
        Ctrl->>Ctrl: Read joint states
        Ctrl->>Model: predict(image, instruction, joints)
        Model->>Ctrl: Action chunk (16 steps)
        Ctrl->>Ctrl: Apply safety limits
        Ctrl->>Robot: Send action[0]
    end
```
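For reference, a hand-rolled equivalent of one pass through this diagram. The `camera` object and the controller accessors (`get_joint_states`, `send_action`) are illustrative names, not confirmed `G1Controller` API:

```python
import time

DT = 1.0 / 50.0  # 20 ms control period

for step in range(200):
    t0 = time.monotonic()
    frame = camera.read()                   # camera frame (illustrative camera object)
    joints = controller.get_joint_states()  # illustrative accessor
    server.predict(image=frame, instruction="Pick up the red cup",
                   proprioception=joints, rtc=True)
    action = server.get_action()            # delay-aware, blended
    controller.send_action(action)          # safety limits applied before the robot
    time.sleep(max(0.0, DT - (time.monotonic() - t0)))  # hold the 50 Hz tick
```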
### Real-World Timing
- **Model inference**: ~50 ms on Jetson Orin (3B, 4-bit)
- **Camera capture**: ~5 ms
- **Communication**: ~2 ms
- **Total**: ~57 ms → actual prediction frequency ~17 Hz
With action chunking, the robot has 16 predicted steps to execute while the next prediction computes. Inference latency and control frequency are decoupled. The robot moves smoothly at 50 Hz even though the model runs at 17 Hz.
With Real-Time Chunking, the server automatically skips the ~3 actions that went stale while the new chunk was computing and blends the overlap with the previous chunk, eliminating the jerk at chunk boundaries.
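To make the decoupling concrete, here is a minimal sketch of the pattern: a background thread refills an action queue while the control loop drains it at 50 Hz. `model_predict`, `latest_observation`, and `send_to_robot` are placeholders, and the prefix blending that RTC performs is omitted for brevity:

```python
import threading
import time
from collections import deque

queue: deque = deque()
lock = threading.Lock()

def inference_worker():
    while True:
        chunk = model_predict(latest_observation())  # ~57 ms per call (placeholder)
        stale = 3  # ~3 control ticks elapse during inference at 50 Hz
        with lock:
            queue.clear()
            queue.extend(chunk[stale:])  # keep only the still-valid steps

threading.Thread(target=inference_worker, daemon=True).start()

while True:  # 50 Hz control loop
    with lock:
        action = queue.popleft() if queue else None
    if action is not None:
        send_to_robot(action)  # placeholder
    time.sleep(0.02)
```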
## Deployment Targets
| Device | GPU | Backbone | Latency | Role |
|---|---|---|---|---|
| Jetson Orin (Thor) | 32GB unified | 3B, 4-bit | ~50ms | On-robot, always-on |
| RTX 4090 | 24GB | 7B, 4-bit | ~30ms | Desktop workstation |
| A100 (EC2) | 40GB | 7B, 4-bit | ~20ms | Cloud, lowest latency |
| MacBook M3 | MPS | 3B, 4-bit | ~200ms | Development and testing only |
> **Run the model on the robot**
>
> For the lowest latency, run the inference server directly on the Jetson Orin mounted on the G1. The controller talks to localhost: no network hop, no serialization overhead.
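In that setup the client side simply targets loopback; for example, with the strands-robots policy from the next section (port 8300 is an assumption):

```python
from strands_robots.policies import create_policy

# Same call as below, but the inference server runs on the robot itself
policy = create_policy("neon", host="127.0.0.1", port=8300)
```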
## strands-robots Policy Integration

Neon ships as a first-class strands-robots policy. On `pip install neon-vla`, it auto-registers:
```python
from strands_robots.policies import create_policy

# Auto-discovered — no extra configuration
policy = create_policy("neon", host="192.168.123.10", port=8300)

# Full omni-modal observation
obs = {
    "observation.images.front": camera_frame,  # (H, W, 3) uint8
    "observation.state": joint_positions,      # (17,) float32
    "observation.audio": voice_waveform,       # (16000,) float32
    "observation.lidar": point_cloud,          # (4096, 4) float32
    "observation.eef_state": ee_state,         # (14,) float32
}

actions = policy.get_actions_sync(obs, "pick up the red cup")
```
The `NeonPolicy` bridges the VLA inference frequency (~5-10 Hz) to the robot control frequency (50 Hz) via an RTC action queue with three blend schedules:
| Schedule | How | Best for |
|---|---|---|
| `linear` | Linearly interpolate from old → new chunk | Smooth motion (default) |
| `step` | Hard switch at midpoint | Fast response |
| `exponential` | Exponential decay blend | Soft transitions |
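For instance, the linear schedule amounts to a weighted crossfade over the overlap between the old and new chunks. A sketch, assuming chunks are NumPy arrays of shape `(horizon, action_dim)`:

```python
import numpy as np

def blend_linear(old_tail: np.ndarray, new_head: np.ndarray) -> np.ndarray:
    """Crossfade from the old chunk to the new one over their overlap."""
    n = len(old_tail)
    w = np.linspace(0.0, 1.0, n)[:, None]  # weight 0 → all old, 1 → all new
    return (1.0 - w) * old_tail + w * new_head
```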
Auto-discover server capabilities:
```python
health = await policy.health_check()
print(health["modalities"])
# {"camera": true, "audio": true, "lidar": false, "eef_state": false, "speech_output": true}
```
## HTTP API Reference

The inference server accepts all six modalities via `/predict`:
| Field | Type | Description |
|---|---|---|
| `image_base64` | string | Base64-encoded PNG/JPEG |
| `video_frames_base64` | string[] | List of base64-encoded frames |
| `instruction` | string | Natural language command |
| `proprioception` | number[] | Joint positions |
| `audio` | number[] | 16 kHz float waveform |
| `audio_base64` | string | Base64 PCM-16 little-endian |
| `lidar` | number[][] | Point cloud (N×4: x, y, z, intensity) |
| `eef_state` | number[] | Bimanual EE position + quaternion (14-DoF) |
| `speak` | boolean | Trigger a PersonaPlex speech response |
| `rtc` | boolean | Use the RTC action queue (default: `true`) |
The `/health` endpoint reports which modalities the loaded model supports.
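As an illustration, building a multimodal payload with the binary fields encoded. The `encode_audio_pcm16` helper is ours, assuming a float32 waveform in [-1, 1] at 16 kHz:

```python
import base64
import numpy as np

def encode_audio_pcm16(waveform: np.ndarray) -> str:
    """float32 waveform in [-1, 1] → base64 PCM-16 little-endian."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype("<i2")
    return base64.b64encode(pcm.tobytes()).decode()

payload = {
    "image_base64": base64.b64encode(open("frame.png", "rb").read()).decode(),
    "instruction": "pick up the red cup",
    "proprioception": [0.0] * 17,                 # placeholder joint positions
    "audio_base64": encode_audio_pcm16(np.zeros(16000, dtype=np.float32)),
    "lidar": [[0.0, 0.0, 0.0, 0.0]],              # N×4: x, y, z, intensity
    "speak": True,
    "rtc": True,
}
```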
→ Next: Strands Agent Tool — when an LLM meets a robot body