Quickstart¶
From zero to a robot that predicts its own future — in five steps.
1. Create a Model¶
from neon.model.neon_vla import NeonConfig, NeonVLA
from neon.model.video_backbone import BackboneConfig
config = NeonConfig(
    backbone=BackboneConfig(
        model_id="Qwen/Qwen2.5-VL-3B-Instruct",  # 3B for fast iteration
        load_in_4bit=True,       # Quantize — less VRAM, minimal quality loss
        freeze_backbone=True,    # Only train the action heads
    ),
    control_mode="arms_only",    # 14 DoF arms + grippers
    include_locomotion=True,     # + 3 DoF velocity (vx, vy, ω)
    num_action_steps=16,         # Predict 16 future timesteps
    use_action_chunking=True,    # Smooth temporal predictions
    use_separate_heads=True,     # One MLP per joint group
)
model = NeonVLA(config)
print(model)
# NeonVLA(
#   backbone=Qwen/Qwen2.5-VL-3B-Instruct,
#   action_space=G1ActionSpace(mode=arms_only, joints=14, action_dim=17),
#   action_chunking=True (steps=16),
#   trainable=2,345,678 / 3,500,000,000 params
# )
The backbone loads lazily
The 3B backbone isn't downloaded until you call model.load_backbone(). Model creation is instant — design your config, iterate fast, load weights when you're ready.
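Because construction is cheap, you can sweep configs before committing to a download. A minimal sketch, assuming NeonConfig's remaining fields have usable defaults and that NeonVLA is a standard torch.nn.Module whose parameters can be counted:

from neon.model.neon_vla import NeonConfig, NeonVLA
from neon.model.video_backbone import BackboneConfig

for steps in (8, 16, 32):
    cfg = NeonConfig(
        backbone=BackboneConfig(
            model_id="Qwen/Qwen2.5-VL-3B-Instruct",
            freeze_backbone=True,
        ),
        num_action_steps=steps,
    )
    m = NeonVLA(cfg)  # instant; no weights are fetched yet
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"{steps} action steps -> {trainable:,} trainable params")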
2. Load the Backbone and Predict¶
from PIL import Image
import numpy as np
# Load the video backbone (downloads from HuggingFace on first call)
model.load_backbone()
# Simulate a camera frame and current joint states
camera_frame = Image.new("RGB", (640, 480), color="gray")
joint_states = model.action_space.default_position()  # (17,) — standing pose

# Zero-filled stand-ins for the optional modalities (use real sensor data on hardware)
voice_waveform = np.zeros(16000, dtype=np.float32)   # 1 s of 16 kHz audio
point_cloud = np.zeros((1024, 4), dtype=np.float32)  # (N, 4) LiDAR points
ee_positions = np.zeros(14, dtype=np.float32)        # bimanual EE pos + quat

# Predict the next 16 timesteps
output = model.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
    audio=voice_waveform,    # Optional: 16kHz spoken command
    lidar=point_cloud,       # Optional: (N, 4) LiDAR points
    eef_state=ee_positions,  # Optional: (14,) bimanual EE pos+quat
    speak=True,              # Optional: robot speaks back
)
print(output.actions.shape) # (16, 17) — 16 future steps × 17 DoF
print(output.raw_actions.shape) # (16, 17) — denormalized to real joint angles
print(output.locomotion.shape)   # (16, 3) — vx, vy, omega per step
print(output.speech_path) # "/tmp/neon_speech_xyz.wav" (if speak=True)
That's the core loop. One observation (any combination of the 6 modalities), one instruction → 16 timesteps of the robot's future, predicted in a single forward pass.
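The controller in step 4 wraps this in a receding-horizon loop: predict a chunk, execute part of it, re-observe, re-predict. A hand-rolled equivalent, where get_camera_frame, get_joint_states, and send_joint_command are hypothetical stand-ins for your own sensor and actuator I/O:

EXECUTE_STEPS = 8  # execute half of each 16-step chunk, then re-plan

for _ in range(25):  # 25 chunks x 8 steps = 200 control steps
    frame = get_camera_frame()   # hypothetical: read the head camera
    joints = get_joint_states()  # hypothetical: read joint encoders
    output = model.predict(
        image=frame,
        instruction="Pick up the red cup",
        proprioception=joints,
    )
    for action in output.raw_actions[:EXECUTE_STEPS]:
        send_joint_command(action)  # hypothetical: one command per 50 Hz tick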
3. Understand the Output¶
# Normalized actions live in [-1, 1]
print(f"Action range: [{output.actions.min():.2f}, {output.actions.max():.2f}]")
# raw_actions are in physical units (radians for joints, m/s for locomotion)
print(f"First step joint angles: {output.raw_actions[0, :5]}")
# Split by joint group for inspection
split = model.action_space.split_action(output.raw_actions)
print(f"Left arm: {split['left_arm'].shape}") # (16, 7) — shoulder + elbow + wrist + gripper
print(f"Right arm: {split['right_arm'].shape}") # (16, 7)
print(f"Locomotion: {split['locomotion'].shape}") # (16, 3) — forward, lateral, rotation
The split_action method is your window into what each joint group is doing. Each group has its own MLP decoder, its own loss, its own story.
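For instance, you can sanity-check a chunk before executing it. A small sketch that reports each group's largest per-step change, assuming split_action returns NumPy arrays and the 50 Hz control rate from the next step (for the arm groups this is an implied joint velocity in rad/s; for locomotion, an acceleration):

import numpy as np

dt = 1.0 / 50  # assumed spacing between the 16 predicted steps

for group, traj in split.items():
    rates = np.abs(np.diff(traj, axis=0)) / dt  # (15, dof) per-step change rates
    print(f"{group}: max rate {rates.max():.2f}")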
4. Run on the Real Robot¶
from neon.inference.g1_controller import G1Controller, G1Config
# Connect to the Unitree G1
controller = G1Controller(G1Config(
    robot_ip="192.168.123.10",
    mode="arms_only",
    max_joint_velocity=2.0,  # Safety: rad/s — the robot won't move faster than this
))
controller.connect()
# Closed-loop: observe → predict → act → observe → predict → act → ...
controller.run_control_loop(
    model=model,
    instruction="Pick up the red cup",
    max_steps=200,  # ~4 seconds at 50 Hz
)
controller.disconnect()
Safety first
The G1 is a 1.2 m, 35 kg humanoid. Joint limits, velocity limits, and emergency stops are enforced by default. Never disable safety checks. Start with arms_only mode, low velocity limits, and a clear workspace.
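Concretely, at the 50 Hz control rate a max_joint_velocity of 2.0 rad/s caps each joint at 2.0 / 50 = 0.04 rad of travel per tick. A minimal sketch of that clamp, purely illustrative; the real enforcement lives inside G1Controller:

import numpy as np

CONTROL_HZ = 50
MAX_JOINT_VELOCITY = 2.0                    # rad/s, matching G1Config above
MAX_STEP = MAX_JOINT_VELOCITY / CONTROL_HZ  # 0.04 rad per 20 ms tick

def clamp_step(current: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Limit how far each joint may move in one control tick."""
    delta = np.clip(target - current, -MAX_STEP, MAX_STEP)
    return current + delta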
5. Use as a Strands Agent Tool¶
This is where it gets interesting. An LLM becomes the high-level planner, and Neon becomes the hands:
from strands import Agent
from neon.tools import neon_tool
agent = Agent(tools=[neon_tool])
# The agent decomposes, plans, and executes — Neon handles the motor control
agent("Pick up the red cup and place it on the shelf")
The agent decides what to do. Neon decides how to move. System 2 meets System 1.
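If you'd rather expose a narrower surface than the bundled neon_tool, you can wrap the control loop in a tool of your own. A sketch using Strands' @tool decorator (move_robot is a name invented here), reusing the model and controller from step 4:

from strands import Agent, tool

@tool
def move_robot(instruction: str) -> str:
    """Run one short manipulation behavior on the G1."""
    controller.run_control_loop(model=model, instruction=instruction, max_steps=200)
    return f"Executed: {instruction}"

agent = Agent(tools=[move_robot])
agent("Pick up the red cup and place it on the shelf")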
What's Next?¶
You've seen the full loop. Now go deeper:
- Why Video Models — the insight that makes this work
- Training — train your own Neon on any GPU
- Action Space — every joint in the G1's body
- Audio In/Out — give the robot ears and a voice