Omni-Modal Streaming¶
Neon treats the robot as a continuous learning machine, not just a policy executor. Every sensor reading, every microphone sample, and every agent decision streams through typed channels that are simultaneously recorded for training and used for real-time inference.
The Streaming Architecture¶
┌─────────── G1 on Gantry ─────────────────────────────┐
│ │
│ Joints @500Hz ──┐ │
│ LiDAR @10Hz ──┤ │
│ Camera @30Hz ──┼──→ StreamRecorder → HuggingFace │
│ Audio @16kHz ──┤ │ │
│ Text (event)──┤ ▼ │
│ Tools (event)──┘ NeonVLA → Actions → G1 SDK │
│ │
│ ─── all channels on Zenoh mesh ─── │
└────────────────────────────────────────────────────────┘
Every channel has a schema, a ring buffer, a Zenoh publisher, and a training serializer. When the G1 is running — on gantry, doing teleop, or executing VLA inference — everything is recorded.
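As a mental model, one such channel definition might look like the sketch below. The class and field names are illustrative only, not the actual neon.streams API; they just make the schema / ring buffer / training-key pairing concrete.

```python
from dataclasses import dataclass, field
from collections import deque
from typing import Any


@dataclass
class Channel:
    """Hypothetical sketch of a typed stream channel (not the real neon.streams class)."""
    name: str                 # e.g. "joints"
    rate_hz: float | None     # None for event-driven channels (text, tool calls)
    schema: dict[str, Any]    # shape of each field in a sample
    training_key: str         # key this channel maps to in the exported dataset
    buffer: deque = field(default_factory=deque)  # ring buffer of recent samples

    def push(self, sample: Any) -> None:
        # Keep roughly one second of history for rate-driven channels.
        maxlen = int(self.rate_hz) if self.rate_hz else 256
        if len(self.buffer) >= maxlen:
            self.buffer.popleft()
        self.buffer.append(sample)
        # A real channel would also publish the sample on its Zenoh topic here.


joints = Channel(
    name="joints",
    rate_hz=500,
    schema={"position": (29,), "velocity": (29,), "torque": (29,)},
    training_key="observation.state",
)
```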
Channels¶
| Channel | Rate | Data | Training Key |
|---|---|---|---|
| Joints | 500 Hz | 29 positions + 29 velocities + 29 torques | observation.state |
| LiDAR | 10 Hz | 21,600 points × (x, y, z, intensity) | observation.lidar |
| Camera | 30 Hz | Stereo 640×480 RGB | observation.image.* |
| Audio | 16 kHz | Mono PCM-16 (microphone in + speaker out) | observation.audio |
| Text | Event | Instructions, ASR transcripts, agent thoughts | language_instruction |
| Tool Calls | Event | Agent decisions (tool name, args, result) | tool_calls |
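As a rough illustration of how these channels come together, a single exported frame might look like the dictionary below. The shapes follow the table (LiDAR shown at full scan resolution), but the exact keys and dtypes are assumptions rather than the serializer's actual output.

```python
import numpy as np

# Purely illustrative frame at the 4 fps export rate.
frame = {
    "observation.state": np.zeros(29 * 3, dtype=np.float32),        # positions + velocities + torques
    "observation.lidar": np.zeros((21600, 4), dtype=np.float32),    # (x, y, z, intensity) per point
    "observation.image.camera_head": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.audio": np.zeros(4000, dtype=np.int16),            # 0.25 s of 16 kHz PCM-16
    "language_instruction": "Pick up the red cup",
    "tool_calls": [],                                                # event channel; often empty
}
```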
Quick Start¶
from neon.streams import StreamSession

# Connect to G1 and start streaming everything
session = StreamSession.from_robot(
    robot_ip="192.168.123.10",
    hub_id="cagataydev/neon-g1-data",
    model_path="cagataydev/neon-v1",
)
session.start()

# Give a voice command (or just speak; the mic is always on)
session.instruct("Pick up the red cup")

# Everything is being recorded to HuggingFace
# Everything is being streamed to Zenoh
# The VLA is running inference at 4 Hz

session.stop()  # Exports dataset + pushes to HuggingFace
What Gets Recorded¶
Each recording session produces a LeRobot-compatible dataset:
neon_recordings/
├── data/chunk-000/
│ ├── episode_000000.parquet # All modalities, 4fps
│ ├── episode_000001.parquet
│ └── ...
├── meta/
│ ├── info.json # Dataset metadata
│ ├── episodes.jsonl # Per-episode stats
│ └── tasks.jsonl
└── videos/
└── observation.image.camera_head/
├── episode_000000.mp4
└── ...
Episodes are segmented automatically every 60 seconds, or manually by calling session.instruct("new task").
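Because episodes are plain parquet files, a quick way to sanity-check a recording is pandas. The path below comes from the layout above; the column names are assumed to follow the training keys in the Channels table.

```python
import pandas as pd

# Inspect a single recorded episode.
df = pd.read_parquet("neon_recordings/data/chunk-000/episode_000000.parquet")
print(df.columns.tolist())      # expected: observation.state, language_instruction, ...
print(len(df), "frames")        # ~240 frames for a 60-second episode at 4 fps
print(df["language_instruction"].unique())
```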
Voice Loop¶
The audio channel creates a closed loop:
You speak → Microphone @16kHz → Whisper ASR → Text instruction
→ NeonVLA (audio + vision + joints) → Actions
→ Robot moves + PersonaPlex speaks back → Speaker output
→ Both directions recorded for training
The model learns from paired audio-action data: "I heard 'pick up the cup' while seeing the cup at (x,y,z), so I moved the arm to (x,y,z)."
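A minimal sketch of the ASR leg of that loop, assuming the openai-whisper package and a pre-recorded microphone chunk on disk; the actual Whisper integration inside Neon may be wired differently, and session is the StreamSession from the Quick Start above.

```python
import whisper

# Transcribe one chunk of the 16 kHz microphone stream...
asr = whisper.load_model("base")
result = asr.transcribe("mic_chunk.wav")
text = result["text"].strip()

# ...and feed it back to the robot as a text instruction.
if text:
    session.instruct(text)
```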
LiDAR for Scene Understanding¶
The L1 LiDAR provides 360° 3D point clouds at 10 Hz. During training, these are:
- Downsampled to 4,096 points per scan
- Encoded as spatial features alongside camera images
- Used for obstacle avoidance and workspace mapping
On gantry, LiDAR continuously maps the room — this becomes the digital twin mesh for MuJoCo simulation.
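A minimal sketch of the per-scan downsampling step, assuming a uniform random subsample with NumPy; the real pipeline may use voxel or farthest-point sampling instead.

```python
import numpy as np


def downsample_scan(points: np.ndarray, n: int = 4096) -> np.ndarray:
    """Reduce one (21600, 4) scan of (x, y, z, intensity) points to n points."""
    if len(points) <= n:
        return points
    idx = np.random.choice(len(points), size=n, replace=False)
    return points[idx]


scan = np.random.rand(21600, 4).astype(np.float32)   # stand-in for one L1 scan
assert downsample_scan(scan).shape == (4096, 4)
```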
Tool Call Recording¶
Every Strands agent decision is recorded:
# The agent decides to call a tool
session.tool_calls.push_tool_call(
    tool_name="robot.move_to",
    arguments={"x": 0.5, "y": 0.2, "z": 0.3},
    result="success",
    duration_ms=150,
)
This teaches the next model when to use tools — not just motor control, but high-level planning decisions.
Zenoh Mesh¶
All channels publish to Zenoh topics:
| Channel | Zenoh Topic |
|---|---|
| Joints | neon/joints |
| Camera | neon/camera_head |
| LiDAR | neon/lidar |
| Audio | neon/audio |
| Text | neon/text |
Any device on the network can subscribe. This enables:
- Teleop monitoring from MacBook/Quest 3
- Multi-robot coordination via shared state
- Browser visualization via WebSocket bridge
- Remote inference on a GPU server
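For example, a machine elsewhere on the mesh could watch the joint stream with the zenoh Python API. This is a sketch: the payload accessor and the per-channel deserialization are assumptions that depend on the zenoh release and on each channel's schema.

```python
import time
import zenoh


def on_joints(sample):
    # One joint frame; decode according to the joints channel schema.
    data = sample.payload.to_bytes()   # zenoh >= 1.0; older releases expose raw bytes
    print(f"{sample.key_expr}: {len(data)} bytes")


# Join the local mesh and listen to the joint channel.
zenoh_session = zenoh.open(zenoh.Config())
subscriber = zenoh_session.declare_subscriber("neon/joints", on_joints)

try:
    while True:
        time.sleep(1)   # samples arrive via the callback
except KeyboardInterrupt:
    zenoh_session.close()
```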
Training Data Pipeline¶
G1 on Gantry (always recording)
↓
StreamRecorder → LeRobot parquet + MP4
↓
HuggingFace Hub (private dataset)
↓
DataSoup (mixed with Agibot, OXE, DreamGen, voice commands)
↓
NeonVLA training on HuggingFace A100
↓
New checkpoint → deployed back to G1
↓
Better actions → better recordings → repeat
The G1 gets smarter every cycle. Every hour of gantry time becomes training data for the next model version.
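The "deployed back to G1" step can be as simple as pulling the newest checkpoint from the Hub onto the robot or its inference server. The snippet below uses huggingface_hub with the model repo from the Quick Start; the actual deployment mechanism is an assumption.

```python
from huggingface_hub import snapshot_download

# Pull the latest NeonVLA checkpoint for deployment.
checkpoint_dir = snapshot_download(repo_id="cagataydev/neon-v1")
print("New checkpoint downloaded to:", checkpoint_dir)
```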