Omni-Modal Streaming

Neon treats the robot as a continuous learning machine, not just a policy executor. Every sensor reading, every microphone sample, and every agent decision streams through typed channels that are simultaneously recorded for training and used for real-time inference.

The Streaming Architecture

┌─────────── G1 on Gantry ─────────────────────────────┐
│                                                        │
│  Joints @500Hz ──┐                                     │
│  LiDAR  @10Hz  ──┤                                     │
│  Camera @30Hz  ──┼──→ StreamRecorder → HuggingFace     │
│  Audio  @16kHz ──┤         │                           │
│  Text   (event)──┤         ▼                           │
│  Tools  (event)──┘    NeonVLA → Actions → G1 SDK       │
│                                                        │
│  ─── all channels on Zenoh mesh ───                    │
└────────────────────────────────────────────────────────┘

Every channel has a schema, a ring buffer, a Zenoh publisher, and a training serializer. When the G1 is running — on gantry, doing teleop, or executing VLA inference — everything is recorded.
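
As an illustrative sketch (not the actual Neon API), one typed channel can be thought of as a small record bundling exactly those four pieces:

from collections import deque
from dataclasses import dataclass, field

# Illustrative only: names and fields here are assumptions, not Neon's real classes.
@dataclass
class Channel:
    name: str                  # e.g. "joints"
    rate_hz: float | None      # None for event-driven channels (text, tool calls)
    zenoh_topic: str           # e.g. "neon/joints"
    training_key: str          # e.g. "observation.state"
    buffer: deque = field(default_factory=lambda: deque(maxlen=1000))  # ring buffer

    def push(self, sample) -> None:
        self.buffer.append(sample)   # retained for the training serializer
        # ... and published on self.zenoh_topic for live subscribers

joints = Channel("joints", 500.0, "neon/joints", "observation.state")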

Channels

Channel      Rate     Data                                             Training Key
Joints       500 Hz   29 positions + 29 velocities + 29 torques        observation.state
LiDAR        10 Hz    21,600 points × (x, y, z, intensity)             observation.lidar
Camera       30 Hz    Stereo 640×480 RGB                               observation.image.*
Audio        16 kHz   Mono PCM-16 (microphone in + speaker out)        observation.audio
Text         Event    Instructions, ASR transcripts, agent thoughts    language_instruction
Tool Calls   Event    Agent decisions (tool name, args, result)        tool_calls

Quick Start

from neon.streams import StreamSession

# Connect to G1 and start streaming everything
session = StreamSession.from_robot(
    robot_ip="192.168.123.10",
    hub_id="cagataydev/neon-g1-data",
    model_path="cagataydev/neon-v1",
)
session.start()

# Give a voice command (or just speak — the mic is always on)
session.instruct("Pick up the red cup")

# Everything is being recorded to HuggingFace
# Everything is being streamed to Zenoh
# The VLA is running inference at 4Hz

session.stop()  # Exports dataset + pushes to HuggingFace

What Gets Recorded

Each recording session produces a LeRobot-compatible dataset:

neon_recordings/
├── data/chunk-000/
│   ├── episode_000000.parquet   # All modalities, 4fps
│   ├── episode_000001.parquet
│   └── ...
├── meta/
│   ├── info.json                # Dataset metadata
│   ├── episodes.jsonl           # Per-episode stats
│   └── tasks.jsonl
└── videos/
    └── observation.image.camera_head/
        ├── episode_000000.mp4
        └── ...

Episodes auto-segment every 60 seconds, or can be segmented manually via session.instruct("new task").
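
Because the output is LeRobot-compatible, a recorded session can be loaded back with the lerobot library. A minimal sketch, assuming a recent lerobot release (the import path and attribute names vary slightly between versions) and the hub_id from the Quick Start:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("cagataydev/neon-g1-data")
print(ds.num_episodes, ds.num_frames)

frame = ds[0]                # one 4 fps step across all recorded modalities
print(sorted(frame.keys()))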

Voice Loop

The audio channel creates a closed loop:

You speak → Microphone @16kHz → Whisper ASR → Text instruction
    → NeonVLA (audio + vision + joints) → Actions
    → Robot moves + PersonaPlex speaks back → Speaker output
    → Both directions recorded for training

The model learns from paired audio-action data: "I heard 'pick up the cup' while seeing the cup at (x,y,z), so I moved the arm to (x,y,z)."
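
The ASR hop in this loop can be approximated with the open-source openai-whisper package (an assumption; Neon's actual ASR integration is not specified here):

import whisper

model = whisper.load_model("base")           # small multilingual checkpoint
result = model.transcribe("mic_clip.wav")    # 16 kHz mono audio works directly
instruction = result["text"].strip()

# The transcript then enters the loop as a text instruction:
# session.instruct(instruction)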

LiDAR for Scene Understanding

The L1 LiDAR provides 360° 3D point clouds at 10Hz. During training, these are:

  1. Downsampled to 4,096 points per scan
  2. Encoded as spatial features alongside camera images
  3. Used for obstacle avoidance and workspace mapping
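
A minimal sketch of the downsampling step, using uniform random subsampling with NumPy (the actual strategy is not specified and may be smarter, e.g. voxel or farthest-point sampling):

import numpy as np

def downsample_scan(points: np.ndarray, target: int = 4096) -> np.ndarray:
    # points: (N, 4) array of (x, y, z, intensity) from one L1 scan
    idx = np.random.choice(points.shape[0], size=target, replace=False)
    return points[idx]

scan = np.zeros((21_600, 4), dtype=np.float32)   # placeholder scan
print(downsample_scan(scan).shape)               # (4096, 4)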

On gantry, LiDAR continuously maps the room — this becomes the digital twin mesh for MuJoCo simulation.

Tool Call Recording

Every Strands agent decision is recorded:

# The agent decides to call a tool
session.tool_calls.push_tool_call(
    tool_name="robot.move_to",
    arguments={"x": 0.5, "y": 0.2, "z": 0.3},
    result="success",
    duration_ms=150,
)

This teaches the next model when to use tools — not just motor control, but high-level planning decisions.
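
As an illustrative sketch only (the field names below are assumptions, not the Neon schema), each recorded decision could serialize into the tool_calls training key as one JSON record per call:

import json

# Hypothetical serialization of one agent decision; field names are assumptions.
record = {
    "tool_name": "robot.move_to",
    "arguments": {"x": 0.5, "y": 0.2, "z": 0.3},
    "result": "success",
    "duration_ms": 150,
}
print(json.dumps(record))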

Zenoh Mesh

All channels publish to Zenoh topics:

Channel   Zenoh Topic
Joints    neon/joints
Camera    neon/camera_head
LiDAR     neon/lidar
Audio     neon/audio
Text      neon/text

Any device on the network can subscribe. This enables:

  • Teleop monitoring from MacBook/Quest 3
  • Multi-robot coordination via shared state
  • Browser visualization via WebSocket bridge
  • Remote inference on a GPU server
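
For example, a minimal subscriber sketch using the eclipse-zenoh Python bindings (payload decoding depends on Neon's serialization, so it is omitted here):

import time
import zenoh

def on_sample(sample):
    # Each sample carries one channel payload; decode it with Neon's serializer.
    print("received", sample.key_expr)

zenoh_session = zenoh.open(zenoh.Config())
subscriber = zenoh_session.declare_subscriber("neon/joints", on_sample)

time.sleep(10)          # keep the process alive while samples arrive
zenoh_session.close()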

Training Data Pipeline

G1 on Gantry (always recording)
    ↓
StreamRecorder → LeRobot parquet + MP4
    ↓
HuggingFace Hub (private dataset)
    ↓
DataSoup (mixed with Agibot, OXE, DreamGen, voice commands)
    ↓
NeonVLA training on HuggingFace A100
    ↓
New checkpoint → deployed back to G1
    ↓
Better actions → better recordings → repeat

The G1 gets smarter every cycle. Every hour of gantry time becomes training data for the next model version.