Omni-Modal Streaming

Neon treats the robot as a continuous learning machine, not just a policy executor. Every sensor reading, every microphone sample, and every agent decision streams through typed channels that are simultaneously recorded for training and used for real-time inference.

The Streaming Architecture

┌─────────── G1 on Gantry ─────────────────────────────┐
│                                                        │
│  Joints @500Hz ──┐                                     │
│  LiDAR  @10Hz  ──┤                                     │
│  Camera @30Hz  ──┼──→ StreamRecorder → HuggingFace     │
│  Audio  @16kHz ──┤         │                           │
│  Text   (event)──┤         ▼                           │
│  Tools  (event)──┘    NeonVLA → Actions → G1 SDK       │
│                                                        │
│  ─── all channels on Zenoh mesh ───                    │
└────────────────────────────────────────────────────────┘

Every channel has a schema, a ring buffer, a Zenoh publisher, and a training serializer. When the G1 is running — on gantry, doing teleop, or executing VLA inference — everything is recorded.
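
As an illustrative sketch (not the actual Neon API), one typed channel can be thought of as a small record bundling exactly those four pieces:

from collections import deque
from dataclasses import dataclass, field

# Illustrative only: names and fields here are assumptions, not Neon's real classes.
@dataclass
class Channel:
    name: str                  # e.g. "joints"
    rate_hz: float | None      # None for event-driven channels (text, tool calls)
    zenoh_topic: str           # e.g. "neon/joints"
    training_key: str          # e.g. "observation.state"
    buffer: deque = field(default_factory=lambda: deque(maxlen=1000))  # ring buffer

    def push(self, sample) -> None:
        self.buffer.append(sample)   # retained for the training serializer
        # ... and published on self.zenoh_topic for live subscribers

joints = Channel("joints", 500.0, "neon/joints", "observation.state")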

Channels

Channel      Rate     Data                                             Training Key
Joints       500 Hz   29 positions + 29 velocities + 29 torques        observation.state
LiDAR        10 Hz    21,600 points × (x, y, z, intensity)             observation.lidar
Camera       30 Hz    Stereo 640×480 RGB                               observation.image.*
Audio        16 kHz   Mono PCM-16 (microphone in + speaker out)        observation.audio
Text         Event    Instructions, ASR transcripts, agent thoughts    language_instruction
Tool Calls   Event    Agent decisions (tool name, args, result)        tool_calls

Quick Start

from neon.streams import StreamSession

# Connect to G1 and start streaming everything
session = StreamSession.from_robot(
    robot_ip="192.168.123.10",
    hub_id="cagataydev/neon-g1-data",
    model_path="cagataydev/neon-v1",
)
session.start()

# Give a voice command (or just speak — the mic is always on)
session.instruct("Pick up the red cup")

# Everything is being recorded to HuggingFace
# Everything is being streamed to Zenoh
# The VLA is running inference at 4Hz

session.stop()  # Exports dataset + pushes to HuggingFace

What Gets Recorded

Each recording session produces a LeRobot-compatible dataset:

neon_recordings/
├── data/chunk-000/
│   ├── episode_000000.parquet   # All modalities, 4fps
│   ├── episode_000001.parquet
│   └── ...
├── meta/
│   ├── info.json                # Dataset metadata
│   ├── episodes.jsonl           # Per-episode stats
│   └── tasks.jsonl
└── videos/
    └── observation.image.camera_head/
        ├── episode_000000.mp4
        └── ...

Episodes auto-segment every 60 seconds, or can be segmented manually via session.instruct("new task").
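
Because the output is LeRobot-compatible, a recorded session can be loaded back with the lerobot library. A minimal sketch, assuming a recent lerobot release (the import path and attribute names vary slightly between versions) and the hub_id from the Quick Start:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("cagataydev/neon-g1-data")
print(ds.num_episodes, ds.num_frames)

frame = ds[0]                # one 4 fps step across all recorded modalities
print(sorted(frame.keys()))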

Voice Loop

The audio channel creates a closed loop:

You speak → Microphone @16kHz → Whisper ASR → Text instruction
    → NeonVLA (audio + vision + joints) → Actions
    → Robot moves + PersonaPlex speaks back → Speaker output
    → Both directions recorded for training

The model learns from paired audio-action data: "I heard 'pick up the cup' while seeing the cup at (x,y,z), so I moved the arm to (x,y,z)."
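
The ASR hop in this loop can be approximated with the open-source openai-whisper package (an assumption; Neon's actual ASR integration is not specified here):

import whisper

model = whisper.load_model("base")           # small multilingual checkpoint
result = model.transcribe("mic_clip.wav")    # 16 kHz mono audio works directly
instruction = result["text"].strip()

# The transcript then enters the loop as a text instruction:
# session.instruct(instruction)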

LiDAR for Scene Understanding

The L1 LiDAR provides 360° 3D point clouds at 10Hz. During training, these are:

  1. Downsampled to 4,096 points per scan
  2. Encoded as spatial features alongside camera images
  3. Used for obstacle avoidance and workspace mapping
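
A minimal sketch of the downsampling step, using uniform random subsampling with NumPy (the actual strategy is not specified and may be smarter, e.g. voxel or farthest-point sampling):

import numpy as np

def downsample_scan(points: np.ndarray, target: int = 4096) -> np.ndarray:
    # points: (N, 4) array of (x, y, z, intensity) from one L1 scan
    idx = np.random.choice(points.shape[0], size=target, replace=False)
    return points[idx]

scan = np.zeros((21_600, 4), dtype=np.float32)   # placeholder scan
print(downsample_scan(scan).shape)               # (4096, 4)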

On gantry, LiDAR continuously maps the room — this becomes the digital twin mesh for MuJoCo simulation.

Tool Call Recording

Every Strands agent decision is recorded:

# The agent decides to call a tool
session.tool_calls.push_tool_call(
    tool_name="robot.move_to",
    arguments={"x": 0.5, "y": 0.2, "z": 0.3},
    result="success",
    duration_ms=150,
)

This teaches the next model when to use tools — not just motor control, but high-level planning decisions.
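
As an illustrative sketch only (the field names below are assumptions, not the Neon schema), each recorded decision could serialize into the tool_calls training key as one JSON record per call:

import json

# Hypothetical serialization of one agent decision; field names are assumptions.
record = {
    "tool_name": "robot.move_to",
    "arguments": {"x": 0.5, "y": 0.2, "z": 0.3},
    "result": "success",
    "duration_ms": 150,
}
print(json.dumps(record))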

Zenoh Mesh

All channels publish to Zenoh topics:

Channel   Zenoh Topic
Joints    neon/joints
Camera    neon/camera_head
LiDAR     neon/lidar
Audio     neon/audio
Text      neon/text

Any device on the network can subscribe. This enables:

  • Teleop monitoring from MacBook/Quest 3
  • Multi-robot coordination via shared state
  • Browser visualization via WebSocket bridge
  • Remote inference on a GPU server
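
For example, a minimal subscriber sketch using the eclipse-zenoh Python bindings (payload decoding depends on Neon's serialization, so it is omitted here):

import time
import zenoh

def on_sample(sample):
    # Each sample carries one channel payload; decode it with Neon's serializer.
    print("received", sample.key_expr)

zenoh_session = zenoh.open(zenoh.Config())
subscriber = zenoh_session.declare_subscriber("neon/joints", on_sample)

time.sleep(10)          # keep the process alive while samples arrive
zenoh_session.close()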

Training Data Pipeline

G1 on Gantry (always recording)
    ↓
StreamRecorder → LeRobot parquet + MP4
    ↓
HuggingFace Hub (private dataset)
    ↓
DataSoup (mixed with Agibot, OXE, DreamGen, voice commands)
    ↓
NeonVLA training on HuggingFace A100
    ↓
New checkpoint → deployed back to G1
    ↓
Better actions → better recordings → repeat

The G1 gets smarter every cycle. Every hour of gantry time becomes training data for the next model version.