
Audio In/Out

The robot hears. The robot speaks. Not as an afterthought — as a first-class modality woven into the same pipeline that controls its body.


Why Audio Matters

Text instructions work for benchmarks. Real humans talk to robots.

  • "Hey, grab that blue thing over there" — spoken while pointing
  • "Stop!" — urgent, needs zero-latency processing
  • "What do you see?" — the robot should answer, out loud

Audio is also information. The sound of a motor stalling. Glass breaking. A timer beeping. The world talks to us in frequencies that text can never capture.


Architecture

graph TD
    subgraph Audio Input
        MIC["Microphone<br/>16kHz waveform"] --> WE["Whisper Encoder<br/>(frozen, ~39M)"]
        WE --> PROJ["Projection<br/>audio_dim → backbone_dim"]
    end

    subgraph Visual + Language
        CAM["Camera"] --> VB["Video Backbone"]
        TXT["📝 Text"] --> VB
    end

    subgraph Fusion
        PROJ --> FUS["Feature Fusion<br/>(concat + linear)"]
        VB --> FUS
        PE["Proprioception"] --> FUS
    end

    subgraph Output
        FUS --> AH["Action Heads → Joints"]
        FUS --> SH["Speech Head → PersonaPlex"]
    end

    style WE fill:#1565c0,color:#fff
    style VB fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff

The audio features fuse with visual and proprioceptive features before action decoding. The robot doesn't just hear — it integrates what it hears with what it sees and where it is, then decides what to do and what to say.
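
The "concat + linear" fusion in the diagram can be read as: stack the three modality vectors and project back to the backbone width. Below is a minimal sketch of that idea, assuming all three features arrive as flat vectors of the backbone dimension; the module name and shapes are illustrative, not the library's API:

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Illustrative concat + linear fusion of audio, visual, and proprioceptive features."""

    def __init__(self, hidden: int = 2048):
        super().__init__()
        # Three modality vectors concatenated, then projected back to the backbone width
        self.proj = nn.Linear(3 * hidden, hidden)

    def forward(self, audio, visual, proprio):
        fused = torch.cat([audio, visual, proprio], dim=-1)  # (B, 3 * hidden)
        return self.proj(fused)                              # (B, hidden), fed to action and speech heads

fusion = FeatureFusion(hidden=2048)
out = fusion(torch.randn(1, 2048), torch.randn(1, 2048), torch.randn(1, 2048))  # (1, 2048)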


Three Strategies for Listening

1. Whisper Encoder (Default)

OpenAI's Whisper, but only the encoder half. We extract features, not transcriptions:

from neon.model.audio import AudioConfig, AudioEncoder

config = AudioConfig(
    encoder_type="whisper",
    whisper_model="openai/whisper-base",  # 39M params — robust to noise and accents
    freeze_encoder=True,                   # No gradient updates; adds no trainable parameters
    sample_rate=16000,
)

encoder = AudioEncoder(config, backbone_hidden_size=2048)

# audio_tensor: raw 16 kHz waveform (see the loading sketch below)
features = encoder.encode(audio_tensor)  # (1, 2048) — same dim as backbone

Why Whisper? Pre-trained on 680K hours of speech, it handles accents, background noise, and real-world audio. Keeping it frozen means it adds no trainable parameters.
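
In practice the microphone rarely hands you a clean 16 kHz tensor. A minimal loading-and-resampling sketch using torchaudio, assuming encode() accepts a raw mono waveform tensor and that AudioConfig exposes sample_rate as an attribute (the file name is just a placeholder):

import torchaudio

waveform, sr = torchaudio.load("command.wav")      # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)      # mix down to mono: (1, samples)
if sr != config.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, config.sample_rate)

features = encoder.encode(waveform)                # (1, 2048), as above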

2. Mel Spectrogram CNN (Lightweight)

For edge deployment where even Whisper-base is too heavy — a tiny CNN (~200K params) that runs in about 1 ms on a Jetson:

config = AudioConfig(encoder_type="mel", audio_hidden_size=512)
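
For a rough sense of what an encoder at that scale looks like, here is a sketch of a small CNN over log-mel spectrograms. It is not the library's implementation; the layer widths are assumptions and land around 160K parameters:

import torch
import torch.nn as nn
import torchaudio

class MelCNN(nn.Module):
    """Small CNN over log-mel spectrograms (roughly 160K parameters at these widths)."""

    def __init__(self, n_mels: int = 80, hidden: int = 512):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.out = nn.Linear(128, hidden)

    def forward(self, waveform):                        # (B, samples) at 16 kHz
        spec = self.mel(waveform).unsqueeze(1).log1p()  # (B, 1, n_mels, frames)
        return self.out(self.net(spec).flatten(1))      # (B, hidden)

mel_encoder = MelCNN()
features = mel_encoder(torch.randn(1, 16000))           # one second of audio -> (1, 512)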

3. Omni Mode (Native)

When using Qwen2.5-Omni as the backbone, audio is processed natively alongside video in the same attention mechanism — no separate encoder needed:

config = AudioConfig(encoder_type="omni")

This is the highest-quality option. The backbone fuses audio, video, and text in a single pass.


Speaking: PersonaPlex TTS

The robot speaks using PersonaPlex, running on the Jetson Orin. Eight voices give the robot personality:

Voice         Type            Character
NATM0–NATM3   Natural male    Default — calm, clear
NATF0–NATF3   Natural female  Alternative — warm, precise

The Speech Response Head

The model predicts when and what to say from the same features used for actions:

Type      Example                                      Trigger
narrate   "I see the red cup. Reaching for it now."    During execution
confirm   "Got it. Working on it."                     After receiving a command
warn      "I can't reach that safely."                 Safety concern detected
ask       "Which cup do you mean?"                     Ambiguous instruction
silent    (no speech)                                  Nothing needs saying
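
Read as a sketch, this is a small classification head over the fused features: pick one of the five response types, and only non-silent picks go on to text generation and PersonaPlex. The class order and module below are illustrative assumptions, not the trained head:

import torch
import torch.nn as nn

RESPONSE_TYPES = ["narrate", "confirm", "warn", "ask", "silent"]

class SpeechResponseHead(nn.Module):
    """Illustrative head: decide the response type from the same fused features used for actions."""

    def __init__(self, hidden: int = 2048):
        super().__init__()
        self.classifier = nn.Linear(hidden, len(RESPONSE_TYPES))

    def forward(self, fused):                           # (B, hidden) fused features
        logits = self.classifier(fused)
        return [RESPONSE_TYPES[i] for i in logits.argmax(dim=-1).tolist()]

head = SpeechResponseHead()
print(head(torch.randn(2, 2048)))                       # e.g. ['confirm', 'silent']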

Usage

output = model.predict(
    image=camera_frame,
    instruction="Pick up the cup",
    audio=microphone_waveform,  # Spoken command (numpy, 16kHz)
    speak=True,                  # Generate speech response
)

output.actions      # → Joint commands (as usual)
output.speech_path  # → "/tmp/neon_speech_xxx.wav" (played on robot's speaker)
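
How you obtain microphone_waveform is up to you; any capture path that yields a 16 kHz numpy array works. A minimal sketch using the sounddevice package (the 3-second window and mono capture are arbitrary choices, and sounddevice itself is an assumption here, not a dependency of the library):

import sounddevice as sd

SAMPLE_RATE = 16000
recording = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()                                   # block until the 3-second capture finishes

output = model.predict(
    image=camera_frame,
    instruction="Pick up the cup",
    audio=recording.squeeze(),              # 1-D numpy array at 16 kHz
    speak=True,
)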

Training Data

Dataset                         Examples   Content
cagataydev/vlm-voice-audio      1,000      PersonaPlex-synthesized spoken commands in 8 voices
cagataydev/vlm-voice-commands   50,000     Robot commands across 10 categories
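
The org/name format suggests these are Hugging Face Hub datasets; if so, they can be pulled with the datasets library. A hedged sketch (the split name is an assumption, and the schema should be inspected rather than guessed):

from datasets import load_dataset

commands = load_dataset("cagataydev/vlm-voice-commands", split="train")
print(len(commands))    # expect on the order of 50,000 rows
print(commands[0])      # inspect the columns instead of assuming them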

The 10 Categories

Category         Count   Example
pick_place       222     "Pick up the black jar and place it on the box"
manipulation     149     "Carefully twist the lid off the bottle"
navigation       113     "Take the path through the left side of the table"
multistep        109     "First grab the sponge, then wipe the counter, then put it away"
observation      105     "What color is the object in front of you?"
spatial          78      "Trace a line from the sponge to the book"
household        77      "Please sweep the floor near the fridge"
safety           69      "Stop immediately and back away"
conversational   42      "How are you doing today?"
context_rich     36      "Remember where you put the screwdriver earlier"

Configuration

config = NeonConfig(
    control_mode="arms_only",
    audio={
        "enabled": True,
        "encoder_type": "whisper",
        "whisper_model": "openai/whisper-base",
        "freeze_encoder": True,
        "enable_speech_output": True,
        "personaplex_voice": "NATM1",
        "personaplex_host": "http://192.168.1.151:8400",  # Thor
    },
)

Set audio=None (default) to disable audio entirely. Backward compatible. The robot works fine with text only — audio is additive, not required.


Ready to build? Training — train your own Neon model