
Audio In/Out

The robot hears. The robot speaks. Not as an afterthought — as a first-class modality woven into the same pipeline that controls its body.


Why Audio Matters

Text instructions work for benchmarks. Real humans talk to robots.

  • "Hey, grab that blue thing over there" — spoken while pointing
  • "Stop!" — urgent, needs zero-latency processing
  • "What do you see?" — the robot should answer, out loud

Audio is also information. The sound of a motor stalling. Glass breaking. A timer beeping. The world talks to us in frequencies that text can never capture.


Architecture

graph TD
    subgraph Audio Input
        MIC["Microphone<br/>16kHz waveform"] --> WE["Whisper Encoder<br/>(frozen, ~39M)"]
        WE --> PROJ["Projection<br/>audio_dim → backbone_dim"]
    end

    subgraph Visual + Language
        CAM["Camera"] --> VB["Video Backbone"]
        TXT["📝 Text"] --> VB
    end

    subgraph Fusion
        PROJ --> FUS["Feature Fusion<br/>(concat + linear)"]
        VB --> FUS
        PE["Proprioception"] --> FUS
    end

    subgraph Output
        FUS --> AH["Action Heads → Joints"]
        FUS --> SH["Speech Head → PersonaPlex"]
    end

    style WE fill:#1565c0,color:#fff
    style VB fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff

The audio features fuse with visual and proprioceptive features before action decoding. The robot doesn't just hear — it integrates what it hears with what it sees and where it is, then decides what to do and what to say.
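
The "concat + linear" fusion in the diagram can be read as: stack the three modality vectors and project back to the backbone width. Below is a minimal sketch of that idea, assuming all three features arrive as flat vectors of the backbone dimension; the module name and shapes are illustrative, not the library's API:

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Illustrative concat + linear fusion of audio, visual, and proprioceptive features."""

    def __init__(self, hidden: int = 2048):
        super().__init__()
        # Three modality vectors concatenated, then projected back to the backbone width
        self.proj = nn.Linear(3 * hidden, hidden)

    def forward(self, audio, visual, proprio):
        fused = torch.cat([audio, visual, proprio], dim=-1)  # (B, 3 * hidden)
        return self.proj(fused)                              # (B, hidden), fed to action and speech heads

fusion = FeatureFusion(hidden=2048)
out = fusion(torch.randn(1, 2048), torch.randn(1, 2048), torch.randn(1, 2048))  # (1, 2048)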


Three Strategies for Listening

1. Whisper Encoder (Default)

OpenAI's Whisper, but only the encoder half. We extract features, not transcriptions:

from neon.model.audio import AudioConfig, AudioEncoder

config = AudioConfig(
    encoder_type="whisper",
    whisper_model="openai/whisper-base",  # 39M params — robust to noise and accents
    freeze_encoder=True,                   # No gradient updates; adds no trainable parameters
    sample_rate=16000,
)

encoder = AudioEncoder(config, backbone_hidden_size=2048)

# audio_tensor: raw 16 kHz waveform (see the loading sketch below)
features = encoder.encode(audio_tensor)  # (1, 2048) — same dim as backbone

Why Whisper? Pre-trained on 680K hours of speech, it handles accents, background noise, and real-world audio. Keeping it frozen means it adds no trainable parameters.
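
In practice the microphone rarely hands you a clean 16 kHz tensor. A minimal loading-and-resampling sketch using torchaudio, assuming encode() accepts a raw mono waveform tensor and that AudioConfig exposes sample_rate as an attribute (the file name is just a placeholder):

import torchaudio

waveform, sr = torchaudio.load("command.wav")      # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)      # mix down to mono: (1, samples)
if sr != config.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, config.sample_rate)

features = encoder.encode(waveform)                # (1, 2048), as above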

2. Mel Spectrogram CNN (Lightweight)

For edge deployment where even Whisper-base is too heavy — a tiny CNN (~200K params) that runs in about 1 ms on a Jetson:

config = AudioConfig(encoder_type="mel", audio_hidden_size=512)
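
For a rough sense of what an encoder at that scale looks like, here is a sketch of a small CNN over log-mel spectrograms. It is not the library's implementation; the layer widths are assumptions and land around 160K parameters:

import torch
import torch.nn as nn
import torchaudio

class MelCNN(nn.Module):
    """Small CNN over log-mel spectrograms (roughly 160K parameters at these widths)."""

    def __init__(self, n_mels: int = 80, hidden: int = 512):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.out = nn.Linear(128, hidden)

    def forward(self, waveform):                        # (B, samples) at 16 kHz
        spec = self.mel(waveform).unsqueeze(1).log1p()  # (B, 1, n_mels, frames)
        return self.out(self.net(spec).flatten(1))      # (B, hidden)

mel_encoder = MelCNN()
features = mel_encoder(torch.randn(1, 16000))           # one second of audio -> (1, 512)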

3. Omni Mode (Native)

When using Qwen2.5-Omni as the backbone, audio is processed natively alongside video in the same attention mechanism — no separate encoder needed:

config = AudioConfig(encoder_type="omni")

This is the highest-quality option. The backbone fuses audio, video, and text in a single pass.


Speaking: PersonaPlex TTS

The robot speaks using PersonaPlex, running on the Jetson Orin. Eight voices give the robot personality:

Voice         Type            Character
NATM0–NATM3   Natural male    Default — calm, clear
NATF0–NATF3   Natural female  Alternative — warm, precise

The Speech Response Head

The model predicts when and what to say from the same features used for actions:

Type      Example                                      Trigger
narrate   "I see the red cup. Reaching for it now."    During execution
confirm   "Got it. Working on it."                     After receiving a command
warn      "I can't reach that safely."                 Safety concern detected
ask       "Which cup do you mean?"                     Ambiguous instruction
silent    (no speech)                                  Nothing needs saying
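
Read as a sketch, this is a small classification head over the fused features: pick one of the five response types, and only non-silent picks go on to text generation and PersonaPlex. The class order and module below are illustrative assumptions, not the trained head:

import torch
import torch.nn as nn

RESPONSE_TYPES = ["narrate", "confirm", "warn", "ask", "silent"]

class SpeechResponseHead(nn.Module):
    """Illustrative head: decide the response type from the same fused features used for actions."""

    def __init__(self, hidden: int = 2048):
        super().__init__()
        self.classifier = nn.Linear(hidden, len(RESPONSE_TYPES))

    def forward(self, fused):                           # (B, hidden) fused features
        logits = self.classifier(fused)
        return [RESPONSE_TYPES[i] for i in logits.argmax(dim=-1).tolist()]

head = SpeechResponseHead()
print(head(torch.randn(2, 2048)))                       # e.g. ['confirm', 'silent']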

Usage

output = model.predict(
    image=camera_frame,
    instruction="Pick up the cup",
    audio=microphone_waveform,  # Spoken command (numpy, 16kHz)
    speak=True,                  # Generate speech response
)

output.actions      # → Joint commands (as usual)
output.speech_path  # → "/tmp/neon_speech_xxx.wav" (played on robot's speaker)
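
How you obtain microphone_waveform is up to you; any capture path that yields a 16 kHz numpy array works. A minimal sketch using the sounddevice package (the 3-second window and mono capture are arbitrary choices, and sounddevice itself is an assumption here, not a dependency of the library):

import sounddevice as sd

SAMPLE_RATE = 16000
recording = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()                                   # block until the 3-second capture finishes

output = model.predict(
    image=camera_frame,
    instruction="Pick up the cup",
    audio=recording.squeeze(),              # 1-D numpy array at 16 kHz
    speak=True,
)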

Training Data

Dataset                         Examples   Content
cagataydev/vlm-voice-audio      1,000      PersonaPlex-synthesized spoken commands in 8 voices
cagataydev/vlm-voice-commands   50,000     Robot commands across 10 categories
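
The org/name format suggests these are Hugging Face Hub datasets; if so, they can be pulled with the datasets library. A hedged sketch (the split name is an assumption, and the schema should be inspected rather than guessed):

from datasets import load_dataset

commands = load_dataset("cagataydev/vlm-voice-commands", split="train")
print(len(commands))    # expect on the order of 50,000 rows
print(commands[0])      # inspect the columns instead of assuming them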

The 10 Categories

Category         Count   Example
pick_place       222     "Pick up the black jar and place it on the box"
manipulation     149     "Carefully twist the lid off the bottle"
navigation       113     "Take the path through the left side of the table"
multistep        109     "First grab the sponge, then wipe the counter, then put it away"
observation      105     "What color is the object in front of you?"
spatial          78      "Trace a line from the sponge to the book"
household        77      "Please sweep the floor near the fridge"
safety           69      "Stop immediately and back away"
conversational   42      "How are you doing today?"
context_rich     36      "Remember where you put the screwdriver earlier"

Configuration

config = NeonConfig(
    control_mode="arms_only",
    audio={
        "enabled": True,
        "encoder_type": "whisper",
        "whisper_model": "openai/whisper-base",
        "freeze_encoder": True,
        "enable_speech_output": True,
        "personaplex_voice": "NATM1",
        "personaplex_host": "http://192.168.1.151:8400",  # Thor
    },
)

Set audio=None (default) to disable audio entirely. Backward compatible. The robot works fine with text only — audio is additive, not required.


Ready to build? Training — train your own Neon model