Audio In/Out¶
The robot hears. The robot speaks. Not as an afterthought — as a first-class modality woven into the same pipeline that controls its body.
Why Audio Matters¶
Text instructions work for benchmarks. Real humans talk to robots.
- "Hey, grab that blue thing over there" — spoken while pointing
- "Stop!" — urgent, needs zero-latency processing
- "What do you see?" — the robot should answer, out loud
Audio is also information. The sound of a motor stalling. Glass breaking. A timer beeping. The world talks to us in frequencies that text can never capture.
Architecture¶
```mermaid
graph TD
    subgraph AI["Audio Input"]
        MIC["Microphone<br/>16kHz waveform"] --> WE["Whisper Encoder<br/>(frozen, ~39M)"]
        WE --> PROJ["Projection<br/>audio_dim → backbone_dim"]
    end
    subgraph VL["Visual + Language"]
        CAM["Camera"] --> VB["Video Backbone"]
        TXT["📝 Text"] --> VB
    end
    subgraph FU["Fusion"]
        PROJ --> FUS["Feature Fusion<br/>(concat + linear)"]
        VB --> FUS
        PE["Proprioception"] --> FUS
    end
    subgraph OUT["Output"]
        FUS --> AH["Action Heads → Joints"]
        FUS --> SH["Speech Head → PersonaPlex"]
    end
    style WE fill:#1565c0,color:#fff
    style VB fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff
```
The audio features fuse with visual and proprioceptive features before action decoding. The robot doesn't just hear — it integrates what it hears with what it sees and where it is, then decides what to do and what to say.
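The fusion step in the diagram is deliberately simple: concatenate the per-modality feature vectors, then project back to the backbone width. A minimal PyTorch sketch of that concat-plus-linear idea; the dimension values are illustrative, not the library's actual numbers:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the 'concat + linear' fusion block from the diagram.
    Dimensions are illustrative assumptions, not the library's real values."""
    def __init__(self, audio_dim=2048, visual_dim=2048, proprio_dim=64, fused_dim=2048):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + visual_dim + proprio_dim, fused_dim)

    def forward(self, audio_feat, visual_feat, proprio_feat):
        # Concatenate the three modality features, then project to the fused width.
        x = torch.cat([audio_feat, visual_feat, proprio_feat], dim=-1)
        return self.fuse(x)

fused = FeatureFusion()(torch.randn(1, 2048), torch.randn(1, 2048), torch.randn(1, 64))
print(fused.shape)  # torch.Size([1, 2048]) -> shared input for action and speech heads
```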
Three Strategies for Listening¶
1. Whisper Encoder (Default)¶
OpenAI's Whisper, but only the encoder half. We extract features, not transcriptions:
```python
from neon.model.audio import AudioConfig, AudioEncoder

config = AudioConfig(
    encoder_type="whisper",
    whisper_model="openai/whisper-base",  # 39M params — robust to noise and accents
    freeze_encoder=True,                  # No training cost
    sample_rate=16000,
)

encoder = AudioEncoder(config, backbone_hidden_size=2048)
features = encoder.encode(audio_tensor)  # (1, 2048) — same dim as backbone
```
Why Whisper? Pre-trained on 680K hours of speech. Understands accents, background noise, real-world audio. Frozen = zero additional training cost.
2. Mel Spectrogram CNN (Lightweight)¶
For edge deployment where even Whisper-base is too heavy — a tiny CNN (~200K params) that runs in 1ms on Jetson.
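A configuration sketch, assuming the lightweight encoder is selected through the same `AudioConfig` interface as Whisper; the `encoder_type` value `"mel_cnn"` is an assumed name, not confirmed by the library:

```python
from neon.model.audio import AudioConfig, AudioEncoder

config = AudioConfig(
    encoder_type="mel_cnn",  # assumption: identifier for the tiny mel-spectrogram CNN (~200K params)
    sample_rate=16000,
)

encoder = AudioEncoder(config, backbone_hidden_size=2048)
features = encoder.encode(audio_tensor)  # same (1, 2048) interface as the Whisper path above
```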
3. Omni Mode (Native)¶
When using Qwen2.5-Omni as the backbone, audio is processed natively alongside video in the same attention mechanism — no separate encoder needed.
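A configuration sketch, assuming the Omni backbone is selected at the top level of `NeonConfig`; the `backbone` field name and the `"omni"` encoder type are assumptions, not confirmed names:

```python
config = NeonConfig(
    backbone="Qwen/Qwen2.5-Omni-7B",  # assumption: field that selects the backbone checkpoint
    audio={
        "enabled": True,
        "encoder_type": "omni",       # assumption: route audio through the backbone itself
    },
)
```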
This is the highest-quality option. The backbone fuses audio, video, and text in a single pass.
Speaking: PersonaPlex TTS¶
The robot speaks using PersonaPlex, running on the Jetson Orin. Eight voices give the robot personality:
| Voice | Type | Character |
|---|---|---|
| `NATM0`–`NATM3` | Natural male | Default — calm, clear |
| `NATF0`–`NATF3` | Natural female | Alternative — warm, precise |
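Selecting a voice is a single field in the audio config; a minimal sketch (the import path is an assumption, and other `NeonConfig` fields such as `control_mode` are omitted here and shown in the Configuration section below):

```python
from neon.config import NeonConfig  # assumption: actual import path may differ

config = NeonConfig(
    audio={
        "enabled": True,
        "enable_speech_output": True,
        "personaplex_voice": "NATF2",  # one of the eight voices in the table above
    },
)
```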
The Speech Response Head¶
The model predicts when and what to say from the same features used for actions:
| Type | Example | Trigger |
|---|---|---|
| `narrate` | "I see the red cup. Reaching for it now." | During execution |
| `confirm` | "Got it. Working on it." | After receiving a command |
| `warn` | "I can't reach that safely." | Safety concern detected |
| `ask` | "Which cup do you mean?" | Ambiguous instruction |
| `silent` | — | Nothing needs saying |
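The head's exact architecture is not documented on this page; conceptually it is a small classifier over the same fused features that feed the action heads, choosing one of the five response types before any words are generated. A minimal PyTorch sketch of that idea, with assumed dimensions:

```python
import torch
import torch.nn as nn

RESPONSE_TYPES = ["narrate", "confirm", "warn", "ask", "silent"]

class SpeechTriggerHead(nn.Module):
    """Sketch: decide whether (and what kind of) speech to produce from fused features."""
    def __init__(self, fused_dim=2048):
        super().__init__()
        self.classifier = nn.Linear(fused_dim, len(RESPONSE_TYPES))

    def forward(self, fused_features):
        return self.classifier(fused_features)  # logits over the five response types

head = SpeechTriggerHead()
logits = head(torch.randn(1, 2048))
print(RESPONSE_TYPES[logits.argmax(dim=-1).item()])  # e.g. "confirm"
```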
Usage¶
```python
output = model.predict(
    image=camera_frame,
    instruction="Pick up the cup",
    audio=microphone_waveform,  # Spoken command (numpy, 16kHz)
    speak=True,                 # Generate speech response
)

output.actions      # → Joint commands (as usual)
output.speech_path  # → "/tmp/neon_speech_xxx.wav" (played on robot's speaker)
```
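The `audio` argument is just a mono 16 kHz numpy waveform. One way to capture it, assuming the `sounddevice` package is available on the host and a fixed 3-second listening window:

```python
import sounddevice as sd

SAMPLE_RATE = 16000  # matches AudioConfig.sample_rate
DURATION_S = 3.0     # assumption: fixed listening window

# Record a mono float32 waveform from the default microphone.
recording = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
sd.wait()  # block until the recording finishes

microphone_waveform = recording.squeeze()  # 1-D numpy array, ready for model.predict(audio=...)
```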
Training Data¶
| Dataset | Examples | Content |
|---|---|---|
| `cagataydev/vlm-voice-audio` | 1,000 | PersonaPlex-synthesized spoken commands in 8 voices |
| `cagataydev/vlm-voice-commands` | 50,000 | Robot commands across 10 categories |
The 10 Categories¶
| Category | Count | Example |
|---|---|---|
| `pick_place` | 222 | "Pick up the black jar and place it on the box" |
| `manipulation` | 149 | "Carefully twist the lid off the bottle" |
| `navigation` | 113 | "Take the path through the left side of the table" |
| `multistep` | 109 | "First grab the sponge, then wipe the counter, then put it away" |
| `observation` | 105 | "What color is the object in front of you?" |
| `spatial` | 78 | "Trace a line from the sponge to the book" |
| `household` | 77 | "Please sweep the floor near the fridge" |
| `safety` | 69 | "Stop immediately and back away" |
| `conversational` | 42 | "How are you doing today?" |
| `context_rich` | 36 | "Remember where you put the screwdriver earlier" |
Configuration¶
```python
config = NeonConfig(
    control_mode="arms_only",
    audio={
        "enabled": True,
        "encoder_type": "whisper",
        "whisper_model": "openai/whisper-base",
        "freeze_encoder": True,
        "enable_speech_output": True,
        "personaplex_voice": "NATM1",
        "personaplex_host": "http://192.168.1.151:8400",  # Thor
    },
)
```
Set audio=None (default) to disable audio entirely. Backward compatible. The robot works fine with text only — audio is additive, not required.
→ Ready to build? Training — train your own Neon model