Examples
Real, end-to-end examples - every one driven against live model inference, not mocked. Run any with the repo on your path:
| Example | Layer | What it proves |
|---|---|---|
| multimodal_agent.py | brain | image block → VLM agent ("Green.") |
| multimodal_advanced.py | brain | video round-trip + tool-result image ("BRIGHTER.", "Blue.") |
| document_and_audio.py | brain+tool | document block + real TTS→ASR round-trip |
| audio_content_block.py | brain | audio content block → audio-native model |
| omni_audio.py | brain | Qwen2.5-Omni audio-in and speech-out |
| smolvlm_image_text.py | tool | real VLM via the run path |
| multimodal_pipelines.py | tool | text/image/audio pipelines + ASR round-trip |
| vision_tasks.py | tool | detection, embeddings, depth, segmentation |
| cosmos_reason_embodied.py | tool | Cosmos-Reason2 embodied scene reasoning |
| robot_reason_act_agent.py | tool | two-step robot agent: Cosmos-Reason plans → MolmoAct acts |
| molmoact_vla.py | tool | VLA robot actions [1,30,6] |
| openvla_vla.py | tool | 7-DoF VLA + legacy compat |
| local_model_agent.py | brain | local causal-LM brain + tool |
| gap_tasks.py | tool | zero-shot image/detection + audio/video classification |
| smoke.py | - | fast E2E gate (no big downloads) |
- brain = uses TransformerModel as the agent's model provider.
- tool = uses use_transformers as a tool the agent calls.
FAQ & troubleshooting
A Qwen3 reply came back empty / all reasoning.
Qwen3 spends tokens inside <think>…</think> first. Raise max_tokens, or
set enable_thinking=False. Reasoning streams separately as reasoningContent.
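To see why a tight max_tokens budget can leave the visible reply empty, it may help to look at what the raw decoded text looks like when generation stops inside the reasoning block. The sketch below is illustrative only: split_reasoning is a hypothetical helper, not part of this library (which, per the note above, already surfaces reasoning separately as reasoningContent), and the sample strings are made up.

```python
import re

# Hypothetical helper, NOT part of the library: separates <think>…</think>
# reasoning from the visible answer in a raw decoded string.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model text."""
    closed = [part.strip() for part in _THINK_RE.findall(raw)]
    answer = _THINK_RE.sub("", raw)
    # An unclosed <think> means generation stopped mid-reasoning (for example
    # because the token budget ran out): everything after it is reasoning.
    head, sep, tail = answer.partition("<think>")
    if sep:
        closed.append(tail.strip())
        answer = head
    reasoning = "\n".join(part for part in closed if part)
    return reasoning, answer.strip()


# Made-up sample outputs for illustration.
complete = split_reasoning("<think>The sky scatters blue light.</think>Blue.")
truncated = split_reasoning("<think>The sky scatters blue")

print(complete)
print(truncated)
```

In the truncated case the answer part is an empty string, which matches the "empty / all reasoning" symptom: the budget was spent before the closing tag was ever produced.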
mp3 / flac / ogg audio won't decode.
WAV works out of the box (stdlib). For compressed formats install the extra:
uv pip install -e ".[audio]" (pulls in soundfile). Raw numpy waveforms
always work.
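As a rough illustration of why WAV needs no extra dependency, the sketch below decodes 16-bit mono PCM with only the Python standard library. It is not the library's actual decoder: read_wav_mono16 is a hypothetical name, and the test tone is generated in memory so the example runs without any file on disk.

```python
import io
import math
import struct
import wave


def read_wav_mono16(data: bytes) -> tuple[list[float], int]:
    """Decode 16-bit mono PCM WAV bytes into floats in [-1, 1) plus the sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)
    return [s / 32768.0 for s in ints], rate


# Build a 10 ms, 440 Hz test tone in memory (illustrative input only).
rate = 16000
tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(160)]
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes = 16-bit samples
    wav.setframerate(rate)
    wav.writeframes(struct.pack(f"<{len(tone)}h", *tone))

samples, sample_rate = read_wav_mono16(buf.getvalue())
print(len(samples), sample_rate)
```

The standard wave module only understands uncompressed PCM, which is consistent with compressed formats such as mp3, flac, and ogg needing the extra.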
trust_remote_code models (VLA, Omni).
TransformerModel and the call path pass trust_remote_code=True by
default. Legacy 4.x-era models are auto-patched by core/compat.py.
Qwen2.5-Omni didn't speak.
Speech is off by default (keeps text fast). Set
model.update_config(speak=True), then read the waveform with
model.get_last_audio() → (np.float32 waveform, 24000 sample rate).
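Once a float waveform and its sample rate are in hand, one way to save them is as 16-bit PCM WAV using only the standard library. This is a sketch under assumptions: float_waveform_to_wav is a hypothetical helper, the waveform is assumed to be mono with values in [-1, 1], and a synthetic sine stands in for real model output so the example runs without loading a model.

```python
import io
import math
import struct
import wave


def float_waveform_to_wav(waveform, sample_rate: int) -> bytes:
    """Encode mono floats in [-1, 1] as 16-bit PCM WAV bytes."""
    # Clip to [-1, 1] first so out-of-range samples cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in waveform]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()


# Stand-in for a real waveform: 0.1 s of a 220 Hz sine at 24 kHz.
fake_waveform = [0.3 * math.sin(2 * math.pi * 220 * n / 24000) for n in range(2400)]
wav_bytes = float_waveform_to_wav(fake_waveform, 24000)

# Writing wav_bytes to a file yields a playable clip, for example:
#   open("speech.wav", "wb").write(wav_bytes)
```

Because the helper only iterates over the samples and calls float() on each, a one-dimensional numpy float32 array should work as the waveform argument as well.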
Out of memory.
Drop to a smaller model (see Choosing a model), or force device="cpu".
The provider uses bf16 on GPU automatically.
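A back-of-the-envelope estimate can help decide how small "smaller" needs to be. The sketch below only counts weight storage (parameters times bytes per parameter); activations, the KV cache, and framework overhead come on top, so treat the numbers as a lower bound. The helper name and the parameter counts are illustrative assumptions, not measurements of any specific model.

```python
# Bytes used to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Illustrative parameter counts, not tied to any particular checkpoint.
for name, params in [("0.5B", 0.5e9), ("3B", 3e9), ("7B", 7e9)]:
    bf16 = weight_memory_gib(params)
    fp32 = weight_memory_gib(params, "float32")
    print(f"{name}: bf16 ~{bf16:.1f} GiB, fp32 ~{fp32:.1f} GiB")
```

The comparison also shows what falling back to CPU may cost if the weights end up in float32 there: the footprint doubles relative to bf16, so a smaller model is often the more practical fix.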
Where do generated images / audio go?
The run path writes media to disk and returns the path in the result's
artifacts list (e.g. a TTS .wav).