Examples

Real, end-to-end examples: every one runs against live model inference, not mocks. Run any of them from the repo root so the package is on PYTHONPATH:

PYTHONPATH=. python examples/<name>.py

Example	Layer	What it proves
`multimodal_agent.py`	brain	image block → VLM agent (`"Green."`)
`multimodal_advanced.py`	brain	video round-trip + tool-result image (`"BRIGHTER."`, `"Blue."`)
`document_and_audio.py`	brain+tool	document block + real TTS→ASR round-trip
`audio_content_block.py`	brain	audio content block → audio-native model
`omni_audio.py`	brain	Qwen2.5-Omni audio-in and speech-out
`smolvlm_image_text.py`	tool	real VLM via the `run` path
`multimodal_pipelines.py`	tool	text/image/audio pipelines + ASR round-trip
`vision_tasks.py`	tool	detection, embeddings, depth, segmentation
`cosmos_reason_embodied.py`	tool	Cosmos-Reason2 embodied scene reasoning
`robot_reason_act_agent.py`	tool	two-step robot agent: Cosmos-Reason plans → MolmoAct acts
`molmoact_vla.py`	tool	VLA robot actions `[1,30,6]`
`openvla_vla.py`	tool	7-DoF VLA + legacy compat
`local_model_agent.py`	brain	local causal-LM brain + tool
`gap_tasks.py`	tool	zero-shot image/detection + audio/video classification
`smoke.py`	-	fast E2E gate (no big downloads)

brain = uses TransformerModel as the agent's model provider.
tool = uses use_transformers as a tool the agent calls.

FAQ & troubleshooting

A Qwen3 reply came back empty / all reasoning.

Qwen3 spends tokens inside <think>…</think> before it answers, so a small max_tokens budget can run out before any answer text is produced. Raise max_tokens, or set enable_thinking=False. Reasoning streams separately as reasoningContent.
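To check whether a reply was cut off mid-reasoning, it can help to split raw decoded text into its reasoning and answer parts. The helper below is a minimal sketch for inspecting raw text only; split_reasoning is a hypothetical name and not part of this library, which already delivers reasoning separately as reasoningContent.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Split raw model text into (reasoning, answer).

    A closed <think>...</think> block counts as reasoning. An opening
    <think> with no closing tag means generation stopped mid-reasoning,
    so everything after it is reasoning and the answer is empty.
    """
    reasoning_parts = THINK_BLOCK.findall(raw)
    answer = THINK_BLOCK.sub("", raw)
    # Handle a block that was truncated before </think> was emitted.
    open_idx = answer.find("<think>")
    if open_idx != -1:
        reasoning_parts.append(answer[open_idx + len("<think>"):])
        answer = answer[:open_idx]
    reasoning = "\n".join(part.strip() for part in reasoning_parts)
    return reasoning, answer.strip()
```

An empty answer alongside non-empty reasoning is the signature of the token budget running out, which is the case where raising max_tokens helps.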

mp3 / flac / ogg audio won't decode.

WAV works out of the box (stdlib). For compressed formats install the extra: uv pip install -e ".[audio]" (pulls in soundfile). Raw numpy waveforms always work.
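As an illustration of why WAV needs no extra dependency, the sketch below decodes 16-bit PCM WAV bytes using only the standard library. It is not the library's own loader; read_wav_pcm16 and make_test_wav are hypothetical helpers, and other sample widths are deliberately rejected.

```python
import io
import struct
import wave


def read_wav_pcm16(data: bytes):
    """Decode 16-bit PCM WAV bytes into (samples, sample_rate).

    Samples are floats in [-1.0, 1.0); multi-channel audio stays interleaved.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    ints = struct.unpack("<%dh" % count, frames)
    return [value / 32768.0 for value in ints], sample_rate


def make_test_wav(ints, sample_rate=16000) -> bytes:
    """Build a tiny mono 16-bit WAV in memory so the example needs no files."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))
    return buffer.getvalue()
```

Compressed formats such as mp3, flac, and ogg have no stdlib decoder, which is why they need the audio extra.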

trust_remote_code models (VLA, Omni).

TransformerModel and the call path pass trust_remote_code=True by default. Legacy 4.x-era models are auto-patched by core/compat.py.

Qwen2.5-Omni didn't speak.

Speech is off by default, which keeps text-only replies fast. Call model.update_config(speak=True), then read the waveform with model.get_last_audio(), which returns (np.float32 waveform, 24000): the samples and their sample rate in Hz.
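Once you have the waveform, writing it to disk takes only the standard library. The sketch below assumes a one-dimensional sequence of floats in [-1, 1] at 24000 Hz, matching the pair described above; save_waveform is a hypothetical helper, not part of this library, and a waveform with a batch or channel dimension would need flattening first.

```python
import struct
import wave


def save_waveform(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Returns the clip duration in seconds.
    """
    ints = []
    for sample in samples:
        # Clip first so out-of-range values cannot overflow int16.
        clipped = max(-1.0, min(1.0, float(sample)))
        ints.append(int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))
    return len(ints) / sample_rate
```

With speech enabled, the two values returned by model.get_last_audio() would be passed as samples and sample_rate.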

Out of memory.

Drop to a smaller model (see Choosing a model), or force device="cpu". The provider uses bf16 on GPU automatically.
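A quick way to judge whether a model can fit is to estimate the memory its weights alone require: parameter count times bytes per parameter (2 for bf16, 4 for fp32). The function below is a rough sketch under that assumption; weight_memory_gib is a hypothetical helper, and real usage is higher once activations and the KV cache are included.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3
```

For example, weight_memory_gib(7_000_000_000) gives the bf16 weight footprint of a 7B-parameter model, which you can compare against free GPU memory before loading.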

Where do generated images / audio go?

The run path writes media to disk and returns the path in the result's artifacts list (e.g. a TTS .wav).