Strands Cosmos
NVIDIA Cosmos for Strands Agents: omnimodal world-model reasoning and generation, on local compute.
Cosmos models become first-class Strands model providers: give your agent eyes that understand physics, and hands that can generate video, audio, and robot actions, plus 45 tools spanning the full Cosmos pipeline.
| Family | Providers | Best for |
|---|---|---|
| Cosmos 3 (latest, omnimodal) | `Cosmos3ReasonerModel`, `Cosmos3GeneratorModel` | Video/image/audio/action understanding + generation |
| Cosmos-Reason2 (VLM) | `CosmosVisionModel`, `CosmosModel` | Lightweight edge VLM (Jetson Thor/Orin) |
Cosmos 3: Reason → Generate
Cosmos 3 is NVIDIA's newest model family, a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. Here it watches a real construction-site clip, describes it, then generates new videos (one with synchronized audio) from its own description, all on a single local GPU.
| ① Input video | ② Cosmos 3 understands it |
|---|---|
| (construction-site clip) | *"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions."* (`Cosmos3ReasonerModel`, caption in 5.2s) |
The reasoner distills its own understanding into a generation prompt:
"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."
Then `Cosmos3GeneratorModel` generates similar videos from that prompt (832×480, 49 frames):
| text → video | text → video + sound | image → video |
|---|---|---|
| 55.5s | 43.2s · AAC stereo 48kHz | 42.1s · from a real frame |

→ Full walkthrough: Cosmos 3 Guide · reproduce with `python examples/09_cosmos3_showcase.py`
```python
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, Cosmos3GeneratorModel

# Reasoner: text + vision -> text (local vLLM; start with `just c3-serve-reason`)
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")

# Generator: text/image -> image/video/sound (in-process Diffusers, no server)
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.",
             out_path="av.mp4", enable_sound=True)  # H264 + AAC stereo 48kHz
```
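Because the reasoner talks to a local vLLM server, it can help to confirm the server is reachable before building the agent. The sketch below is one possible check, not part of `strands_cosmos`: it assumes the server exposes the usual OpenAI-compatible `GET /v1/models` route, and the helper names `models_url` and `server_ready` are hypothetical.

```python
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/models"


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/models answers with HTTP 200.

    Assumption: the server follows the OpenAI-compatible API layout.
    """
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Refused connections, DNS failures, timeouts, and HTTP errors
        # all derive from OSError in urllib.
        return False
```

Calling `server_ready("http://localhost:8000/v1")` before constructing `Cosmos3ReasonerModel` gives a clear yes/no instead of a failure on the first request.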
Single-GPU note
The reasoner (vLLM) and generator (Diffusers) each load a 16B model. On one
~46GB GPU, stop one before running the other. CUDA pairing: CUDA 13 → cu130 +
`vllm==0.21.0`. `just c3-doctor` reports the recommended pairing for your driver.
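A rough back-of-the-envelope calculation shows why the two 16B models cannot share one ~46GB GPU. The sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only; activations, KV cache, and framework overhead would add to these numbers. The function names are illustrative.

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (decimal), weights only.

    Assumption: 2 bytes per parameter (bf16/fp16); quantized models need less.
    """
    return params_billion * bytes_per_param


def fits_together(models_billion: list[float], gpu_gb: float) -> bool:
    """True if the summed weight memory of all models fits on the GPU."""
    return sum(weight_gb(p) for p in models_billion) <= gpu_gb


print(weight_gb(16))                  # one 16B model
print(fits_together([16], 46))        # reasoner OR generator alone
print(fits_together([16, 16], 46))    # reasoner AND generator together
```

Under these assumptions a single 16B model needs about 32 GB for weights, so one fits in ~46GB while two (about 64 GB) do not.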
Cosmos-Reason2: Lightweight Edge VLM
For edge/Jetson, the Cosmos-Reason2 VLM runs as a model provider with a tiny footprint, verified on Jetson AGX Thor with Chain-of-Thought reasoning.
- Driving Analysis with Chain-of-Thought
- Robot Embodied Reasoning
- Video Captioning
- Physics Reasoning (Text-Only)

```mermaid
graph LR
    A["Strands Agent"] --> R["Cosmos 3"]
    R -->|Reasoner| U["Understand: caption · temporal · embodied · grounding"]
    R -->|Generator| G["Generate: image · video · audio · action"]
    A --> V["Cosmos-Reason2 VLM"]
    V -->|Edge| E["Driving · Robot planning · CoT"]
```
Get Started in 2 Minutes
For Cosmos 3, see the Cosmos 3 Guide (`just c3-setup-reason` / `just c3-setup-gen`).
For the Reason2 edge VLM, it works straight from pip:
```python
from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

# Analyze a dashcam video
agent("Caption in detail: <video>dashcam.mp4</video>")

# Reason about a robot's view
agent("<image>robot_view.jpg</image> What should the robot do next?")

# Physics understanding (text-only)
agent("What happens when you push a ball off the edge of a table?")
```

→ Full Quickstart | Installation
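The prompts above reference media inline with `<video>…</video>` and `<image>…</image>` tags. A small pre-flight check can catch a mistyped path before the model is invoked. This is a sketch under the assumption that the tags hold local file paths; `media_refs` and `missing_media` are hypothetical helpers, not part of `strands_cosmos`, and the provider's own parsing may differ.

```python
import re
from pathlib import Path

# Matches <video>path</video> or <image>path</image>; \1 forces matching tags.
MEDIA_TAG = re.compile(r"<(video|image)>(.*?)</\1>")


def media_refs(prompt: str) -> list[tuple[str, str]]:
    """Return (kind, path) pairs for every media tag found in the prompt."""
    return [(m.group(1), m.group(2)) for m in MEDIA_TAG.finditer(prompt)]


def missing_media(prompt: str) -> list[str]:
    """Return referenced paths that do not exist as local files."""
    return [path for _, path in media_refs(prompt) if not Path(path).is_file()]


prompt = "Caption in detail: <video>dashcam.mp4</video>"
print(media_refs(prompt))
print(missing_media(prompt))
```

A text-only prompt yields an empty list from both helpers, so the same check can run in front of every call.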
Capabilities
- Driving Analysis: traffic, hazards, and navigation from dashcam video. → Driving example
- Robot Planning: next-action prediction and 2D trajectory planning.
- Video Captioning: detailed temporal-spatial descriptions. → Video captioning
- Physics Reasoning: object permanence, causality, and plausibility. → Text reasoning
- 2D Grounding: bounding-box localization in images.
- Chain-of-Thought: `<think>` reasoning before answers. → CoT guide
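When Chain-of-Thought is enabled, the reasoning arrives before the answer inside `<think>` tags. The sketch below separates the two so an application can log the reasoning and show only the answer. It assumes a single `<think>…</think>` block in the response text; the exact output layout is an assumption, and `split_reasoning` is a hypothetical helper.

```python
import re

# DOTALL lets the reasoning span multiple lines.
THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    If no <think> block is present, reasoning is empty and the
    whole text is returned as the answer.
    """
    match = THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\nGravity acts once the ball leaves the table.\n</think>\n\nThe ball falls to the floor."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```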
Models
Cosmos 3 (omnimodal: reasoning + generation):
| Model | Size | Capability |
|---|---|---|
| Cosmos3-Nano | 16B | Omnimodal (reasoner + generator + action); fits a single ~46GB GPU |
| Cosmos3-Super | 64B | Frontier-scale (multi-GPU / tensor-parallel) |
| Cosmos3-Nano-Policy-DROID | 16B | VL robot policy (DROID) |
Cosmos-Reason2 (lightweight edge VLM):
| Model | GPU Memory | Architecture | Best For |
|---|---|---|---|
| Cosmos-Reason2-2B | 24 GB | Qwen3-VL | Edge / Jetson |
| Cosmos-Reason2-8B | 32 GB | Qwen3-VL | Desktop / Cloud |
Verified Platforms
| Platform | GPU | Status |
|---|---|---|
| Jetson AGX Thor | Thor 132 GB | Verified (with CUBLAS fix) |
| Desktop | A100 / H100 / RTX 4090 | Verified |
| Jetson Orin | Orin 32/64 GB | Verified (may need CUBLAS fix) |
Two Ways to Use
```python
from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

# 45 tools available; use any combination
agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check GPU status, probe the video, then describe what you see in /tmp/scene.mp4")
```

```python
from strands import Agent
from strands_cosmos import (
    cosmos_model_download, cosmos_quantize, cosmos_export_onnx,
    cosmos_build_engine, cosmos_serve, cosmos_inference,
)

# Agent orchestrates the full edge-deployment pipeline
agent = Agent(tools=[
    cosmos_model_download, cosmos_quantize, cosmos_export_onnx,
    cosmos_build_engine, cosmos_serve, cosmos_inference,
])
agent("Download Reason2-2B, quantize to FP8, export ONNX, build TRT engine, start server, and run a test query")
```
Performance on Jetson AGX Thor
Benchmarks with Cosmos-Reason2-2B on 132GB unified memory:
| Example | Task | Time | Recording |
|---|---|---|---|
| 01 | Text-only physics | ~11s | cast |
| 02 | Video caption (10s @ 4fps) | ~15s | cast |
| 03 | Driving analysis + CoT | ~16s | cast |
| 04 | Embodied reasoning + CoT | ~43s | cast |
| 05 | Tool invocation | ~9s | cast |
Developer Setup (Full Cosmos Ecosystem)
```shell
git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full   # Installs apt deps, Python deps, clones 6 Cosmos repos
just doctor       # Platform diagnostics: what works on THIS machine
```

`just doctor` checks repos, core tools, Python packages, media tools, TRT binaries, and GPU/CUDA, with platform-aware guidance (workstation vs Jetson vs Docker).



