Strands Cosmos

NVIDIA Cosmos for Strands Agents: omnimodal world-model reasoning and generation, on local compute.

Cosmos models become first-class Strands model providers: give your agent eyes that understand physics, and hands that can generate video, audio, and robot actions, plus 45 tools spanning the full Cosmos pipeline.

| Family | Providers | Best for |
| --- | --- | --- |
| Cosmos 3 (latest, omnimodal) | Cosmos3ReasonerModel, Cosmos3GeneratorModel | Video/image/audio/action understanding + generation |
| Cosmos-Reason2 (VLM) | CosmosVisionModel, CosmosModel | Lightweight edge VLM (Jetson Thor/Orin) |

🌌 Cosmos 3: Reason → Generate

Cosmos 3 is NVIDIA's newest model family: a unified Mixture-of-Transformers that jointly understands and generates text, images, video, audio, and action. Here it watches a real construction-site clip, describes it, then generates new videos (one with synchronized audio) from its own description, all on a single local GPU.

① Input video (input)

② Cosmos 3 understands it:

> *"Two construction workers wearing yellow safety vests and helmets are walking away from the camera on a dirt path within a bustling construction site. The ground is covered in loose soil, with visible tire tracks crisscrossing the surface. In the background, a large yellow front-end loader moves slowly across the site, its bucket raised slightly as it navigates the terrain. Behind the loader, partially obscured by rebar and concrete slabs, an excavator operates near a foundation area. The scene is framed by urban buildings in the distance, including a distinctive church-like structure with a tall spire and modern glass-fronted buildings. The overall atmosphere suggests active progress on a significant infrastructure project under clear daylight conditions."* (`Cosmos3ReasonerModel`, caption in 5.2s)

The reasoner distills its own understanding into a generation prompt:

"Two construction workers in yellow safety vests and helmets walk across a dusty site, gesturing toward a yellow front loader and distant excavator as they converse."

Then `Cosmos3GeneratorModel` generates similar videos from that prompt (832×480, 49 frames):

| text → video | text → video + 🔊 sound | image → video |
| --- | --- | --- |
| text2video | text2video+sound | image2video |
| 55.5s | 43.2s · AAC stereo 48kHz | 42.1s · from a real frame |

→ Full walkthrough: Cosmos 3 Guide · reproduce with `python examples/09_cosmos3_showcase.py`

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel, Cosmos3GeneratorModel

# Reasoner: text + vision -> text (local vLLM; start with `just c3-serve-reason`)
agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))
agent("Caption in detail: <video>scene.mp4</video>")

# Generator: text/image -> image/video/sound (in-process Diffusers, no server)
gen = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")
gen.generate(mode="text2video-with-sound", prompt="A robot pours water.",
             out_path="av.mp4", enable_sound=True)   # H264 + AAC stereo 48kHz
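
The reason-then-generate flow from the showcase can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the project's own pipeline: `reason_then_generate` is a hypothetical name, the agent's result is assumed to convert to text with `str()`, the wording of the distillation prompt is illustrative, and the only generator call used is the `generate(mode=..., prompt=..., out_path=..., enable_sound=...)` form shown above.

```python
def reason_then_generate(agent, gen, video_path, out_path):
    """Caption a clip, distill the caption into a prompt, then generate a new clip.

    `agent` is any callable that takes a prompt string (for example a Strands
    Agent backed by Cosmos3ReasonerModel). `gen` is any object exposing the
    `generate` call shown above (for example Cosmos3GeneratorModel).
    Both are passed in so the helper can be exercised with stubs.
    """
    # Step 1: detailed caption of the input video.
    caption = str(agent(f"Caption in detail: <video>{video_path}</video>")).strip()

    # Step 2: ask the reasoner to compress its own caption (illustrative wording).
    prompt = str(agent(
        "Rewrite this caption as a one-sentence video generation prompt:\n" + caption
    )).strip()

    # Step 3: generate a new clip with sound from the distilled prompt.
    gen.generate(mode="text2video-with-sound", prompt=prompt,
                 out_path=out_path, enable_sound=True)
    return caption, prompt
```

On a single GPU, the reasoner server would need to be stopped between steps 2 and 3, as the note below explains.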

Single-GPU note

The reasoner (vLLM) and generator (Diffusers) each load a 16B model; on one ~46GB GPU, stop one before running the other. CUDA pairing: CUDA 13 → cu130 + vllm==0.21.0. `just c3-doctor` reports the recommended pairing for your driver.
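
A rough back-of-the-envelope check shows why the two models cannot share one ~46GB GPU. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, KV cache, and framework overhead; the function names are illustrative and the real footprint will be higher.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Assumes 2 bytes per parameter (bf16/fp16) unless told otherwise.
    """
    return params_billion * bytes_per_param


def fits(models_billion, gpu_gb, bytes_per_param=2):
    """True if the summed weights fit in gpu_gb (weights only, no overhead)."""
    total = sum(weights_gb(p, bytes_per_param) for p in models_billion)
    return total <= gpu_gb


print(fits([16], 46))      # one 16B model: 32 GB of weights -> True
print(fits([16, 16], 46))  # reasoner + generator: 64 GB of weights -> False
```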


Cosmos-Reason2 โ€” Lightweight Edge VLM

For edge/Jetson, the Cosmos-Reason2 VLM runs as a model provider with a tiny footprint, verified on Jetson AGX Thor with Chain-of-Thought reasoning.

graph LR
    A["🗣️ Strands Agent"] --> R["Cosmos 3"]
    R -->|Reasoner| U["📹 Understand: caption · temporal · embodied · grounding"]
    R -->|Generator| G["🎬 Generate: image · video · 🔊 audio · 🤖 action"]
    A --> V["Cosmos-Reason2 VLM"]
    V -->|Edge| E["🚗 Driving · Robot planning · CoT"]

Get Started in 2 Minutes

pip install strands-cosmos

For Cosmos 3, see the Cosmos 3 Guide (`just c3-setup-reason` / `just c3-setup-gen`). The Reason2 edge VLM works straight from pip:

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

# Analyze a dashcam video
agent("Caption in detail: <video>dashcam.mp4</video>")

# Reason about a robot's view
agent("<image>robot_view.jpg</image> What should the robot do next?")

# Physics understanding (text-only)
agent("What happens when you push a ball off the edge of a table?")

→ Full Quickstart | Installation
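
The prompts above embed media as `<video>path</video>` or `<image>path</image>` tags. A small helper can build such prompts from a file path. This is a sketch: `media_prompt` is a hypothetical name, and the extension lists are assumptions rather than the set of formats the provider actually accepts.

```python
from pathlib import Path

# Assumed extension sets; adjust to whatever your pipeline really supports.
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def media_prompt(instruction, path):
    """Return `instruction` followed by the media path wrapped in the matching tag."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        tag = "video"
    elif ext in IMAGE_EXTS:
        tag = "image"
    else:
        raise ValueError(f"unsupported media type: {ext or path}")
    return f"{instruction} <{tag}>{path}</{tag}>"


print(media_prompt("Caption in detail:", "dashcam.mp4"))
```

The result can be passed directly to the agent, for example `agent(media_prompt("Caption in detail:", "dashcam.mp4"))`.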


Capabilities

  • 🚗 Driving Analysis

    Traffic, hazards, navigation from dashcam video

    → Driving example

  • 🤖 Robot Planning

    Next-action prediction, 2D trajectory planning

    → Embodied reasoning

  • 🎬 Video Captioning

    Detailed temporal-spatial descriptions

    → Video captioning

  • ⚛️ Physics Reasoning

    Object permanence, causality, plausibility

    → Text reasoning

  • 🔍 2D Grounding

    Bounding box localization in images

  • 🧠 Chain-of-Thought

    <think> reasoning before answers

    → CoT guide
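
When Chain-of-Thought is enabled, the reasoning arrives before the answer inside a `<think>` block. The sketch below separates the two. It assumes the block is closed with `</think>` and appears at most once per response; `split_think` is a hypothetical helper, not part of the package.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_think(
    "<think>\nGravity acts once the ball leaves the table.\n</think>\n\nIt falls to the floor."
)
print(reasoning)
print(answer)
```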


Models

Cosmos 3 (omnimodal, reasoning + generation):

| Model | Size | Capability |
| --- | --- | --- |
| Cosmos3-Nano | 16B | Omnimodal (reasoner + generator + action); fits a single ~46GB GPU |
| Cosmos3-Super | 64B | Frontier-scale (multi-GPU / tensor-parallel) |
| Cosmos3-Nano-Policy-DROID | 16B | VL robot policy (DROID) |

Cosmos-Reason2 (lightweight edge VLM):

| Model | GPU Memory | Architecture | Best For |
| --- | --- | --- | --- |
| Cosmos-Reason2-2B | 24 GB | Qwen3-VL | Edge / Jetson |
| Cosmos-Reason2-8B | 32 GB | Qwen3-VL | Desktop / Cloud |
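
The memory figures in the table can drive a simple model choice. The sketch below picks the largest Reason2 model whose listed GPU memory fits the device. `pick_reason2_model` is a hypothetical helper, and the 8B model id is assumed by analogy with `nvidia/Cosmos-Reason2-2B`.

```python
# (model id, GPU memory in GB) taken from the table above.
# The 8B id is an assumption by analogy with the 2B id.
REASON2_MODELS = [
    ("nvidia/Cosmos-Reason2-2B", 24),
    ("nvidia/Cosmos-Reason2-8B", 32),
]


def pick_reason2_model(gpu_gb):
    """Return the id of the largest listed model that fits in gpu_gb."""
    fitting = [(need, model_id) for model_id, need in REASON2_MODELS if need <= gpu_gb]
    if not fitting:
        raise ValueError(f"no listed Reason2 model fits in {gpu_gb} GB")
    # Tuples sort by required memory first, so max() picks the largest model.
    return max(fitting)[1]


print(pick_reason2_model(24))
print(pick_reason2_model(132))
```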

Verified Platforms

| Platform | GPU | Status |
| --- | --- | --- |
| Jetson AGX Thor | Thor 132 GB | ✅ (with CUBLAS fix) |
| Desktop | A100 / H100 / RTX 4090 | ✅ |
| Jetson Orin | Orin 32/64 GB | ✅ (may need CUBLAS fix) |

Two Ways to Use

# As a model provider
from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)
agent("Describe this scene: <video>scene.mp4</video>")

# As agent tools
from strands import Agent
from strands_cosmos import cosmos_reason_hf, video_probe, cosmos_sysinfo

# 45 tools available; use any combination
agent = Agent(tools=[cosmos_reason_hf, video_probe, cosmos_sysinfo])
agent("Check GPU status, probe the video, then describe what you see in /tmp/scene.mp4")

from strands import Agent
from strands_cosmos import (
    cosmos_model_download, cosmos_quantize, cosmos_export_onnx,
    cosmos_build_engine, cosmos_serve, cosmos_inference,
)

# Agent orchestrates the full edge-deployment pipeline
agent = Agent(tools=[
    cosmos_model_download, cosmos_quantize, cosmos_export_onnx,
    cosmos_build_engine, cosmos_serve, cosmos_inference,
])
agent("Download Reason2-2B, quantize to FP8, export ONNX, build TRT engine, start server, and run a test query")

Performance on Jetson AGX Thor

Benchmarks with Cosmos-Reason2-2B on 132GB unified memory:

| Example | Task | Time | Recording |
| --- | --- | --- | --- |
| 01 | Text-only physics | ~11s | cast |
| 02 | Video caption (10s @ 4fps) | ~15s | cast |
| 03 | Driving analysis + CoT | ~16s | cast |
| 04 | Embodied reasoning + CoT | ~43s | cast |
| 05 | Tool invocation | ~9s | cast |


Developer Setup (Full Cosmos Ecosystem)

git clone https://github.com/cagataycali/strands-cosmos && cd strands-cosmos
just setup-full    # Installs apt deps, Python deps, clones 6 Cosmos repos
just doctor        # Platform diagnostics: what works on THIS machine

`just doctor` checks repos, core tools, Python packages, media tools, TRT binaries, and GPU/CUDA, and gives platform-aware guidance (workstation vs Jetson vs Docker).
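
To illustrate the kind of check such a diagnostic performs, here is a minimal sketch that looks for binaries on `PATH`. It is not the implementation behind `just doctor`: the function names are hypothetical, and the binary names in the usage line (`ffmpeg`, `ffprobe`, `trtexec`) are assumed examples of "media tools" and "TRT binaries".

```python
import shutil


def check_binaries(names, which=shutil.which):
    """Map each binary name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in names}


def report(found):
    """Format the result as one 'name status path' line per binary."""
    lines = []
    for name, path in sorted(found.items()):
        status = "ok" if path else "missing"
        lines.append(f"{name:<10} {status:<8} {path or ''}".rstrip())
    return "\n".join(lines)


# Assumed example binaries; the real diagnostic may check a different set.
print(report(check_binaries(["ffmpeg", "ffprobe", "trtexec"])))
```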


Resources