Skip to content

Cosmos 3 — Omnimodal World Models

Cosmos 3 is NVIDIA's omnimodal world-model family built on a unified Mixture-of-Transformers (MoT) architecture that jointly processes and generates language, images, video, audio, and action sequences. strands-cosmos provides first-class support for it as Strands model providers + justfile-backed tools — running entirely on local compute.

Two runtime surfaces

Surface Inputs Outputs strands-cosmos artifact
Reasoner text, vision text Cosmos3ReasonerModel (vLLM)
Generator text, vision, sound, action vision, sound, action Cosmos3GeneratorModel (Diffusers) + Cosmos Framework (action)

Model family

Model Size Role
nvidia/Cosmos3-Nano 16B Omnimodal — fits a single ~46GB GPU
nvidia/Cosmos3-Super 64B Frontier-scale (multi-GPU / tensor-parallel)
nvidia/Cosmos3-Nano-Policy-DROID 16B VL robot policy (DROID)

Hardware & CUDA pairing

Cosmos 3 backends pin a CUDA build of torch/vllm that must match your driver:

Driver CUDA torch backend vLLM
13.x cu130 vllm==0.21.0
12.8 cu128 vllm==0.19.1

just c3-doctor reports your GPU, driver CUDA, the recommended pairing, venv status, and free disk.

Single-GPU memory

The Reasoner (vLLM) and the Generator (Diffusers) each load a 16B model. On a single ~46GB GPU they cannot run simultaneously — stop one before starting the other, or dedicate separate GPUs.


Reasoner — video & image understanding (vLLM)

just c3-setup-reason      # one-time: vllm==0.21.0 + vllm-cosmos3 (cu130)
just c3-serve-reason      # serve Cosmos3-Nano on :8000 (--max-model-len 32768)
from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel

agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))

# Detailed captioning
agent("Caption in detail: <video>scene.mp4</video>")

# Temporal localization with timestamps
agent("List the notable events with approximate timestamps: <video>scene.mp4</video>")

# Embodied next-action with explicit reasoning
agent.model.update_config(reasoning=True)
agent("What is the most likely next action? <video>robot.mp4</video>")

Reasoner capabilities (each has a dedicated tool):

Tool Task
cosmos3_caption Detailed captioning
cosmos3_temporal Event detection + timestamps
cosmos3_embodied Next-action prediction
cosmos3_ground 2D bounding boxes (JSON)
cosmos3_plausibility Physical plausibility label
cosmos3_situation Situation understanding
cosmos3_action_cot Trajectory / driving CoT

Generator — image, video & sound (Diffusers, in-process)

# Option A — pip extra (Diffusers + cosmos_guardrail + soundfile):
pip install "strands-cosmos[cosmos3-gen]"
# Cosmos3OmniPipeline needs the diffusers dev build:
pip install -U "git+https://github.com/huggingface/diffusers.git"

# Option B — justfile (dedicated CUDA-matched venv):
just c3-setup-gen   # diffusers(main) + cosmos_guardrail + soundfile (cu130)
from strands_cosmos import Cosmos3GeneratorModel
m = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")

# Text → image
m.generate(mode="text2image", prompt="A robot in a warehouse.",
           out_path="img.png", resolution="480")

# Text → video
m.generate(mode="text2video", prompt="A robot navigates a warehouse aisle.",
           out_path="vid.mp4", num_frames=49, fps=16, num_inference_steps=25)

# Image → video
m.generate(mode="image2video", prompt="It begins to move forward.",
           image="img.png", out_path="i2v.mp4")

# Text → video WITH SOUND (H264 video + AAC stereo 48kHz)
m.generate(mode="text2video-with-sound", prompt="A robot arm pours water.",
           out_path="av.mp4", enable_sound=True)

Sound is generated in-process by the omni pipeline (Cosmos3OmniPipelineOutput returns both video frames and a stereo sound tensor); strands-cosmos muxes it into the MP4 via ffmpeg. No vLLM-Omni server is required for image/video/sound.


Action / World-Model (Cosmos Framework)

Action generation (forward/inverse dynamics, policy) runs through the native Cosmos Framework via torchrun.

just c3-setup-framework   # one-time: clone cosmos-framework + uv sync (cu130-train)

Each run is described by a JSONL spec (one line per run):

{
  "model_mode": "forward_dynamics",
  "name": "av_forward",
  "vision_path": ".../images/av_0.jpg",
  "action_path": ".../actions/av_traj_forward.json",
  "domain_name": "av",
  "action_chunk_size": 60,
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
from strands_cosmos import cosmos3_forward_dynamics
cosmos3_forward_dynamics(input_jsonl="fd_av.jsonl", out="/tmp/c3_action_out")
# → /tmp/c3_action_out/av_forward/vision.mp4  (future-state rollout)
Tool Task
cosmos3_forward_dynamics start image + action chunk → future video
cosmos3_inverse_dynamics video + instruction → predicted action chunk
cosmos3_policy image + instruction → action chunk + rollout video

Embodiments include autonomous vehicle (av), DROID/Bridge/UMI single-arm robots, dual-arm, and humanoid — see the Cosmos cookbooks.


justfile reference

just c3-doctor             # environment check (GPU/CUDA/uv/venvs/disk)
just c3-setup-reason       # Reasoner env (vLLM + vllm-cosmos3)
just c3-setup-gen          # Generator env (Diffusers)
just c3-setup-framework    # Action env (Cosmos Framework)
just c3-serve-reason       # start Cosmos3-Nano reasoner server
just c3-serve-stop-reason  # stop it
just c3-serve-status       # server status
just c3-reason "<prompt>" "<image>" "<video>" "<task>"
just c3-gen <mode> "<prompt>" "<image>" <out>
just c3-action <input.jsonl> <out>