Cosmos 3: Omnimodal World Models

Cosmos 3 is NVIDIA's omnimodal world-model family, built on a unified Mixture-of-Transformers (MoT) architecture that jointly processes and generates language, images, video, audio, and action sequences. strands-cosmos supports it through Strands model providers and justfile-backed tools, all running on local compute.

Two runtime surfaces

Surface	Inputs	Outputs	strands-cosmos artifact
Reasoner	text, vision	text	`Cosmos3ReasonerModel` (vLLM)
Generator	text, vision, sound, action	vision, sound, action	`Cosmos3GeneratorModel` (Diffusers) + Cosmos Framework (action)

Model family

Model	Size	Role
`nvidia/Cosmos3-Nano`	16B	Omnimodal — fits a single ~46GB GPU
`nvidia/Cosmos3-Super`	64B	Frontier-scale (multi-GPU / tensor-parallel)
`nvidia/Cosmos3-Nano-Policy-DROID`	16B	VL robot policy (DROID)

Hardware & CUDA pairing

Cosmos 3 backends pin a CUDA build of torch/vllm that must match your driver:

Driver CUDA	torch backend	vLLM
13.x	`cu130`	`vllm==0.21.0`
12.8	`cu128`	`vllm==0.19.1`

`just c3-doctor` reports your GPU, driver CUDA version, the recommended pairing, venv status, and free disk space.
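The pairing table can also be expressed as a small lookup, which may be handy in setup scripts. This is only a sketch: `Pairing` and `recommended_pairing` are hypothetical names that are not part of strands-cosmos, the rows are copied from the table above, and `just c3-doctor` remains the authoritative check.

```python
from typing import NamedTuple


class Pairing(NamedTuple):
    torch_backend: str
    vllm_pin: str


# Rows copied from the pairing table; keys are driver CUDA versions.
PAIRINGS = {
    "13": Pairing("cu130", "vllm==0.21.0"),    # any 13.x driver
    "12.8": Pairing("cu128", "vllm==0.19.1"),  # exactly 12.8
}


def recommended_pairing(driver_cuda: str) -> Pairing:
    """Return the documented pairing for a driver CUDA version like '13.0'."""
    parts = driver_cuda.strip().split(".")
    if parts[0] == "13":
        return PAIRINGS["13"]
    key = ".".join(parts[:2])
    if key in PAIRINGS:
        return PAIRINGS[key]
    # Versions outside the table are deliberately rejected rather than guessed.
    raise ValueError(f"no documented pairing for driver CUDA {driver_cuda!r}")
```

Driver versions that do not appear in the table raise an error instead of falling back to a nearby row, since the source documents only these two pairings.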

Single-GPU memory

The Reasoner (vLLM) and the Generator (Diffusers) each load a 16B model, so they cannot run simultaneously on a single ~46GB GPU. Stop one before starting the other, or dedicate a separate GPU to each.
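A rough back-of-the-envelope estimate shows why the two cannot share one GPU. The sketch below assumes 16-bit weights (2 bytes per parameter), which the text above does not state, and it counts weights only, ignoring KV cache, activations, and framework overhead; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (10**9 bytes).

    bytes_per_param=2 is an assumption (bf16/fp16 weights).
    """
    return float(params_billion * bytes_per_param)


GPU_GB = 46  # the single-GPU size mentioned above

reasoner_gb = weight_memory_gb(16)   # weights of one 16B model
generator_gb = weight_memory_gb(16)  # weights of the other 16B model

fits_alone = reasoner_gb < GPU_GB                      # one model at a time
fits_together = reasoner_gb + generator_gb < GPU_GB    # both models at once

print(f"one model: {reasoner_gb:.0f} GB, both: {reasoner_gb + generator_gb:.0f} GB")
print(f"fits alone: {fits_alone}, fits together: {fits_together}")
```

Under these assumptions one model needs about 32 GB for weights alone and two need about 64 GB, which exceeds a ~46GB card before any runtime buffers are counted.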

Reasoner: video & image understanding (vLLM)

just c3-setup-reason      # one-time: vllm==0.21.0 + vllm-cosmos3 (cu130)
just c3-serve-reason      # serve Cosmos3-Nano on :8000 (--max-model-len 32768)
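Before pointing an agent at the server, it can help to confirm that it is up. vLLM's OpenAI-compatible server answers `GET /v1/models`, so a minimal check might look like the sketch below. The helper names are hypothetical and not part of strands-cosmos, and the model id returned is whatever name the server was launched with.

```python
import json
import urllib.request


def parse_model_ids(payload: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON response."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def served_models(base_url: str = "http://localhost:8000/v1",
                  timeout: float = 5.0) -> list[str]:
    """Query the running server and return the ids of the models it serves.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

Calling `served_models()` while `just c3-serve-reason` is running should return a non-empty list; a connection error usually means the server is still loading or has been stopped.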

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel

agent = Agent(model=Cosmos3ReasonerModel(base_url="http://localhost:8000/v1"))

# Detailed captioning
agent("Caption in detail: <video>scene.mp4</video>")

# Temporal localization with timestamps
agent("List the notable events with approximate timestamps: <video>scene.mp4</video>")

# Embodied next-action with explicit reasoning
agent.model.update_config(reasoning=True)
agent("What is the most likely next action? <video>robot.mp4</video>")

Reasoner capabilities (each has a dedicated tool):

Tool	Task
`cosmos3_caption`	Detailed captioning
`cosmos3_temporal`	Event detection + timestamps
`cosmos3_embodied`	Next-action prediction
`cosmos3_ground`	2D bounding boxes (JSON)
`cosmos3_plausibility`	Physical plausibility label
`cosmos3_situation`	Situation understanding
`cosmos3_action_cot`	Trajectory / driving CoT

Generator: image, video & sound (Diffusers, in-process)

# Option A — pip extra (Diffusers + cosmos_guardrail + soundfile):
pip install "strands-cosmos[cosmos3-gen]"
# Cosmos3OmniPipeline needs the diffusers dev build:
pip install -U "git+https://github.com/huggingface/diffusers.git"

# Option B — justfile (dedicated CUDA-matched venv):
just c3-setup-gen   # diffusers(main) + cosmos_guardrail + soundfile (cu130)

from strands_cosmos import Cosmos3GeneratorModel
m = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")

# Text → image
m.generate(mode="text2image", prompt="A robot in a warehouse.",
           out_path="img.png", resolution="480")

# Text → video
m.generate(mode="text2video", prompt="A robot navigates a warehouse aisle.",
           out_path="vid.mp4", num_frames=49, fps=16, num_inference_steps=25)

# Image → video
m.generate(mode="image2video", prompt="It begins to move forward.",
           image="img.png", out_path="i2v.mp4")

# Text → video WITH SOUND (H264 video + AAC stereo 48kHz)
m.generate(mode="text2video-with-sound", prompt="A robot arm pours water.",
           out_path="av.mp4", enable_sound=True)

Sound is generated in-process by the omni pipeline: Cosmos3OmniPipelineOutput carries both the video frames and a stereo sound tensor, and strands-cosmos muxes them into the MP4 via ffmpeg. No vLLM-Omni server is required for image, video, or sound generation.
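For readers who want to reproduce the muxing step by hand, the following sketch builds an ffmpeg command that combines a silent video with a separate audio file into an MP4 with AAC stereo audio at 48kHz. It is not the command strands-cosmos actually runs, which is not shown here; the function names and file paths are hypothetical, and only standard ffmpeg options are used.

```python
import subprocess


def build_mux_command(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg argv that muxes one video stream and one audio stream."""
    return [
        "ffmpeg", "-y",              # overwrite the output if it exists
        "-i", video_path,            # input 0: video (e.g. H264 in MP4)
        "-i", audio_path,            # input 1: audio (e.g. a WAV file)
        "-map", "0:v:0",             # take video from input 0
        "-map", "1:a:0",             # take audio from input 1
        "-c:v", "copy",              # keep the video stream as-is
        "-c:a", "aac",               # encode audio as AAC
        "-ar", "48000",              # 48 kHz sample rate
        "-ac", "2",                  # stereo
        "-shortest",                 # stop at the shorter of the two streams
        out_path,
    ]


def mux(video_path: str, audio_path: str, out_path: str) -> None:
    """Run the mux; requires ffmpeg on PATH and raises if ffmpeg fails."""
    subprocess.run(build_mux_command(video_path, audio_path, out_path), check=True)
```

Keeping command construction separate from execution makes the argv easy to inspect or log before anything is written to disk.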

Action / World-Model (Cosmos Framework)

Action generation (forward/inverse dynamics, policy) runs through the native Cosmos Framework via torchrun.

just c3-setup-framework   # one-time: clone cosmos-framework + uv sync (cu130-train)

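Each run is described by one JSON object per line in a JSONL spec file. The example below is pretty-printed for readability; in the file it must occupy a single line: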

{
  "model_mode": "forward_dynamics",
  "name": "av_forward",
  "vision_path": ".../images/av_0.jpg",
  "action_path": ".../actions/av_traj_forward.json",
  "domain_name": "av",
  "action_chunk_size": 60,
  "fps": 10,
  "image_size": 480,
  "view_point": "ego_view",
  "prompt": "You are an autonomous vehicle planning system.",
  "seed": 0
}
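Because JSONL requires each record on its own line, it is usually safer to generate the file than to hand-edit it. The sketch below serializes the spec above with `json.dumps`, which emits no newlines by default. `write_jsonl` and `read_jsonl` are hypothetical helpers, and the elided `...` path prefixes are kept exactly as shown and must be replaced with real paths.

```python
import json

# Same fields as the spec above; the ".../" prefixes are placeholders.
spec = {
    "model_mode": "forward_dynamics",
    "name": "av_forward",
    "vision_path": ".../images/av_0.jpg",
    "action_path": ".../actions/av_traj_forward.json",
    "domain_name": "av",
    "action_chunk_size": 60,
    "fps": 10,
    "image_size": 480,
    "view_point": "ego_view",
    "prompt": "You are an autonomous vehicle planning system.",
    "seed": 0,
}


def write_jsonl(path: str, runs: list[dict]) -> None:
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for run in runs:
            fh.write(json.dumps(run) + "\n")


def read_jsonl(path: str) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

For example, `write_jsonl("fd_av.jsonl", [spec])` produces a one-line file that can be passed as `input_jsonl`; appending more dicts to the list adds one run per line.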

from strands_cosmos import cosmos3_forward_dynamics
cosmos3_forward_dynamics(input_jsonl="fd_av.jsonl", out="/tmp/c3_action_out")
# → /tmp/c3_action_out/av_forward/vision.mp4  (future-state rollout)

Tool	Task
`cosmos3_forward_dynamics`	start image + action chunk → future video
`cosmos3_inverse_dynamics`	video + instruction → predicted action chunk
`cosmos3_policy`	image + instruction → action chunk + rollout video

Embodiments include autonomous vehicle (av), DROID/Bridge/UMI single-arm robots, dual-arm robots, and humanoids; see the Cosmos cookbooks.

justfile reference

just c3-doctor             # environment check (GPU/CUDA/uv/venvs/disk)
just c3-setup-reason       # Reasoner env (vLLM + vllm-cosmos3)
just c3-setup-gen          # Generator env (Diffusers)
just c3-setup-framework    # Action env (Cosmos Framework)
just c3-serve-reason       # start Cosmos3-Nano reasoner server
just c3-serve-stop-reason  # stop it
just c3-serve-status       # server status
just c3-reason "<prompt>" "<image>" "<video>" "<task>"
just c3-gen <mode> "<prompt>" "<image>" <out>
just c3-action <input.jsonl> <out>
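The recipes can also be driven from Python with `subprocess`. The sketch below wraps only `just c3-action`, whose two positional arguments are listed above; `c3_action_argv` and `run_c3_action` are hypothetical helpers, and the call assumes `just` is on PATH and is run from the directory containing the justfile.

```python
import subprocess


def c3_action_argv(input_jsonl: str, out_dir: str) -> list[str]:
    """Build the argv for `just c3-action <input.jsonl> <out>`."""
    return ["just", "c3-action", input_jsonl, out_dir]


def run_c3_action(input_jsonl: str, out_dir: str) -> subprocess.CompletedProcess:
    """Run the recipe and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(c3_action_argv(input_jsonl, out_dir), check=True)
```

Passing arguments as a list avoids shell quoting issues when paths contain spaces.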