API Reference

Models

CosmosVisionModel

The primary model class — supports video, image, and text input.

from strands_cosmos import CosmosVisionModel

# All arguments are optional; the values shown are the defaults.
model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",  # str
    device_map="auto",                    # str
    torch_dtype="auto",                   # str
    reasoning=False,                      # bool
    fps=4,                                # int
    min_vision_tokens=256,                # int
    max_vision_tokens=8192,               # int
    params={},                            # dict
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | nvidia/Cosmos-Reason2-2B | HuggingFace model ID |
| device_map | str | auto | GPU device placement |
| torch_dtype | str | auto | Tensor dtype (float16/bfloat16) |
| reasoning | bool | False | Enable chain-of-thought <think> reasoning |
| fps | int | 4 | Video frame sampling rate |
| min_vision_tokens | int | 256 | Minimum visual tokens per frame |
| max_vision_tokens | int | 8192 | Maximum visual tokens per frame |
| params | dict | {} | Generation params: max_tokens, temperature, top_p |
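
As a minimal sketch of how the `params` dict might be assembled, the helper below keeps only the three generation keys named above and drops anything left unset. `make_generation_params` is a hypothetical name, not part of strands_cosmos, and the numeric values are illustrative rather than recommended settings.

```python
from typing import Any, Optional

# Keys documented for `params`; treating this list as exhaustive is an assumption.
GENERATION_KEYS = ("max_tokens", "temperature", "top_p")


def make_generation_params(
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build a `params` dict, omitting unset values."""
    candidates = {"max_tokens": max_tokens, "temperature": temperature, "top_p": top_p}
    return {key: candidates[key] for key in GENERATION_KEYS if candidates[key] is not None}


params = make_generation_params(max_tokens=1024, temperature=0.6)
print(params)

# With strands_cosmos installed, the dict would be passed straight through:
# model = CosmosVisionModel(reasoning=True, fps=4, params=params)
```

Leaving unset keys out of the dict, rather than passing None, lets the model fall back to whatever defaults it applies internally.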

CosmosModel

Text-only model — same interface but no vision capabilities.

from strands_cosmos import CosmosModel

model = CosmosModel(model_id="nvidia/Cosmos-Reason2-2B")

Cosmos3ReasonerModel

New in Cosmos 3. Omnimodal Reasoner (text + vision → text), served by a local vLLM server. Supported tasks: captioning, temporal localization, embodied next-action prediction, 2D grounding, physical plausibility, situation understanding, and action chain-of-thought.

from strands import Agent
from strands_cosmos import Cosmos3ReasonerModel

# All arguments are optional; the values shown are the defaults.
model = Cosmos3ReasonerModel(
    model_id="nvidia/Cosmos3-Nano",
    base_url="http://localhost:8000/v1",
    reasoning=False,           # explicit <think> reasoning
    max_tokens=4096,
    seed=0,                    # int or None
    media_io_kwargs=None,      # dict or None, e.g. {"video": {"fps": 4.0}}
    mm_processor_kwargs=None,  # dict or None, e.g. {"size": {"shortest_edge": 1568}}
)
agent = Agent(model=model)
agent("Caption in detail: <video>scene.mp4</video>")

Start the server first:

just c3-setup-reason      # one-time: vllm==0.21.0 + vllm-cosmos3 (cu130)
just c3-serve-reason      # serve Cosmos3-Nano on :8000 (--max-model-len 32768)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | nvidia/Cosmos3-Nano | Served model (auto-resolved from server) |
| base_url | str | http://localhost:8000/v1 | vLLM OpenAI endpoint |
| reasoning | bool | False | Append <think> format + use reasoning sampling preset |
| max_tokens | int | 4096 | Output token cap |
| media_io_kwargs | dict | None | Video frame sampling passthrough |
| mm_processor_kwargs | dict | None | Per-image resize bounds passthrough |

Inline media tags: <video>path-or-url</video>, <image>path-or-url</image>.
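
The tag format can be produced with ordinary string formatting. The sketch below is one way to do it; `media_tag` and `build_prompt` are hypothetical helpers, not part of strands_cosmos, and whether a single prompt may carry several tags is an assumption.

```python
def media_tag(kind: str, source: str) -> str:
    """Wrap a path or URL in the inline tag for its media kind."""
    if kind not in ("video", "image"):
        raise ValueError(f"unsupported media kind: {kind!r}")
    return f"<{kind}>{source}</{kind}>"


def build_prompt(instruction: str, *, videos=(), images=()) -> str:
    """Hypothetical helper: append media tags to a text instruction."""
    tags = [media_tag("video", v) for v in videos]
    tags += [media_tag("image", i) for i in images]
    return " ".join([instruction, *tags])


prompt = build_prompt("Caption in detail:", videos=["scene.mp4"])
print(prompt)  # Caption in detail: <video>scene.mp4</video>

# The resulting string is what would be handed to the agent, e.g. agent(prompt).
```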

Cosmos3GeneratorModel

New in Cosmos 3. Omnimodal Generator (text/image → image/video/sound), run in-process through the HuggingFace Diffusers Cosmos3OmniPipeline; no server is required.

from strands_cosmos import Cosmos3GeneratorModel

m = Cosmos3GeneratorModel(model_id="nvidia/Cosmos3-Nano")

m.generate(mode="text2image",  prompt="A robot in a warehouse.",
           out_path="img.png", resolution="480")
m.generate(mode="text2video",  prompt="A robot navigates a warehouse.",
           out_path="vid.mp4", num_frames=49, fps=16, num_inference_steps=25)
m.generate(mode="image2video", prompt="It starts moving.", image="img.png",
           out_path="i2v.mp4")
m.generate(mode="text2video-with-sound", prompt="A robot pours water.",
           out_path="av.mp4", enable_sound=True)   # H264 + AAC stereo 48kHz

Setup: just c3-setup-gen (diffusers main + cosmos_guardrail, cu130).

| generate() arg | Type | Default | Description |
| --- | --- | --- | --- |
| mode | str | text2video | text2image / text2video / image2video / text2video-with-sound |
| prompt | str | "" | Positive text prompt |
| out_path | str | /tmp/cosmos3_out.mp4 | Output file (.png for image) |
| image | str | None | Input image (image2video) |
| num_frames | int | 189 | Frame count (1 for image) |
| fps | int | 24 | Frames per second |
| resolution | str | 720 | 256 / 480 / 720 |
| num_inference_steps | int | 35 | Diffusion steps |
| guidance_scale | float | 6.0 | CFG scale |
| enable_sound | bool | False | Generate + mux stereo audio (AAC 48kHz) |
| seed | int | 0 | Reproducibility seed |
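
Because a generation run is slow, it can help to sanity-check arguments before calling generate(). The sketch below is a hypothetical pre-flight check built only from the values documented above; whether the library performs the same validation itself is not stated here. It also shows how num_frames and fps determine clip length.

```python
VALID_MODES = ("text2image", "text2video", "image2video", "text2video-with-sound")
VALID_RESOLUTIONS = ("256", "480", "720")


def check_generate_args(mode="text2video", resolution="720", image=None,
                        num_frames=189, fps=24):
    """Hypothetical pre-flight check; returns the clip length in seconds."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if resolution not in VALID_RESOLUTIONS:
        raise ValueError(f"unknown resolution {resolution!r}; expected one of {VALID_RESOLUTIONS}")
    if mode == "image2video" and image is None:
        raise ValueError("image2video needs an input image")
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be > 0")
    if mode == "text2image":
        return 0.0  # a still image has no duration
    return num_frames / fps


print(check_generate_args())                       # 7.875 (189 frames at 24 fps)
print(check_generate_args(num_frames=49, fps=16))  # 3.0625
```

With the defaults, a clip runs just under eight seconds; lowering num_frames is the most direct way to shorten it.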

Single-GPU note: the reasoner (vLLM) and the generator (Diffusers) each load a 16B model, so both will not fit together on one ~46GB GPU. Stop one before starting the other.
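
A rough back-of-the-envelope check makes the constraint concrete. The sketch assumes 16-bit weights (2 bytes per parameter), which is not stated above, and counts weights only; KV cache, activations, and framework overhead would add to the total.

```python
PARAMS_PER_MODEL = 16e9  # "16B" from the note above
BYTES_PER_PARAM = 2      # assumption: bf16/fp16 weights
GPU_MEMORY_GB = 46       # "~46GB" from the note above


def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


one_model = weights_gb(PARAMS_PER_MODEL, BYTES_PER_PARAM)  # 32.0
both_models = 2 * one_model                                # 64.0

print(one_model <= GPU_MEMORY_GB)    # True: one model fits
print(both_models <= GPU_MEMORY_GB)  # False: two do not
```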


Tools

All tools are @tool-decorated functions compatible with any Strands Agent.

Reason2 VLM

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_inference | prompt, image_path?, video_path?, server_url? | Query TRT-Edge-LLM inference server |
| cosmos_reason_hf | prompt, image_path?, video_path?, max_new_tokens?, model_id? | Direct HF Transformers inference (no server needed) |
| cosmos_serve | action (start/stop/status) | Manage TRT-Edge-LLM server lifecycle |

World Models

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_predict_generate | config_path | Generate future video frames with Predict2.5 |
| cosmos_transfer_generate | config_path | Video-to-video with Transfer2.5 (ControlNet) |

Model Lifecycle

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_model_download | name, local_dir?, kind? | Download model from HuggingFace |
| cosmos_quantize | model_dir, output_dir?, precision? | FP8/INT8 quantization |
| cosmos_export_onnx | model_dir, output_dir? | Export to ONNX format |
| cosmos_build_engine | onnx_dir, output_dir?, component? | Build TRT engine (LLM or visual) |

Training

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_post_train | config_path, method? | Post-training (SFT, LoRA, full) |
| cosmos_distill | config_path | Knowledge distillation (8B→2B) |

Data & Evaluation

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_curate | config_path | Run Xenna data curation pipeline |
| cosmos_evaluate | config_path, metrics? | Evaluate with FID/FVD/CSE/CLIP |

I/O & Media

| Tool | Parameters | Description |
| --- | --- | --- |
| rtp_capture_frame | port?, output_path? | Capture single frame from RTP/GStreamer stream |
| nats_publish | subject, payload | Publish JSON to NATS subject |
| video_probe | video_path | Get video metadata (resolution, fps, duration, codec) |
| video_extract_frames | video_path, output_dir, fps?, max_frames? | Extract frames as JPEGs |
| image_read | image_path | Read image as base64 string |

System

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_sysinfo | | GPU info, platform, memory, CUDA version |

Cosmos 3 — Reasoner (vLLM)

| Tool | Description |
| --- | --- |
| cosmos3_reason | Generic reasoner: prompt + image/video → text |
| cosmos3_caption | Detailed video/image captioning |
| cosmos3_temporal | Event detection + timestamps |
| cosmos3_embodied | Next-action prediction (robotics) |
| cosmos3_ground | 2D bounding-box grounding (JSON) |
| cosmos3_plausibility | Physical plausibility classification |
| cosmos3_situation | Situation understanding + next action |
| cosmos3_action_cot | Trajectory / driving chain-of-thought |

Cosmos 3 — Generator (Diffusers, in-proc)

| Tool | Description |
| --- | --- |
| cosmos3_text2image | Text → image (PNG) |
| cosmos3_text2video | Text → video (MP4) |
| cosmos3_image2video | Image + text → video |
| cosmos3_text2video_sound | Text → video + synchronized audio (AAC stereo 48kHz) |
| cosmos3_video2video | Re-render an input video with a new prompt (transfer; vLLM-Omni Docker) |

Cosmos 3 — Action / World-Model (Cosmos Framework)

| Tool | Description |
| --- | --- |
| cosmos3_forward_dynamics | Start image + action chunk → future video |
| cosmos3_inverse_dynamics | Video + instruction → predicted action chunk |
| cosmos3_policy | Image + instruction → action chunk + rollout video |

Cosmos 3 — Servers

| Tool | Description |
| --- | --- |
| cosmos3_serve | Start/stop/status local vLLM (reason) / vLLM-Omni (omni) servers |

Cosmos 3 — Post-Training (SFT)

Supervised fine-tuning via the Cosmos Framework (torchrun). Tested upstream on 8× H100.

| Tool | Description |
| --- | --- |
| cosmos3_train_recipes | List SFT recipes + launch shells |
| cosmos3_train_show | Validate/print a recipe's resolved config (dry run) |
| cosmos3_train_convert | Base checkpoint → PyTorch DCP |
| cosmos3_train_convert_vlm | LM → Qwen3-VL visual tower (reasoner VLM) |
| cosmos3_train_prep_dataset | Captions JSONL → SFT dataset JSONL |
| cosmos3_train | Run SFT via the paired launch shell |
| cosmos3_train_export | Trained DCP → HF safetensors |

See the Cosmos 3 Training guide for the full flow.

Legacy (backward-compatible)

| Tool | Parameters | Description |
| --- | --- | --- |
| cosmos_invoke | prompt, model_id? | Text-only inference tool |
| cosmos_vision_invoke | prompt, media_path?, model_id? | Vision inference tool |

Task Prompts

Pre-defined prompts optimized for specific tasks:

from strands_cosmos.cosmos_vision_model import TASK_PROMPTS

| Key | Use Case |
| --- | --- |
| caption | Detailed video/image captioning |
| embodied_reasoning | Robot workspace analysis |
| driving | Dashcam driving safety |
| causal | Physical cause-and-effect |
| temporal_localization | Event timestamps in video |
| 2d_grounding | Bounding box coordinates |
| robot_cot | Step-by-step robot planning |
| describe_anything | General scene description |
| mvp_bench | MVP benchmark evaluation |
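
A lookup with a clearer error message can be sketched as follows. It assumes TASK_PROMPTS behaves like a mapping from key to prompt string, which the Key column suggests but does not confirm. `get_task_prompt` is a hypothetical helper, and `demo_prompts` is stand-in data so the sketch runs without the library; the real prompt text will differ.

```python
from collections.abc import Mapping


def get_task_prompt(task_prompts: Mapping, key: str) -> str:
    """Hypothetical helper: fetch a prompt, listing valid keys on a miss."""
    try:
        return task_prompts[key]
    except KeyError:
        known = ", ".join(sorted(task_prompts))
        raise KeyError(f"unknown task {key!r}; available: {known}") from None


# Stand-in for strands_cosmos.cosmos_vision_model.TASK_PROMPTS (placeholder text).
demo_prompts = {
    "caption": "<caption prompt text>",
    "driving": "<driving prompt text>",
}

print(get_task_prompt(demo_prompts, "caption"))
```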

CLI

strands-cosmos-fix-cublas

Fix CUBLAS compatibility on NVIDIA Jetson devices.

strands-cosmos-fix-cublas           # Auto-detect and fix
strands-cosmos-fix-cublas --check   # Check status only
strands-cosmos-fix-cublas --revert  # Restore original

Justfile Recipes

Run just --list for all available recipes. Key ones:

just setup          # Clone all Cosmos ecosystem repos
just setup-full     # Full setup (apt + pip + repos + doctor)
just doctor         # Diagnose platform, tools, GPU
just install-trt-edge-llm  # Build TRT-Edge-LLM from source

just serve-start    # Start TRT inference server
just serve-stop     # Stop server
just predict-generate config.json
just transfer-generate config.json
just evaluate config.json

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| COSMOS_MODEL_ID | Default HF model | nvidia/Cosmos-Reason2-2B |
| COSMOS_SERVER_URL | TRT server endpoint | http://127.0.0.1:8080 |
| NATS_URL | NATS server URL | nats://127.0.0.1:4222 |
| RTP_PORT | RTP receive port | 5600 |
| HF_TOKEN | HuggingFace token for gated models | |
| COSMOS_PREDICT_REPO | Path to cosmos-predict2.5 clone | ../cosmos-predict2.5 |
| COSMOS_TRANSFER_REPO | Path to cosmos-transfer2.5 clone | ../cosmos-transfer2.5 |
| COSMOS_REASON_REPO | Path to cosmos-reason2 clone | ../cosmos-reason2 |
| COSMOS_XENNA_REPO | Path to cosmos-xenna clone | ../cosmos-xenna |
| COSMOS_COOKBOOK_REPO | Path to cosmos-cookbook clone | ../cosmos-cookbook |
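
Scripts that sit alongside the tools may want to read the same variables with the same fallbacks. The sketch below shows one way to do that with the standard library; `load_settings` is a hypothetical helper, and how strands_cosmos itself reads these variables is not specified here.

```python
import os

# Defaults copied from the list of variables above (HF_TOKEN has none).
DEFAULTS = {
    "COSMOS_MODEL_ID": "nvidia/Cosmos-Reason2-2B",
    "COSMOS_SERVER_URL": "http://127.0.0.1:8080",
    "NATS_URL": "nats://127.0.0.1:4222",
    "RTP_PORT": "5600",
}


def load_settings(env=None) -> dict:
    """Hypothetical helper: resolve settings from the environment."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["RTP_PORT"] = int(settings["RTP_PORT"])  # ports are numeric
    settings["HF_TOKEN"] = env.get("HF_TOKEN")        # None when unset
    return settings


settings = load_settings()
print(settings["COSMOS_SERVER_URL"], settings["RTP_PORT"])
```

Accepting an optional mapping instead of reading os.environ directly keeps the helper easy to exercise in tests.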