Skip to content

Cosmos 3 — Post-Training (SFT)

Fine-tune Cosmos 3 models on your own data via NVIDIA's cosmos-framework supervised fine-tuning (SFT) stack. strands-cosmos exposes the framework's training flow as cosmos3_train_* tools + c3-train-* justfile recipes (thin wrappers — the framework is the single source of truth).

Hardware

Full SFT is tested upstream on 8× H100 (80 GB). The convert, dataset-prep, and config-validation steps run on any GPU (or none), so you can wire and validate the whole pipeline locally; the actual training run needs the documented multi-GPU allocation.

Setup

just c3-setup-framework   # clone cosmos-framework -> ../cosmos/packages/cosmos3 + uv sync (cu130-train)
just c3-doctor            # confirms the training (SFT) env is present

Recipes

just c3-train-recipes     # list SFT recipes + paired launch shells
Recipe Surface Dataset Base checkpoint
vision_sft_nano Generator (T2V/I2V/V2V) bridge-v2-subset-synthetic-captions Cosmos3-Nano
vision_sft_super Generator (LoRA, 64B) bridge-v2-subset-synthetic-captions Cosmos3-Super
llava_ov Reasoner alignment LLaVA-OneVision (HF stream) Qwen3-VL-8B (fetched)
videophy2_nano Reasoner alignment VideoPhy-2 Cosmos3-Nano-VLM

The 4-step flow

1. Convert the base checkpoint → DCP

just c3-train-convert Cosmos3-Nano             # -> examples/checkpoints/Cosmos3-Nano (DCP)
# Reasoner VLM path instead:
just c3-train-convert-vlm Cosmos3-Nano         # -> examples/checkpoints/Cosmos3-Nano-VLM
from strands_cosmos import cosmos3_train_convert
cosmos3_train_convert(checkpoint="Cosmos3-Nano")

2. (optional) Prepare your dataset

Vision recipes expect a train/video_dataset_file.jsonl. Convert a captions JSONL into the SFT format:

just c3-train-prep-dataset captions.jsonl sft_dataset.jsonl
from strands_cosmos import cosmos3_train_prep_dataset
cosmos3_train_prep_dataset(captions="captions.jsonl", out="sft_dataset.jsonl")

3. Validate the config, then run SFT

Always dry-run first (no GPU) to confirm the resolved config:

just c3-train-show vision_sft_nano             # train.py --dryrun: prints the resolved config
from strands_cosmos import cosmos3_train_show, cosmos3_train
cosmos3_train_show(recipe="vision_sft_nano")

# Full run (8 GPUs). Use Hydra tail overrides for short smokes / hyperparams:
cosmos3_train(
    recipe="vision_sft_nano",
    nproc=8,
    dataset="examples/data/.../sft_dataset_bridge",   # optional override
    checkpoint="examples/checkpoints/Cosmos3-Nano",   # the DCP from step 1
    overrides="trainer.max_iter=200 optimizer.lr=1e-5",
)

Under the hood this calls the framework's paired launch shell, which runs:

torchrun --nproc_per_node=8 -m cosmos_framework.scripts.train \
    --sft-toml=examples/toml/sft_config/vision_sft_nano.toml \
    -- trainer.max_iter=200 optimizer.lr=1e-5

4. Export the trained checkpoint → HF safetensors

just c3-train-export outputs/train/cosmos3/sft/vision_sft_nano
from strands_cosmos import cosmos3_train_export
cosmos3_train_export(run_dir="outputs/train/cosmos3/sft/vision_sft_nano")

The exported safetensors can then be served back through Cosmos3ReasonerModel (point a vLLM server at it) or loaded by Cosmos3GeneratorModel.

Tools reference

Tool Purpose
cosmos3_train_recipes List SFT recipes + launch shells
cosmos3_train_show Validate/print a recipe's resolved config (dry run)
cosmos3_train_convert Base checkpoint → PyTorch DCP
cosmos3_train_convert_vlm LM → Qwen3-VL visual tower (reasoner VLM)
cosmos3_train_prep_dataset captions JSONL → SFT dataset JSONL
cosmos3_train Run SFT via the paired launch shell
cosmos3_train_export Trained DCP → HF safetensors

Tip: every recipe TOML defaults to job.wandb_mode = "disabled". Set it to "online" and export WANDB_API_KEY to log a run to Weights & Biases.

See the cosmos-framework training docs for dataset licensing, OOM tuning, and the full Hydra override reference.