
Cosmos-Predict 2.5 (World Model)

Cosmos-Predict 2.5 is NVIDIA's video world model. It generates future video frames conditioned on text, image, video, or action inputs.

Variants

Variant             Input                    Output              Use case
text2world          Text prompt              Video               Scene synthesis from scratch
video2world         Video + prompt           Video continuation  Forecasting
action_conditioned  Video + action sequence  Video               Trajectory simulation, robotics
multiview           Multi-cam rig            Multi-view video    3D-consistent generation
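
The table above can be read as a lookup from available inputs to usable variants. The following is a minimal sketch of that lookup, not part of the Cosmos-Predict codebase: `REQUIRED_INPUTS` and `pick_variants` are hypothetical names, and the keys `actions` and `multiview_inputs` are placeholders chosen for illustration (`prompt` and `input_video` mirror the agent tool's arguments).

```python
# Hypothetical helper: the required-input sets are read off the variants
# table and are NOT taken from the Cosmos-Predict source code.
REQUIRED_INPUTS = {
    "text2world": {"prompt"},
    "video2world": {"prompt", "input_video"},
    "action_conditioned": {"input_video", "actions"},  # "actions" is a placeholder key
    "multiview": {"multiview_inputs"},                  # placeholder key
}


def pick_variants(available):
    """Return, in sorted order, the variants whose required inputs are all present."""
    have = set(available)
    return sorted(name for name, need in REQUIRED_INPUTS.items() if need <= have)


print(pick_variants({"prompt", "input_video"}))  # ['text2world', 'video2world']
```

A seed video plus a prompt satisfies both `text2world` (which would ignore the video) and `video2world`, so the caller still has to choose which behaviour it wants.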

Agent tool

cosmos_predict_generate(
    prompt="a robot arm placing a cup on a table",
    output_dir="./outputs/predict2_5",
    input_video="./seed.mp4",         # for video2world
    num_frames=121,
    height=720,
    width=1280,
    fps=24,
    guidance_scale=7.0,
    num_steps=35,
    seed=0,
    model_variant="video2world",
    checkpoint="./ckpts/my-finetune",  # optional
)
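
As a quick sanity check on the values above, clip length follows from the frame count and frame rate. This sketch assumes `fps` is the playback rate of the written video, which the tool's signature suggests but does not state; `clip_duration_seconds` is a hypothetical helper.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Duration of a clip, assuming fps is the playback rate of the output."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# 121 frames at 24 fps is just over five seconds of video.
print(round(clip_duration_seconds(121, 24), 2))  # 5.04
```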

The tool writes parameters to a temp JSON and invokes:

just predict-generate /tmp/predict_XXX.json
# → cd $COSMOS_PREDICT_REPO
# → just run python examples/inference.py -i /tmp/predict_XXX.json
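
The hand-off described above (parameters to a temp JSON, then a `just` recipe) can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual implementation: `write_params_and_build_command` is a hypothetical name, and the JSON keys simply mirror the tool's keyword arguments, since the schema that `examples/inference.py` expects is not given here.

```python
import json
import tempfile


def write_params_and_build_command(params):
    """Write params to a temp JSON file and return (path, argv for the just recipe)."""
    # Assumption: unset optional values (e.g. checkpoint=None) are simply omitted.
    payload = {key: value for key, value in params.items() if value is not None}
    with tempfile.NamedTemporaryFile(
        mode="w", prefix="predict_", suffix=".json", delete=False
    ) as handle:
        json.dump(payload, handle, indent=2)
        path = handle.name
    return path, ["just", "predict-generate", path]


path, cmd = write_params_and_build_command(
    {
        "prompt": "a robot arm placing a cup on a table",
        "input_video": "./seed.mp4",
        "model_variant": "video2world",
        "checkpoint": None,  # optional, dropped from the JSON
    }
)
print(cmd)
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then invoke the recipe. The file is created with `delete=False` so that it still exists when the child process reads it.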

CLI

# Using the repo's own examples
just predict-generate inputs/my_video2world.json

Typical workflow

just download predict2.5-2b        # grab the base checkpoint
just post-train-predict configs/gr00t.yaml 8  # optional fine-tune
just predict-generate inputs/scene.json
just evaluate fvd ./outputs/predict2_5 ./ground-truth
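
To script the four steps above, one option is a small runner that stops at the first failing command. This is a sketch only: `WORKFLOW` and `run_workflow` are hypothetical names, the command strings are copied from the workflow as written, and `dry_run` defaults to `True` so nothing is executed unless requested.

```python
import shlex
import subprocess

# Commands copied verbatim from the workflow above.
WORKFLOW = [
    "just download predict2.5-2b",
    "just post-train-predict configs/gr00t.yaml 8",
    "just predict-generate inputs/scene.json",
    "just evaluate fvd ./outputs/predict2_5 ./ground-truth",
]


def run_workflow(steps, dry_run=True):
    """Run each step in order; with dry_run=True, only print what would run."""
    commands = [shlex.split(step) for step in steps]
    for argv in commands:
        if dry_run:
            print("would run:", shlex.join(argv))
        else:
            # check=True raises CalledProcessError on a non-zero exit,
            # so later steps never run against a failed earlier step.
            subprocess.run(argv, check=True)
    return commands


run_workflow(WORKFLOW)
```

Dropping the second entry from `WORKFLOW` skips the optional fine-tune.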

Distillation (fewer steps)

cosmos_distill(
    teacher_checkpoint="./ckpts/predict2.5-teacher",
    student_output="./ckpts/predict2.5-4step",
    method="dmd2",                    # "kd" or "dmd2"
    model_family="predict2_5",
    num_gpus=8,
)

References