Robot world models

This is why strands-diffusers exists. A world model doesn't just paint a believable future: it also predicts the robot actions that produce it. NVIDIA Cosmos is the headline: one call returns a future video, an optional soundtrack, and a normalized action tensor that your controller can run after un-normalization.

Every clip below is a real nvidia/Cosmos3-Nano rollout (bf16/cuda) from a single use_diffusers call — not a placeholder.

frame + prompt -> world model -> world video + action chunk

Action-policy rollouts

Same robot, same starting observation — different task prompt, different imagined future and different predicted actions. This is the policy mode: condition on a first observation, give it a task, get back the world it would create and the action chunk (1, 16, 10) that gets there.

"Put the pot to the left of the purple item."
"Pick up the cloth and place it in the bowl."
"Move the gripper toward the metal pan and grasp the handle."
"Open the drawer and place the spoon inside."
"Wipe the surface with the cloth in a circular motion."
text-to-world (no action conditioning)

from strands_diffusers import use_diffusers

# 1) build an action condition from an observation video (low-level `call`)
use_diffusers(action="call", target="CosmosActionCondition",
              parameters={"mode": "policy", "chunk_size": 16,
                          "domain_name": "bridge_orig_lerobot",
                          "resolution_tier": 480, "video": "robot.mp4",
                          "view_point": "ego_view"},
              cache_key="cond")

# 2) run the world-foundation pipeline, threading the cached condition in
use_diffusers(action="run", pipeline="Cosmos3OmniPipeline",
              model="nvidia/Cosmos3-Nano",
              parameters={"prompt": "Put the pot to the left of the purple item.",
                          "action": "cached:cond", "fps": 5,
                          "num_inference_steps": 30, "guidance_scale": 1.0},
              dtype="bfloat16", device="cuda")

Each call above returned a world video of shape (17, 480, 640, 3) and an action chunk of shape (1, 16, 10), with action values normalized to [-1, 1].
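The shapes reported above can be read as (frames, height, width, channels) for the video and (num_chunks, T, action_dim) for the actions. The sketch below is a minimal illustration of that reading. It assumes one action per transition between consecutive frames, which would explain 17 frames against 16 actions but is not stated on this page. `describe_rollout` is a hypothetical helper, not part of strands-diffusers.

```python
def describe_rollout(video_shape, action_shape):
    """Summarize a rollout from its reported shapes (hypothetical helper)."""
    frames, height, width, channels = video_shape
    num_chunks, steps, action_dim = action_shape
    return {
        "frames": frames,
        "resolution": (height, width),
        "channels": channels,
        "num_chunks": num_chunks,
        "steps_per_chunk": steps,
        "action_dim": action_dim,
        # Assumption: each action links one frame to the next.
        "one_action_per_transition": frames - 1 == num_chunks * steps,
    }


summary = describe_rollout((17, 480, 640, 3), (1, 16, 10))
print(summary)
```

A check like this is cheap to run before handing a chunk to a controller, since a shape mismatch usually points to a wrong `chunk_size` or a truncated video.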

The headline: world + action from one call

Cosmos world rollout — The **world video** Cosmos predicts.

Cosmos action chunk — The **robot action chunk** that produces it, rendered straight from the predicted tensor.

Three action modes

Cosmos exposes the full physical-AI loop through one CosmosActionCondition — the same object, three directions through the world model:

the three Cosmos modes side by side: policy, forward dynamics, inverse dynamics

flowchart LR
    subgraph In["conditioning"]
        F["first frame"]
        P["task prompt"]
        RA["raw actions"]
        OV["observed video"]
    end

    F --> Policy
    P --> Policy
    F --> FD
    RA --> FD
    OV --> ID

    subgraph Modes["CosmosActionCondition"]
        Policy["policy"]
        FD["forward_dynamics"]
        ID["inverse_dynamics"]
    end

    Policy --> WV["world video<br/><small>.mp4</small>"]
    Policy --> AC["action chunk<br/><small>.json · [-1,1]</small>"]
    FD --> WV
    ID --> AC

| mode | conditioning | predicts | use it for |
| --- | --- | --- | --- |
| `policy` | first frame + task prompt | future video + actions | "what should the robot do?" |
| `forward_dynamics` | first frame + given `raw_actions` | future video | "what happens if I run these actions?" |
| `inverse_dynamics` | an observed video | the actions between frames | "what actions produced this?" |
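The table above can be turned into a small pre-flight check on a `parameters` dict. The sketch below is only an illustration: the key names per mode are taken from the examples on this page, not from a schema published by the library, and `missing_keys` is a hypothetical helper.

```python
# Conditioning keys each mode uses in the examples on this page.
# Assumption: drawn from those examples, not from an official schema.
EXAMPLE_KEYS = {
    "policy": {"video"},
    "forward_dynamics": {"image", "raw_actions"},
    "inverse_dynamics": {"video"},
}


def missing_keys(parameters):
    """Return the conditioning keys absent from a parameters dict, sorted."""
    mode = parameters.get("mode")
    if mode not in EXAMPLE_KEYS:
        raise ValueError(f"unknown mode: {mode!r}")
    return sorted(EXAMPLE_KEYS[mode] - set(parameters))


fd_params = {"mode": "forward_dynamics", "chunk_size": 16,
             "domain_name": "agibotworld", "resolution_tier": 480,
             "image": "first_frame.png"}
print(missing_keys(fd_params))  # ['raw_actions']
```

Running the check before building the condition surfaces a forgotten input without spending GPU time on a failed call.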

# forward dynamics: roll the world forward from actions you already have
# (`chunks` holds your own raw actions and must be defined before this call)
use_diffusers(action="call", target="CosmosActionCondition",
              parameters={"mode": "forward_dynamics", "chunk_size": 16,
                          "domain_name": "agibotworld", "resolution_tier": 480,
                          "image": "first_frame.png", "raw_actions": chunks},
              cache_key="fd")

# inverse dynamics: recover the actions from a video you observed
use_diffusers(action="call", target="CosmosActionCondition",
              parameters={"mode": "inverse_dynamics", "chunk_size": 16,
                          "domain_name": "bridge_orig_lerobot",
                          "video": "observed.mp4"}, cache_key="id")

Inverse dynamics, seen

Feed an observed robot video; Cosmos reconstructs the world and infers the action chunk that connects the frames. Observed input (left) → model rollout (right):

inferred av0 — reconstructed world + inferred actions

See the motion

use_diffusers(action="visualize", ...) turns any action chunk into plots and an animation so you can read the trajectory before you ever touch hardware:

time-series (every dim, gripper highlighted)	end-effector path (dims 0–2)

use_diffusers(action="visualize", inputs=action_chunk, parameters={"fps": 8})
# -> timeseries.png, trajectory.png, animation.mp4
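To make the "end-effector path (dims 0–2)" panel concrete, here is a rough sketch of how such a path could be derived by hand from one chunk. It assumes dims 0-2 are per-step position deltas in normalized units, which this page does not state, and it is not a description of how `visualize` works internally. `end_effector_path` is a hypothetical helper and the chunk values are made up.

```python
from itertools import accumulate


def end_effector_path(chunk, start=(0.0, 0.0, 0.0)):
    """Accumulate dims 0-2 of a (T, action_dim) chunk into a 3D path.

    Assumption: dims 0-2 are per-step position deltas in normalized units.
    """
    deltas = [tuple(step[:3]) for step in chunk]

    def add(point, delta):
        return tuple(p + d for p, d in zip(point, delta))

    # `initial` keeps the starting point, so the path has T + 1 entries.
    return list(accumulate(deltas, add, initial=tuple(start)))


# Synthetic 4-step chunk with 10 dims per step (made-up values).
chunk = [[0.5, 0.0, 0.25] + [0.0] * 7 for _ in range(4)]
path = end_effector_path(chunk)
print(len(path), path[-1])  # 5 (2.0, 0.0, 1.0)
```

If your embodiment uses absolute poses rather than deltas, the first three dims already are the path and no accumulation is needed.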

What comes back

A Cosmos3OmniPipelineOutput carries three fields, and the serializer writes each to its own artifact instead of an opaque repr string:

| field | artifact | notes |
| --- | --- | --- |
| `video` | `.mp4` | future world, via `export_to_video` |
| `sound` | `.wav` | optional audio track |
| `action` | `.json` | normalized `[-1, 1]`, shape `[num_chunks, T, action_dim]` |

{
  "type": "action",
  "chunk_shape": [16, 10],
  "num_chunks": 1,
  "path": "/tmp/strands_diffusers/action_*.json"
}

The .json is the model-normalized chunk (values in [-1, 1]). Feed it straight to your embodiment's un-normalizer and controller. Values are preserved exactly (no lossy clipping); bf16 tensors are upcast to f32 before serialization.
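As an illustration of the un-normalization step, the sketch below maps a chunk from [-1, 1] to per-dimension limits with a linear min-max rule. Several things here are assumptions: that the file holds a nested list shaped `[num_chunks, T, action_dim]`, that your embodiment uses min-max scaling (some use mean/std or quantile statistics instead), and the toy limits themselves. `unnormalize` is a hypothetical helper, and an inline JSON string stands in for the real file.

```python
import json


def unnormalize(chunk, low, high):
    """Map each value from [-1, 1] to [low[d], high[d]] per action dim."""
    return [
        [lo + (v + 1.0) * 0.5 * (hi - lo) for v, lo, hi in zip(step, low, high)]
        for step in chunk
    ]


# Assumption: the file holds a nested list shaped [num_chunks, T, action_dim].
# A tiny inline payload (1 chunk, 2 steps, 3 dims) stands in for the real file.
payload = json.loads("[[[-1.0, 0.0, 1.0], [0.5, -0.5, 0.0]]]")

# Hypothetical per-dimension limits for a 3-dim toy embodiment.
low, high = [0.0, -2.0, 10.0], [1.0, 2.0, 20.0]
commands = [unnormalize(chunk, low, high) for chunk in payload]
print(commands[0])  # [[0.0, 0.0, 20.0], [0.75, -1.0, 15.0]]
```

In practice the limits should come from the same dataset statistics the model was trained with for the chosen `domain_name`, not from hand-picked values.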

The WFM family

use_diffusers(action="wfm") lists every world-foundation / action-capable pipeline diffusers ships — discovered at runtime, never hardcoded:

use_diffusers(action="wfm")
# ['Cosmos2TextToImagePipeline', 'Cosmos2VideoToWorldPipeline',
#  'CosmosTextToWorldPipeline', 'CosmosVideoToWorldPipeline',
#  'HunyuanVideoPipeline', 'WanImageToVideoPipeline', 'WanPipeline', ...]

27 world-foundation pipelines at the time of writing; the list grows automatically as diffusers adds new architectures.

Reproduce these

The rollout gallery above is generated by a script that runs the model on a real GPU:

pip install 'git+https://github.com/huggingface/diffusers' --no-deps --target /tmp/dmain
PYTHONPATH=/tmp/dmain python examples/generate_wfm_rollouts.py

See also examples/cosmos_action_policy.py for the single-rollout walkthrough with a graceful CPU/no-GPU fallback.