
Robotics / Vision-Language-Action (VLA)

Robotics agents stack two jobs: reason about a scene (where is the cube, what to do first) and act (emit robot actions such as joint commands). strands-transformers covers both, and both paths have been run and verified on a GPU.

flowchart LR
    CAM["📷 camera + 🗣️ instruction"]
    CAM --> REASON["🧠 reason<br/>Cosmos-Reason2<br/><i>image-text-to-text</i>"]
    CAM --> ST["🦾 + robot state"]
    ST --> ACT["⚙️ act<br/>MolmoAct · OpenVLA<br/><i>predict_action()</i>"]
    REASON -->|plan| ACT
    ACT --> OUT["🤖 robot actions<br/>MolmoAct [1,30,6] · OpenVLA 7-DoF"]

    classDef in fill:#7C5CFF22,stroke:#7C5CFF,stroke-width:1.5px,color:#7C5CFF;
    classDef mid fill:#22D3EE1f,stroke:#22D3EE,stroke-width:1.5px,color:#0F91A6;
    classDef out fill:#8B8B9414,stroke:#8B8B9466,stroke-width:1.5px,color:#6B6B76;
    class CAM,ST in;
    class REASON,ACT mid;
    class OUT out;

See the full loop

The examples/robot_reason_act_agent.py example wires it end-to-end: Cosmos-Reason plans, MolmoAct acts - on real camera images.

Model landscape (today)

| Model | Kind | Loads via | Layer |
| --- | --- | --- | --- |
| nvidia/Cosmos-Reason2-2B | physical-AI reasoning VLM | run (image-text-to-text) | 🧠 reason |
| allenai/MolmoAct2-SO100_101 | VLA (predict_action) | call | ⚙️ act |
| openvla/openvla-7b | VLA (predict_action) | call | ⚙️ act |

Reasoning vs action models

Reasoning models (Cosmos-Reason) are standard image-text-to-text VLMs: they describe the scene and produce a plan, and they run through the high-level run path. Action (VLA) models expose a custom predict_action method that emits raw robot actions, so they go through the low-level call path. A good agent uses the reasoner to plan, then a VLA to execute.
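
The plan-then-execute split described above can be sketched as a small function that takes the two models as plain callables. This is only an illustration of the control flow: the `reasoner` and `actor` signatures are hypothetical stand-ins, not part of strands-transformers, and the demo uses fake callables so it runs without a GPU.

```python
from typing import Any, Callable, Sequence

# Hypothetical signatures, chosen for this sketch only.
Reasoner = Callable[[Any, str], str]                  # (image, instruction) -> plan text
Actor = Callable[[Any, str, Sequence[float]], list]   # (image, plan, state) -> actions


def reason_then_act(image, instruction, state, reasoner: Reasoner, actor: Actor):
    """Ask the reasoning model for a plan, then pass that plan to the action model."""
    plan = reasoner(image, instruction)   # would wrap the high-level "run" path
    actions = actor(image, plan, state)   # would wrap the low-level "call" path
    return plan, actions


if __name__ == "__main__":
    # Fake models so the sketch is runnable anywhere.
    fake_reasoner = lambda img, text: "move to the red cube in the lower left"
    fake_actor = lambda img, plan, st: [[0.0] * len(st)]
    plan, actions = reason_then_act(None, "pick up the cube", [0.0] * 6,
                                    fake_reasoner, fake_actor)
    print(plan)
    print(actions)
```

In a real agent, each callable would wrap a use_transformers call like the ones shown in the sections below.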

🧠 Reason - Cosmos-Reason2 (examples/cosmos_reason_embodied.py)

Cosmos-Reason2 is NVIDIA's physical-AI VLM (Qwen3VLForConditionalGeneration). It is a standard image-text-to-text model, so the high-level run path works without special handling.

# /// script
# requires-python = ">=3.10"
# dependencies = ["strands-transformers[vision]", "numpy"]
# ///
import numpy as np
from PIL import Image
from strands_transformers import use_transformers

scene = np.full((256, 256, 3), 180, dtype=np.uint8)
scene[150:210, 40:100] = (200, 30, 30)            # red cube, lower-left
img = Image.fromarray(scene)

r = use_transformers(action="run", task="image-text-to-text",
                     model="nvidia/Cosmos-Reason2-2B",
                     inputs={"text": [{"role": "user", "content": [
                         {"type": "image", "image": img},
                         {"type": "text", "text": "Where is the red cube and what should the arm do first?"},
                     ]}]},
                     parameters={"max_new_tokens": 96, "do_sample": False})
print(r["content"][0]["text"])

$ uv run reason.py
The red cube is in the bottom left corner of the image, so the robot arm
should move to that location first.

| Input | Real output |
| --- | --- |
| 🟥 red cube lower-left + "where is it, what first?" | "The red cube is in the bottom left corner of the image, so the robot arm should move to that location first." |

⚙️ Act - VLA models (examples/molmoact_vla.py, examples/openvla_vla.py)

VLA models take camera images + an instruction (+ robot state) and emit robot actions via a custom predict_action, driven through the call layer:

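from strands_transformers import use_transformers

REPO = "allenai/MolmoAct2-SO100_101"
# top, side: camera images; joint_state: the robot's current state.
# These come from your own robot interface and are not defined here.

# 1) load processor + model once, cache them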
use_transformers(action="call", target="AutoProcessor.from_pretrained",
                 parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True},
                 cache_key="proc")
use_transformers(action="call", target="AutoModelForImageTextToText.from_pretrained",
                 parameters={"pretrained_model_name_or_path": REPO, "trust_remote_code": True,
                             "dtype": "bfloat16", "device_map": "cuda"}, cache_key="vla")

# 2) call the model's own predict_action with the cached processor
use_transformers(action="call", target="cached:vla.predict_action",
                 parameters={"processor": "cached:proc", "images": [top, side],
                             "state": joint_state, "norm_tag": "so100_so101_molmoact2"})
# → MolmoAct2ActionOutput.actions, shape [1, 30, 6]

| Model | Output |
| --- | --- |
| MolmoAct2-SO100_101 | continuous actions [1, 30, 6] |
| OpenVLA-7b | 7-DoF action vector |
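
To make the two output shapes concrete, here is a minimal sketch of how a caller might consume them. It rests on assumptions not stated above: that the [1, 30, 6] axes mean (batch, time step, action dimension), and that the 7-DoF vector is ordered as three position deltas, three rotation deltas, and one gripper value. The helper names are hypothetical, and the inputs are plain nested lists (for example, what `.tolist()` returns on a tensor or array).

```python
def iter_action_chunk(actions):
    """Yield one action vector per time step from a nested [1, T, D] list.

    Assumption: axis 0 is batch, axis 1 is time, axis 2 is the action dimension.
    """
    if len(actions) != 1:
        raise ValueError(f"expected batch size 1, got {len(actions)}")
    for step in actions[0]:
        yield list(step)


def split_7dof(action):
    """Split a 7-value vector into (position delta, rotation delta, gripper).

    Assumption: the ordering is [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    if len(action) != 7:
        raise ValueError(f"expected 7 values, got {len(action)}")
    return list(action[:3]), list(action[3:6]), action[6]


if __name__ == "__main__":
    # Dummy chunk with the MolmoAct2 shape: 1 batch, 30 steps, 6 values each.
    chunk = [[[0.01 * t] * 6 for t in range(30)]]
    steps = list(iter_action_chunk(chunk))
    print(len(steps), len(steps[0]))

    position, rotation, gripper = split_7dof([0.1, 0.0, -0.1, 0.0, 0.0, 0.2, 1.0])
    print(position, rotation, gripper)
```

Check each model's own documentation for the actual axis meaning, units, and normalization before sending any values to hardware.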

Ergonomic helpers

  • cached:key[.attr] resolves to live cached objects, including inside parameters (so processor="cached:proc" works).
  • A "**" parameter key unpacks a cached mapping into keyword arguments, mirroring the idiomatic model.predict_action(**processor(prompt, image)).
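
The semantics of these two helpers can be illustrated with a toy resolver. This is not strands-transformers' implementation, only one plausible reading of the behavior described above. `CACHE`, `resolve`, and `build_kwargs` are hypothetical names, and the rule that explicit keys override unpacked ones is an assumption.

```python
from functools import reduce

CACHE = {}  # hypothetical in-memory store, keyed by cache_key


def resolve(value):
    """Replace 'cached:key[.attr...]' strings with live objects, recursing into containers."""
    if isinstance(value, str) and value.startswith("cached:"):
        key, *attrs = value[len("cached:"):].split(".")
        # Follow the optional attribute chain starting from the cached object.
        return reduce(getattr, attrs, CACHE[key])
    if isinstance(value, list):
        return [resolve(v) for v in value]
    if isinstance(value, dict):
        return {k: resolve(v) for k, v in value.items()}
    return value


def build_kwargs(parameters):
    """Resolve references, then merge the mapping under '**' into the kwargs."""
    resolved = resolve(parameters)
    unpacked = resolved.pop("**", {})
    # Assumption: explicitly named parameters win over unpacked ones.
    return {**unpacked, **resolved}


if __name__ == "__main__":
    class FakeProcessor:
        pad_token = "<pad>"

    CACHE["proc"] = FakeProcessor()
    CACHE["inputs"] = {"input_ids": [1, 2, 3]}
    kwargs = build_kwargs({"processor": "cached:proc",
                           "pad": "cached:proc.pad_token",
                           "**": "cached:inputs",
                           "state": [0.0]})
    print(sorted(kwargs))
```

Resolving references recursively is what lets a string such as "cached:proc" appear anywhere inside parameters, including inside nested lists.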

Beyond transformers (lerobot ecosystem)

Some popular robot policies - SmolVLA, π0 / π0.5, ACT, Diffusion Policy, NVIDIA GR00T-N1.7 - ship as lerobot or Isaac-GR00T checkpoints with their own runtimes, not transformers AutoModel classes. They're out of scope for use_transformers (which wraps transformers), but pair naturally with use_lerobot in the same agent. For transformers-native robotics today, use the three models above.

For models written against older transformers, see Legacy model compatibility.