Embodied Reasoning — Robot Vision¶

This example predicts a robot's next action from a workspace image, using chain-of-thought reasoning to justify the choice.

Terminal Recording¶

Embodied robot reasoning demo

$ python examples/04_embodied_reasoning.py
=== 04: Embodied Reasoning ===
Loading nvidia/Cosmos-Reason2-2B (vision, reasoning=True)... ✅ loaded
Processing image: sample.png

Agent:
<think>
I see a bimanual robot workspace from a top-down camera view.

In the workspace I can identify:
- A red cube near the center of the table
- A blue bin on the right side
- The robot's left gripper is open and positioned
  approximately 10cm above the red cube
- The right gripper is in a neutral position

Given the spatial layout, the most logical next action
is to lower the left gripper to grasp the red cube.
</think>

The robot should lower the left gripper to grasp the red cube.
Steps: descend → close gripper → lift → move to blue bin → release.

Time: 43.1s
=== PASS ===

Play locally: asciinema play docs/assets/casts/04_embodied_reasoning.cast
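
The recorded output wraps the reasoning trace in `<think>` tags and follows it with the final answer. A minimal sketch of separating the two, and of pulling the arrow-separated steps out of the answer, is shown below. It assumes a single `<think>...</think>` pair and a `Steps:` line shaped like the one in this recording; model output is free-form, so neither is guaranteed. The helper names `split_reasoning` and `parse_steps` are hypothetical and not part of strands-cosmos.

```python
import re

# Assumption: the reasoning trace sits in one <think>...</think> pair,
# as in the recorded output above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think> block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def parse_steps(answer: str) -> list[str]:
    """Extract steps from a 'Steps: a → b → c.' line; return [] if there is none."""
    for line in answer.splitlines():
        if line.startswith("Steps:"):
            body = line[len("Steps:"):].strip().rstrip(".")
            return [step.strip() for step in body.split("→")]
    return []


sample = """<think>
The left gripper is open and positioned above the red cube.
</think>

The robot should lower the left gripper to grasp the red cube.
Steps: descend → close gripper → lift → move to blue bin → release."""

reasoning, answer = split_reasoning(sample)
steps = parse_steps(answer)
print(steps)
```

Keeping the reasoning separate lets a caller log it for debugging while acting only on the final answer.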

Code¶

examples/04_embodied_reasoning.py

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    reasoning=True,
    params={"max_tokens": 2048, "temperature": 0.6},
)
agent = Agent(model=model)

result = agent(
    "<image>sample.png</image> "
    "What can be the next immediate action?"
)
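
The call above embeds the image path in the prompt with an `<image>...</image>` tag. If prompts are built in several places, a small helper can keep that convention in one spot and fail early on a missing file. This is a sketch only: `build_image_prompt` and its `must_exist` flag are hypothetical names, not part of strands-cosmos.

```python
from pathlib import Path


def build_image_prompt(image_path: str, question: str, must_exist: bool = True) -> str:
    """Wrap an image path in the <image> tag used above and append the question."""
    path = Path(image_path)
    if must_exist and not path.is_file():
        # Fail before loading the model rather than after.
        raise FileNotFoundError(f"image not found: {path}")
    return f"<image>{path}</image> {question}"


# must_exist=False so this runs without sample.png on disk.
prompt = build_image_prompt(
    "sample.png",
    "What can be the next immediate action?",
    must_exist=False,
)
print(prompt)
```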

Robot Vision Pipeline¶

graph TD
    CAM["📷 Robot Camera"] --> IMG["Workspace Image"]
    IMG --> VT["Visual Tokenizer<br/>Object detection"]
    VT --> COT["&lt;think&gt;<br/>Spatial reasoning<br/>Object identification<br/>Gripper state analysis<br/>&lt;/think&gt;"]
    COT --> ACT["🤖 Next Action<br/>grasp / move / release"]

    style CAM fill:#264653,color:#fff
    style COT fill:#fff3cd,stroke:#ffc107,color:#333
    style ACT fill:#76b900,color:#fff

Built-in Task Prompts for Robotics¶

from strands_cosmos.cosmos_vision_model import TASK_PROMPTS

# Next-action prediction
TASK_PROMPTS["embodied_reasoning"]
# → "What can be the next immediate action?"

# Step-by-step robot planning with trajectory
TASK_PROMPTS["robot_cot"]
# → 'You are given the task "{task_instruction}". Specify
#    the 2D trajectory your end effector should follow...'

# 2D object grounding
TASK_PROMPTS["2d_grounding"]
# → 'Locate the bounding box of {object_name}. Return a json.'
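
Two of the prompts above contain placeholders in braces (`{task_instruction}`, `{object_name}`). One way to fill them, assuming they are ordinary `str.format` fields, is sketched below. The template string is copied from the listing above so the example runs without the library installed; `fill_prompt` is a hypothetical helper.

```python
# Local copy of the one template the listing above shows in full.
GROUNDING_TEMPLATE = "Locate the bounding box of {object_name}. Return a json."


def fill_prompt(template: str, **fields: str) -> str:
    """Fill str.format-style placeholders, with a clearer error for a missing field."""
    try:
        return template.format(**fields)
    except KeyError as exc:
        raise ValueError(f"missing template field: {exc.args[0]}") from None


prompt = fill_prompt(GROUNDING_TEMPLATE, object_name="the red cube")
print(prompt)
```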

Capabilities¶

| Task | Description | Example |
| --- | --- | --- |
| Next-action | What should the robot do right now? | Grasp, move, release |
| Spatial reasoning | Where are objects relative to the gripper? | "10cm above the cube" |
| Trajectory planning | 2D pixel path for end effector | JSON coordinates |
| Object grounding | Bounding boxes for named objects | `{"x": 120, "y": 80, "w": 40, "h": 40}` |
| Scene understanding | Full workspace description | Objects, tools, layout |
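
The object-grounding example above is a JSON object with `x`, `y`, `w`, and `h` keys. A sketch of turning such a reply into a typed box follows. It assumes `x` and `y` are the top-left corner in pixels, which the example does not state; check the coordinate convention of the actual model output before relying on it. `BoundingBox` and `parse_bbox` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Assumption: (x, y) is the top-left corner, w and h are in pixels.
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self) -> tuple[float, float]:
        """Center point, e.g. as a candidate grasp target in image space."""
        return (self.x + self.w / 2, self.y + self.h / 2)

    @property
    def corners(self) -> tuple[float, float, float, float]:
        """Box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)


def parse_bbox(raw: str) -> BoundingBox:
    """Parse a JSON object with x, y, w, h keys; raises KeyError if one is missing."""
    data = json.loads(raw)
    return BoundingBox(*(float(data[key]) for key in ("x", "y", "w", "h")))


box = parse_bbox('{"x": 120, "y": 80, "w": 40, "h": 40}')
print(box.center, box.corners)
```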

Integration with Robot Control¶

graph LR
    COSMOS["🌌 Cosmos<br/>(Vision Reasoning)"] -->|"next action"| POLICY["🧠 Policy<br/>(strands-robots)"]
    POLICY -->|"joint commands"| ROBOT["🦾 Robot<br/>(real / sim)"]
    ROBOT -->|"camera feed"| COSMOS

    style COSMOS fill:#76b900,color:#fff
    style POLICY fill:#264653,color:#fff
    style ROBOT fill:#831843,color:#fff

Combine with strands-robots

Use Cosmos for high-level reasoning and strands-robots for low-level robot control:

from strands import Agent
from strands_cosmos import cosmos_vision_invoke
from strands_robots import Robot  # low-level control; not yet wired into the agent below

# Cosmos reasons about what to do
# strands-robots executes the action
agent = Agent(tools=[cosmos_vision_invoke])
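
The diagram above describes a loop: the camera feeds Cosmos, Cosmos proposes the next action, and the policy drives the robot. A library-agnostic sketch of that loop is shown below. The three callables are stand-ins; how strands-robots captures images or executes actions is not covered here, so `run_control_loop` and its parameters are hypothetical.

```python
from typing import Callable


def run_control_loop(
    capture: Callable[[], str],      # returns a path to the latest camera image
    reason: Callable[[str], str],    # maps an image path to a next action
    execute: Callable[[str], bool],  # runs the action; True means the task is done
    max_steps: int = 5,
) -> list[str]:
    """Repeat capture -> reason -> execute until done or max_steps is reached."""
    history: list[str] = []
    for _ in range(max_steps):
        image_path = capture()
        action = reason(image_path)
        history.append(action)
        if execute(action):
            break
    return history


# Stubbed run: the "model" replays the steps from the recorded output.
scripted = iter(["descend", "close gripper", "lift", "move to blue bin", "release"])
history = run_control_loop(
    capture=lambda: "sample.png",
    reason=lambda image_path: next(scripted),
    execute=lambda action: action == "release",
    max_steps=10,
)
print(history)
```

Capping the loop with `max_steps` keeps a confused model from driving the robot indefinitely; at roughly 43 seconds per inference in the recording above, a loop like this suits slow, deliberate tasks rather than real-time control.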
