Image Reasoning

Cosmos-Reason2 processes single images for object recognition, spatial reasoning, and embodied intelligence.


See It In Action

Embodied robot reasoning from image

Basic Image Analysis

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

agent("<image>workspace.jpg</image> Describe what you see.")

Embodied Reasoning (Robot Vision)

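from strands import Agent
from strands_cosmos import CosmosVisionModel

# reasoning=True asks the model to reason before answering
model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    reasoning=True,
)
agent = Agent(model=model)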

agent("""<image>robot_view.jpg</image>
Given this view from a bimanual robot's workspace:
What is the immediate next action the robot should take?""")

With reasoning=True, the model writes its reasoning inside <think>...</think> tags before providing the action.
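
If you only need the final action, the reasoning block can be separated from the answer. The following is a minimal sketch, assuming you already have the response as a plain string (for example from `str()` on the agent result, which is an assumption about how the result is exposed). The helper name `split_reasoning` is hypothetical and not part of `strands_cosmos`.

```python
import re

# Matches one <think>...</think> block; DOTALL lets the reasoning span lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw model response string."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole response as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is taken as the final answer.
    answer = text[match.end():].strip()
    return reasoning, answer


# Illustrative response text, not real model output.
raw = (
    "<think>The left gripper is empty and the cup is within reach.</think>\n"
    "Grasp the cup with the left gripper."
)
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Action:", answer)
```

Keeping the reasoning text around is useful for debugging, while downstream robot code usually consumes only the action.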

Full embodied reasoning example

2D Grounding

Cosmos-Reason2 can localize objects and return their bounding box coordinates:

agent("""<image>kitchen.jpg</image>
Locate the red cup in this image. Provide bounding box coordinates.""")
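
The exact output format for coordinates is not specified here, so any parsing has to be defensive. The sketch below assumes the response text contains boxes written as four bracketed numbers such as `[x1, y1, x2, y2]`. That format, and the helper name `extract_boxes`, are assumptions for illustration; adjust the pattern to whatever the model actually returns.

```python
import re

# One number, integer or decimal.
NUM = r"(-?\d+(?:\.\d+)?)"
# Four comma-separated numbers inside square brackets, e.g. [120, 45, 310, 220].
BOX_RE = re.compile(
    r"\[\s*" + NUM + r"\s*,\s*" + NUM + r"\s*,\s*" + NUM + r"\s*,\s*" + NUM + r"\s*\]"
)


def extract_boxes(text: str) -> list[tuple[float, float, float, float]]:
    """Find every [x1, y1, x2, y2] group in the text (assumed format)."""
    boxes = []
    for m in BOX_RE.finditer(text):
        x1, y1, x2, y2 = (float(v) for v in m.groups())
        # Normalize corner order so that x1 <= x2 and y1 <= y2.
        boxes.append((min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)))
    return boxes


# Illustrative response text, not real model output.
print(extract_boxes("The red cup is at [120, 45, 310, 220]."))
```

Whether the numbers are pixel values or normalized coordinates depends on the model's output convention, so check that before drawing boxes on the image.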

Image Format Support

Images are processed via the Qwen3-VL processor:

Supported formats: JPEG / JPG, PNG, WebP, BMP.
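
To catch unsupported files before they reach the model, the extension can be checked when the prompt is built. This is a minimal sketch based on the formats listed above. The helper names `is_supported_image` and `image_prompt` are hypothetical, and the check looks only at the file extension, not the file contents.

```python
from pathlib import Path

# Extensions for the formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def is_supported_image(path: str) -> bool:
    """Return True if the file extension matches a listed format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def image_prompt(path: str, question: str) -> str:
    """Build an <image>...</image> prompt, rejecting unlisted formats."""
    if not is_supported_image(path):
        raise ValueError(f"Unsupported image format: {path}")
    return f"<image>{path}</image> {question}"


print(image_prompt("workspace.jpg", "Describe what you see."))
```

The resulting string can be passed to the agent in the same way as the prompts in the earlier examples.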

Visual Token Configuration

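from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    min_vision_tokens=256,    # Lower bound on visual tokens (minimum detail)
    max_vision_tokens=8192,   # Upper bound on visual tokens (maximum detail)
)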

A higher max_vision_tokens value preserves more visual detail but increases memory use and slows inference.
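
To get a feel for how image size relates to the token budget, a back-of-envelope estimate can help. The sketch below assumes each visual token covers a square patch of `patch` pixels per side. The default of 32 is an assumed value, not taken from the Cosmos-Reason2 or Qwen3-VL documentation, and the function name `estimate_vision_tokens` is hypothetical. The real processor resizes images to fit the budget, so treat the result as a rough guide only.

```python
import math


def estimate_vision_tokens(
    width: int,
    height: int,
    patch: int = 32,          # assumed pixels per token side, not a documented value
    min_tokens: int = 256,    # mirrors min_vision_tokens above
    max_tokens: int = 8192,   # mirrors max_vision_tokens above
) -> int:
    """Rough token count for an image, clamped to the configured range."""
    raw = math.ceil(width / patch) * math.ceil(height / patch)
    return max(min_tokens, min(raw, max_tokens))


for size in [(224, 224), (1920, 1080), (8000, 6000)]:
    print(size, estimate_vision_tokens(*size))
```

Under these assumptions, small images are raised to the minimum budget and very large images are capped at the maximum, which is where raising max_vision_tokens starts to cost memory and speed.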


What's Next