Image Reasoning¶

Cosmos-Reason2 processes single images for object recognition, spatial reasoning, and embodied intelligence.

See It In Action¶

Embodied robot reasoning from image

Basic Image Analysis¶

from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

agent("<image>workspace.jpg</image> Describe what you see.")

Embodied Reasoning (Robot Vision)¶

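from strands import Agent
from strands_cosmos import CosmosVisionModel

# Enable reasoning so the model explains its decision before answering.
model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    reasoning=True,
)
agent = Agent(model=model)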

agent("""<image>robot_view.jpg</image>
Given this view from a bimanual robot's workspace:
What is the immediate next action the robot should take?""")

With reasoning=True, the model writes its reasoning inside <think>...</think> tags before stating the action.

→ Full embodied reasoning example
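When the reasoning and the action need to be handled separately (for example, logging the reasoning but sending only the action to a controller), the response text can be split on the <think> tags. The following is a minimal sketch, assuming the response is available as a plain string; the helper name split_reasoning is hypothetical and not part of strands_cosmos.

```python
import re

# Matches a single <think>...</think> block, including newlines inside it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response string.

    Hypothetical helper: if no <think> block is present, the reasoning
    is an empty string and the whole text is treated as the answer.
    """
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Made-up sample text, used only to illustrate the split.
sample = "<think>The cup is near the left gripper.</think>\nGrasp the cup."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Action:", answer)
```

Only the first <think> block is extracted; a response with several blocks would need findall or a loop instead.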

2D Grounding¶

Cosmos-Reason2 can localize objects and return their bounding-box coordinates:

agent("""<image>kitchen.jpg</image>
Locate the red cup in this image. Provide bounding box coordinates.""")
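The response format for bounding boxes is not specified here, so the coordinates usually have to be pulled out of free text. The sketch below assumes the model writes each box as four numbers in square brackets, [x1, y1, x2, y2], and that coordinates may be normalized to a 0 to 1000 range; both are assumptions to check against actual output. The helper names extract_boxes and to_pixels are hypothetical.

```python
import re

# Four comma-separated numbers in square brackets, e.g. [120, 45, 310, 290].
NUMBER = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*,\s*"
    + NUMBER + r"\s*,\s*" + NUMBER + r"\s*\]"
)


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] quadruple found in the text as floats."""
    return [
        tuple(float(value) for value in match.groups())
        for match in BOX_PATTERN.finditer(text)
    ]


def to_pixels(box, width, height, scale=1000.0):
    """Map a box from an assumed 0..scale range to pixel coordinates.

    The default scale of 1000 is an assumption; pass scale=1.0 for
    0..1 normalized boxes, or skip this step if boxes are already pixels.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / scale * width),
        round(y1 / scale * height),
        round(x2 / scale * width),
        round(y2 / scale * height),
    )


# Made-up response text, used only to illustrate the parsing.
response = 'The red cup is at {"bbox_2d": [100, 200, 500, 800]}.'
for box in extract_boxes(response):
    print(to_pixels(box, width=640, height=480))
```

Keeping parsing and scaling in separate functions makes it easy to drop the scaling step if the model turns out to return pixel coordinates directly.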

Image Format Support¶

Images are processed via the Qwen3-VL processor:

Format	Supported
JPEG / JPG	✅
PNG	✅
WebP	✅
BMP	✅
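A quick pre-flight check can catch unlisted formats before an image path is placed in a prompt. This is a minimal sketch that only compares the file extension against the table above; it does not inspect file contents, and the helper name is_listed_format is hypothetical.

```python
from pathlib import Path

# Extensions taken from the format table above.
LISTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def is_listed_format(path: str) -> bool:
    """Return True if the file extension appears in the format table.

    A False result means the format is not listed above, not that it
    is known to fail.
    """
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


for name in ["workspace.jpg", "robot_view.PNG", "scan.tiff"]:
    print(name, is_listed_format(name))
```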

Visual Token Configuration¶

from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    min_vision_tokens=256,    # Minimum visual detail
    max_vision_tokens=8192,   # Maximum visual detail
)

A higher max_vision_tokens value preserves more visual detail at the cost of memory and speed.
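To get a feel for how image size relates to these limits, the token count can be roughly estimated from the resolution. The sketch below assumes each visual token covers a square patch of 32 by 32 pixels and that the count is clamped to the configured range; the patch size and the clamping behavior are assumptions, not documented behavior of the processor, and estimate_vision_tokens is a hypothetical helper.

```python
import math


def estimate_vision_tokens(
    width,
    height,
    min_tokens=256,
    max_tokens=8192,
    pixels_per_token_side=32,  # Assumed patch size; adjust to the real processor.
):
    """Roughly estimate the visual tokens used for one image.

    Counts how many square patches cover the image, then clamps the
    result to [min_tokens, max_tokens].
    """
    patches_wide = math.ceil(width / pixels_per_token_side)
    patches_high = math.ceil(height / pixels_per_token_side)
    raw = patches_wide * patches_high
    return max(min_tokens, min(raw, max_tokens))


for size in [(224, 224), (1920, 1080), (4096, 4096)]:
    print(size, estimate_vision_tokens(*size))
```

Under these assumptions, small images are lifted to the minimum and very large images are capped at the maximum, which is where raising max_vision_tokens would start to matter.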

What's Next¶

Video Understanding — Multi-frame temporal analysis
Chain-of-Thought — Step-by-step reasoning