Image Reasoning
Cosmos-Reason2 processes single images for object recognition, spatial reasoning, and embodied intelligence.
Basic Image Analysis
```python
from strands import Agent
from strands_cosmos import CosmosVisionModel

model = CosmosVisionModel(model_id="nvidia/Cosmos-Reason2-2B")
agent = Agent(model=model)

# Reference the image inline with <image>...</image> tags.
agent("<image>workspace.jpg</image> Describe what you see.")
```
Embodied Reasoning (Robot Vision)
```python
model = CosmosVisionModel(
    model_id="nvidia/Cosmos-Reason2-2B",
    reasoning=True,
)
agent = Agent(model=model)

agent("""<image>robot_view.jpg</image>
Given this view from a bimanual robot's workspace:
What is the immediate next action the robot should take?""")
```
With reasoning=True, the model writes out its reasoning inside <think>...</think> tags before giving the action.
→ Full embodied reasoning example
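Since the reasoning and the action arrive in one reply, it can help to separate them before passing the action on. The following is a minimal sketch, not part of strands_cosmos: `split_reasoning` is a hypothetical helper, and it assumes the reply text holds a single <think>...</think> block ahead of the answer. How you obtain the reply text from the agent result is left out here.

```python
import re

# Assumption: reasoning is wrapped in one <think>...</think> block
# that comes before the final answer.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw model reply."""
    match = THINK_PATTERN.search(text)
    if match is None:
        # No reasoning block found: treat the whole reply as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Illustrative reply text, not real model output.
reply = (
    "<think>The left gripper is empty and the cup is within reach.</think>\n"
    "Grasp the cup with the left gripper."
)
reasoning, answer = split_reasoning(reply)
print(answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while only the action text is forwarded to a controller.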
2D Grounding
Cosmos can localize objects with bounding box coordinates:
```python
agent("""<image>kitchen.jpg</image>
Locate the red cup in this image. Provide bounding box coordinates.""")
```
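The exact shape of the grounding reply is not specified here, so any parsing code has to make assumptions. The sketch below assumes the reply contains one or more bracketed lists of four numbers in (x1, y1, x2, y2) order. `extract_boxes` is a hypothetical helper, and whether the numbers are pixels or normalized coordinates is something to check against real output.

```python
import re

NUMBER = r"(-?\d+(?:\.\d+)?)"

# Assumption: boxes appear as "[x1, y1, x2, y2]" somewhere in the reply.
BOX_PATTERN = re.compile(
    r"\[\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*,\s*"
    + NUMBER + r"\s*,\s*" + NUMBER + r"\s*\]"
)


def extract_boxes(text):
    """Return every four-number box found in the text as float tuples."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(value) for value in match.groups())
        # Normalize corner order so that x1 <= x2 and y1 <= y2.
        boxes.append((min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)))
    return boxes


# Illustrative reply text, not real model output.
sample = "The red cup is at [120, 85, 310, 402]."
print(extract_boxes(sample))
```

Returning a list rather than a single box keeps the helper usable when the prompt asks for several objects.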
Image Format Support
Images are processed by the Qwen3-VL processor, which accepts the following formats:
| Format | Supported |
|---|---|
| JPEG / JPG | ✅ |
| PNG | ✅ |
| WebP | ✅ |
| BMP | ✅ |
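To catch an unsupported file before it reaches the model, a prompt can be built through a small guard. This is a sketch under stated assumptions: `image_prompt` is a hypothetical helper, not part of strands_cosmos, and it only checks the file extension against the table above rather than inspecting the file contents.

```python
from pathlib import Path

# Extensions taken from the format table above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def image_prompt(image_path: str, question: str) -> str:
    """Build an <image>...</image> prompt after checking the extension."""
    suffix = Path(image_path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported image format: {suffix or 'no extension'}")
    return f"<image>{image_path}</image> {question}"


print(image_prompt("workspace.jpg", "Describe what you see."))
```

The returned string can be passed to the agent in the same way as the hand-written prompts in the earlier examples.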
Visual Token Configuration
```python
model = CosmosVisionModel(
    min_vision_tokens=256,   # Minimum visual detail
    max_vision_tokens=8192,  # Maximum visual detail
)
```
A higher max_vision_tokens value preserves more visual detail but costs more memory and slows inference.
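To get a feel for how image size relates to the token budget, the rough estimate below may help. It is a back-of-the-envelope sketch, not the processor's actual logic: `estimate_vision_tokens` is a hypothetical helper, and `pixels_per_token=32` is an assumed side length for the square patch each visual token covers. The real processor resizes the image to fit the budget rather than clamping a count.

```python
import math


def estimate_vision_tokens(width, height, min_tokens=256, max_tokens=8192,
                           pixels_per_token=32):
    """Rough token estimate for one image, clamped to the configured range.

    pixels_per_token is an ASSUMED patch side length; check the processor
    configuration for the real value.
    """
    columns = math.ceil(width / pixels_per_token)
    rows = math.ceil(height / pixels_per_token)
    return max(min_tokens, min(max_tokens, columns * rows))


for size in [(224, 224), (1280, 720), (4096, 4096)]:
    print(size, estimate_vision_tokens(*size))
```

Under these assumptions, small images are raised to the minimum and very large ones are capped at the maximum, so raising max_vision_tokens only matters for images that would otherwise exceed the cap.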
What's Next
- Video Understanding — Multi-frame temporal analysis
- Chain-of-Thought — Step-by-step reasoning