Architecture
How strands-cosmos is structured internally.
Package Structure
strands_cosmos/
├── __init__.py                  # Exports: CosmosModel, CosmosVisionModel, tools
├── cosmos_model.py              # Text-only model (Strands Model interface)
├── cosmos_vision_model.py       # Vision model (video + image + text)
├── fix_cublas.py                # Jetson CUBLAS compatibility fix
└── tools/
    ├── __init__.py              # Tool exports
    ├── cosmos_invoke.py         # Text inference tool (@tool decorated)
    └── cosmos_vision_invoke.py  # Vision inference tool (@tool decorated)
Model Hierarchy
graph TD
SM["strands.models.Model<br/><i>Abstract base class</i>"] --> CM["CosmosModel<br/><i>Text-only</i>"]
SM --> CVM["CosmosVisionModel<br/><i>Video + Image + Text</i>"]
CVM --> Q["Qwen3VLForConditionalGeneration<br/><i>HuggingFace Transformers</i>"]
CM --> Q
Q --> GPU["🖥️ NVIDIA GPU<br/>CUDA inference"]
style SM fill:#264653,color:#fff
style CVM fill:#76b900,color:#fff
style CM fill:#76b900,color:#fff
Data Flow
sequenceDiagram
participant User
participant Agent as Strands Agent
participant Model as CosmosVisionModel
participant HF as Transformers
participant GPU as CUDA
User->>Agent: agent("caption: <video>file.mp4</video>")
Agent->>Model: format_request(messages)
Model->>Model: Parse <video>/<image> tags
Model->>HF: processor(text, images, videos)
HF->>GPU: input_ids + pixel_values
GPU->>HF: logits (autoregressive)
HF->>Model: generated tokens
Model->>Agent: format_response(stream_events)
Agent->>User: Result text
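The tag-parsing step in the diagram above can be sketched roughly as follows. This is a minimal illustration using only the standard library; `parse_media_tags` and `ParsedPrompt` are hypothetical names, and the actual parser in `cosmos_vision_model.py` may behave differently (for example, it may keep placeholders where the tags were).

```python
import re
from typing import List, NamedTuple

# Matches <video>...</video> and <image>...</image>; \1 forces the closing
# tag to match the opening one.
MEDIA_TAG = re.compile(r"<(video|image)>(.*?)</\1>", re.DOTALL)


class ParsedPrompt(NamedTuple):
    text: str           # prompt with the media tags removed
    videos: List[str]   # paths found inside <video> tags
    images: List[str]   # paths found inside <image> tags


def parse_media_tags(prompt: str) -> ParsedPrompt:
    """Hypothetical helper: split a prompt into plain text and media paths."""
    videos: List[str] = []
    images: List[str] = []
    for kind, path in MEDIA_TAG.findall(prompt):
        (videos if kind == "video" else images).append(path.strip())
    # Drop the tags, then collapse any leftover runs of whitespace.
    text = " ".join(MEDIA_TAG.sub("", prompt).split())
    return ParsedPrompt(text, videos, images)


parsed = parse_media_tags("caption: <video>file.mp4</video>")
print(parsed.text, parsed.videos, parsed.images)
```

The separated text, image, and video lists correspond to the three inputs that the diagram passes on to the Transformers processor.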
Two Usage Modes
graph TD
subgraph "Mode 1: As the Model"
A1["Agent(model=CosmosVisionModel())"] --> B1["Cosmos IS the agent's brain"]
end
subgraph "Mode 2: As a Tool"
A2["Agent(tools=[cosmos_vision_invoke])"] --> B2["Cosmos is a tool<br/>called by another model"]
end
style B1 fill:#76b900,color:#fff
style B2 fill:#264653,color:#fff
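In code, the two modes might look like the sketch below. The imports are deferred into the functions so that nothing is loaded until an agent is actually built. The import path `strands_cosmos.tools` for `cosmos_vision_invoke` is an assumption based on the package layout; the tool may also be importable from the package root.

```python
def build_model_agent():
    """Mode 1: Cosmos is the agent's own model."""
    from strands import Agent
    from strands_cosmos import CosmosVisionModel

    return Agent(model=CosmosVisionModel())


def build_tool_agent():
    """Mode 2: the agent's default model drives the loop and calls Cosmos as a tool."""
    from strands import Agent
    from strands_cosmos.tools import cosmos_vision_invoke  # assumed import path

    return Agent(tools=[cosmos_vision_invoke])


# Example usage (requires strands, strands_cosmos, and a CUDA-capable GPU):
#   agent = build_model_agent()
#   agent("caption: <video>file.mp4</video>")
```

Mode 1 suits pipelines where Cosmos answers directly. Mode 2 suits agents whose planning is handled by a different model that only occasionally needs video or image understanding.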
Strands Model Interface
CosmosVisionModel implements the full Strands Model interface:
| Method | Purpose |
|---|---|
| `update_config()` | Merge user config |
| `get_config()` | Return current config |
| `format_request()` | Convert messages → HF inputs |
| `format_chunk()` | Stream tokens → StreamEvents |
| `format_response()` | Finalize response metadata |
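To illustrate the first two methods, here is a minimal sketch of the merge-and-read pattern. `ToyConfigModel` is a hypothetical stand-in written for this page, not the real `CosmosVisionModel`; the default values are taken from the configuration documented on this page, and the real class may validate or nest its config differently.

```python
from typing import Any, Dict


class ToyConfigModel:
    """Hypothetical stand-in that shows only the config half of the interface."""

    def __init__(self, **config: Any) -> None:
        # Illustrative defaults; the real model has more settings.
        self._config: Dict[str, Any] = {
            "model_id": "nvidia/Cosmos-Reason2-2B",
            "fps": 4,
        }
        self.update_config(**config)

    def update_config(self, **config: Any) -> None:
        """Merge user-supplied keys over the current config."""
        self._config.update(config)

    def get_config(self) -> Dict[str, Any]:
        """Return a copy so callers cannot mutate internal state."""
        return dict(self._config)


model = ToyConfigModel(fps=2)
model.update_config(reasoning=False)
print(model.get_config())
```

Returning a copy from `get_config()` is a design choice of this sketch: it keeps all changes flowing through `update_config()`.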
Configuration
CosmosVisionModel(
    # Model selection
    model_id="nvidia/Cosmos-Reason2-2B",  # HuggingFace ID
    # GPU settings
    device_map="auto",                    # GPU placement
    torch_dtype="auto",                   # float16 / bfloat16
    # Vision settings
    fps=4,                                # Video frame sampling rate
    min_vision_tokens=256,                # Min visual tokens per frame
    max_vision_tokens=8192,               # Max visual tokens per frame
    # Reasoning
    reasoning=True,                       # Enable <think> CoT
    # Generation
    params={
        "max_tokens": 4096,
        "temperature": 0.6,
        "top_p": 0.95,
    },
)
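Since generation runs through HuggingFace Transformers, the `params` dictionary presumably ends up as keyword arguments to `generate()`. The sketch below shows one plausible mapping. `to_generate_kwargs` is a hypothetical helper, and the mapping itself is an assumption; only the target names (`max_new_tokens`, `do_sample`, `temperature`, `top_p`) are real `generate()` arguments.

```python
from typing import Any, Dict


def to_generate_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical mapping from `params` to Transformers generate() kwargs."""
    kwargs: Dict[str, Any] = {}
    if "max_tokens" in params:
        # generate() counts newly generated tokens via max_new_tokens.
        kwargs["max_new_tokens"] = int(params["max_tokens"])
    temperature = params.get("temperature")
    if temperature is not None:
        # Sampling only makes sense with a positive temperature;
        # otherwise fall back to greedy decoding.
        kwargs["do_sample"] = temperature > 0
        if temperature > 0:
            kwargs["temperature"] = float(temperature)
    if "top_p" in params:
        kwargs["top_p"] = float(params["top_p"])
    return kwargs


print(to_generate_kwargs({"max_tokens": 4096, "temperature": 0.6, "top_p": 0.95}))
```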