Content blocks & modalities¶
TransformerModel consumes the full Strands content-block taxonomy. Every output
below is a real model result (CUDA · transformers 5.12 · torch 2.10),
reproducible from the matching example.
flowchart TB
subgraph CB["content blocks → handler"]
  direction LR
  T["text"] --> HT["tokenizer fast-path"]
  I["🖼️ image"] --> HI["AutoProcessor"]
  V["🎬 video"] --> HV["processor + VideoMetadata (fps)"]
  TR["🧰 toolResult(image)"] --> HI
  D["document"] --> HD["flatten → text"]
  AU["audio*"] --> HA["feature_extractor"]
  OM["audio in/out"] --> HO["Omni Thinker+Talker"]
end
classDef blk fill:#7C5CFF22,stroke:#7C5CFF,stroke-width:1.5px,color:#7C5CFF;
classDef h fill:#22D3EE1f,stroke:#22D3EE,stroke-width:1.5px,color:#0F91A6;
class T,I,V,TR,D,AU,OM blk;
class HT,HI,HV,HD,HA,HO h;
* audio is our extension to the Strands taxonomy - see Audio.
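The routing in the flowchart can be sketched as a small lookup. This is only an illustration of the mapping, not the library's dispatch code: `HANDLERS` and `route` are hypothetical names, the handler values are descriptive labels rather than real functions, and routing nested text inside a toolResult to the tokenizer is an assumption. The Omni audio in/out path is left out.

```python
# Hypothetical routing table mirroring the flowchart. The values are
# descriptive labels, not function names from the library.
HANDLERS = {
    "text": "tokenizer fast-path",
    "image": "AutoProcessor",
    "video": "processor + VideoMetadata (fps)",
    "document": "flatten -> text",
    "audio": "feature_extractor",
}


def route(block: dict) -> list[str]:
    """Return the handler label(s) a single content block would reach."""
    if "toolResult" in block:
        # A toolResult wraps nested blocks; route each nested block in turn.
        routed: list[str] = []
        for inner in block["toolResult"].get("content", []):
            routed.extend(route(inner))
        return routed
    for kind, handler in HANDLERS.items():
        if kind in block:
            return [handler]
    raise ValueError(f"unsupported content block: {sorted(block)}")


print(route({"text": "hello"}))
```

An image nested in a toolResult ends up at the same label as a top-level image, which is the point of the shared `HI` node in the diagram.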
Example responses¶
| Block | Input | Script | Real output |
|---|---|---|---|
| image | image (not shown) | multimodal_agent.py | "Green." |
| video | 8 frames dark→bright (fps=2.0) | multimodal_advanced.py | "BRIGHTER." |
| image in toolResult | tool returns an image (not shown) | multimodal_advanced.py | "Blue." |
| document | txt "…passphrase is BANANA-42…" | document_and_audio.py | recovers BANANA-42 |
| audio | 440 Hz tone (Omni) | omni_audio.py | "It's a pure tone." |
Media you can feed it¶
These are real artifacts - a TTS clip and an MP4 - that round-trip through the library (the video decodes to 24 frames @ 12 fps; the audio re-transcribes intelligibly):
| 🎬 video (mp4 / gif) | audio (TTS, wav) |
|---|---|
| (embedded video clip, not preserved) | (embedded audio clip, not preserved) |
# /// script
# requires-python = ">=3.10"
# dependencies = ["strands-transformers[vision]", "imageio[ffmpeg]"]
# ///
import numpy as np
from strands_transformers import use_transformers

# audio out - text-to-speech writes a .wav artifact
tts = use_transformers(action="run", task="text-to-audio",
                       model="facebook/mms-tts-eng", inputs="hello from strands")
print("audio:", tts["artifacts"][0])

# video in - classify a clip (frame list is auto-stacked to (T,H,W,C))
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
vid = use_transformers(action="run", task="video-classification",
                       model="MCG-NJU/videomae-base-finetuned-kinetics", inputs=frames)
print("top label:", vid["data"][0]["label"])
🖼️ Image¶
result = agent([
    {"image": {"format": "png", "source": {"bytes": png_bytes}}},
    {"text": "What color is this image? One word."},
])
🎬 Video¶
A video block is a list of frames (or a (T,H,W,C) array / container bytes).
Provide fps so the model builds real frame timestamps.
model.stream([{"role": "user", "content": [
    {"video": {"format": "mp4", "fps": 2.0, "source": {"bytes": frames}}},
    {"text": "Does this video get brighter or darker?"},
]}])
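To see why fps matters, here is the arithmetic that turns a frame index into a timestamp. This is a sketch under the assumption that frames are evenly spaced, so frame `i` sits at `i / fps` seconds. How VideoMetadata computes timestamps internally is not shown on this page, and both function names below are hypothetical.

```python
def frame_timestamps(num_frames: int, fps: float) -> list[float]:
    """Timestamp in seconds of each frame, assuming evenly spaced frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in range(num_frames)]


def clip_duration(num_frames: int, fps: float) -> float:
    """Total clip length in seconds for evenly spaced frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# The 8-frame example at fps=2.0 spans 0.0 s to 3.5 s in 0.5 s steps.
print(frame_timestamps(8, 2.0))
# The sample MP4 (24 frames @ 12 fps) is a 2-second clip.
print(clip_duration(24, 12))
```

Without fps, the same 8 frames could be half a second or half a minute of footage, and a question about what happens over time becomes ambiguous.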
🧰 Tool-result images (the agentic-vision loop)¶
A tool returns an image inside a toolResult; the VLM reasons over it on the
next turn - exactly the loop you want for screen-watchers and camera agents.
{"toolResult": {"toolUseId": "t1", "status": "success", "content": [
    {"text": "Here is the captured screen:"},
    {"image": {"format": "png", "source": {"bytes": blue_png}}},
]}}
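If several tools return screenshots, a small builder keeps the block shape consistent. This is a sketch only: `image_tool_result` is a hypothetical helper, not a library function, and the placeholder bytes stand in for a real capture.

```python
def image_tool_result(tool_use_id: str, png_bytes: bytes, caption: str = "") -> dict:
    """Wrap PNG bytes (and an optional caption) in a toolResult block."""
    content: list[dict] = []
    if caption:
        content.append({"text": caption})
    content.append({"image": {"format": "png", "source": {"bytes": png_bytes}}})
    return {"toolResult": {
        "toolUseId": tool_use_id,
        "status": "success",
        "content": content,
    }}


# Placeholder bytes; a real tool would pass an actual PNG capture here.
block = image_tool_result("t1", b"<png bytes>", "Here is the captured screen:")
print(list(block["toolResult"]))
```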
Document¶
{"document": {"name": "secret", "format": "txt",
"source": {"bytes": b"...the passphrase is BANANA-42..."}}}
# "What is the passphrase?"
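The flowchart describes the document handler as "flatten → text". A rough idea of what that could mean for a txt document is sketched below. It assumes UTF-8 content and invents a header line for illustration; `flatten_document` is a hypothetical name and this is not the library's implementation.

```python
def flatten_document(block: dict) -> dict:
    """Turn a txt document block into a plain text block (illustrative only)."""
    doc = block["document"]
    # Assumption: the payload is UTF-8 text; undecodable bytes are replaced.
    body = doc["source"]["bytes"].decode("utf-8", errors="replace")
    # The header format is made up here so the model can see the file name.
    return {"text": f"[document: {doc['name']}.{doc['format']}]\n{body}"}


doc_block = {"document": {"name": "secret", "format": "txt",
                          "source": {"bytes": b"...the passphrase is BANANA-42..."}}}
flat = flatten_document(doc_block)
print(flat["text"])
```

Once the document is ordinary text, the question "What is the passphrase?" can be answered through the same path as any other text block.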
Audio¶
See Audio (in & out) - with playable real outputs.
Supported transformers modalities (the tool)¶
| Modality | Example tasks |
|---|---|
| text | text-generation, fill-mask, token/text-classification, feature-extraction, table-qa |
| image | image-classification, depth-estimation, image-feature-extraction, keypoint-matching |
| audio | automatic-speech-recognition, audio-classification, text-to-audio |
| video | video-classification |
| multimodal | image-text-to-text, visual/document-qa, object-detection, segmentation, zero-shot-*, any-to-any |
Run use_transformers(action="tasks") for the live, complete list on your install.
