
Content blocks & modalities

TransformerModel consumes the full Strands content-block taxonomy. Every output below is a real model result (CUDA · transformers 5.12 · torch 2.10), reproducible from the matching example.

flowchart TB
    subgraph CB["content blocks → handler"]
        direction LR
        T["📝 text"] --> HT["tokenizer fast-path"]
        I["🖼️ image"] --> HI["AutoProcessor 👁"]
        V["🎬 video"] --> HV["processor + VideoMetadata (fps)"]
        TR["🧰 toolResult(image)"] --> HI
        D["📄 document"] --> HD["flatten → text"]
        AU["🔊 audio*"] --> HA["feature_extractor 🔊"]
        OM["🔊 audio in/out"] --> HO["Omni Thinker+Talker"]
    end
    classDef blk fill:#7C5CFF22,stroke:#7C5CFF,stroke-width:1.5px,color:#7C5CFF;
    classDef h fill:#22D3EE1f,stroke:#22D3EE,stroke-width:1.5px,color:#0F91A6;
    class T,I,V,TR,D,AU,OM blk;
    class HT,HI,HV,HD,HA,HO h;
* audio is our extension to the Strands taxonomy - see Audio.
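
The routing in the diagram can be pictured as a lookup keyed on the block type. The following is only an illustrative sketch: `HANDLERS` and `route` are hypothetical names, the values are labels copied from the diagram rather than real `strands_transformers` functions, and the model-dependent Omni path is left out.

```python
# Hypothetical routing table mirroring the diagram above.
# The values are descriptive labels, not real library callables.
HANDLERS = {
    "text": "tokenizer fast-path",
    "image": "AutoProcessor",
    "video": "processor + VideoMetadata (fps)",
    "document": "flatten to text",
    "audio": "feature_extractor",
}


def route(block: dict) -> list[str]:
    """Return the handler label(s) a single content block would reach."""
    if "toolResult" in block:
        # A toolResult wraps further blocks; each nested block is routed on its own,
        # so an image inside a toolResult reaches the same handler as a plain image.
        labels = []
        for inner in block["toolResult"].get("content", []):
            labels.extend(route(inner))
        return labels
    for kind, label in HANDLERS.items():
        if kind in block:
            return [label]
    raise ValueError(f"unsupported content block: {sorted(block)}")
```

For instance, `route({"toolResult": {"content": [{"image": {}}]}})` yields the same label as `route({"image": {}})`, which is the point of the shared `HI` node in the diagram.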

Example responses

| Block | Input | Script | Real output |
| --- | --- | --- | --- |
| image | "Color? One word." | multimodal_agent.py | "Green." |
| video | 8 frames dark→bright (fps=2.0) | multimodal_advanced.py | "BRIGHTER." |
| image in toolResult | tool returns | multimodal_advanced.py | "Blue." |
| document txt | "…passphrase is BANANA-42…" | document_and_audio.py | recovers BANANA-42 |
| audio | 440 Hz tone (Omni) | omni_audio.py | "It's a pure tone." |

Media you can feed it

These are real artifacts - a TTS clip and an MP4 - that round-trip through the library (the video decodes to 24 frames @ 12 fps; the audio re-transcribes intelligibly):

🎬 video (mp4 / gif) · 🔊 audio (TTS, wav)
# /// script
# requires-python = ">=3.10"
# dependencies = ["strands-transformers[vision]", "imageio[ffmpeg]"]
# ///
import numpy as np
from strands_transformers import use_transformers

# audio out - text-to-speech writes a .wav artifact
tts = use_transformers(action="run", task="text-to-audio",
                       model="facebook/mms-tts-eng", inputs="hello from strands")
print("audio:", tts["artifacts"][0])

# video in - classify a clip (frame list is auto-stacked to (T,H,W,C))
# 16 random RGB frames stand in for a real clip; randint's upper bound is exclusive, hence 256
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
vid = use_transformers(action="run", task="video-classification",
                       model="MCG-NJU/videomae-base-finetuned-kinetics", inputs=frames)
print("top label:", vid["data"][0]["label"])

πŸ–ΌοΈ Image

result = agent([
    {"image": {"format": "png", "source": {"bytes": png_bytes}}},
    {"text": "What color is this image? One word."},
])
Green.
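
The snippet above assumes `png_bytes` already exists. One way to produce a solid-color test image without extra dependencies is sketched below; `solid_png` is a hypothetical helper (not part of the library) that writes a minimal 8-bit RGB PNG with the standard library only.

```python
import struct
import zlib


def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Encode a single-color 8-bit RGB PNG using only the standard library."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        # Each PNG chunk is: 4-byte length, 4-byte tag, data, CRC32 over tag + data.
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 2 (RGB),
    # default compression and filter methods, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with filter byte 0 (no filter), followed by RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    signature = b"\x89PNG\r\n\x1a\n"
    return signature + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")


png_bytes = solid_png(64, 64, (0, 255, 0))  # pure green test image

# Same content-block shape as in the example above.
content = [
    {"image": {"format": "png", "source": {"bytes": png_bytes}}},
    {"text": "What color is this image? One word."},
]
```

The resulting `content` list has the same shape as the argument passed to `agent(...)` above.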

🎬 Video

A video block is a list of frames (or a (T,H,W,C) array / container bytes). Provide fps so the model builds real frame timestamps.

model.stream([{"role": "user", "content": [
    {"video": {"format": "mp4", "fps": 2.0, "source": {"bytes": frames}}},
    {"text": "Does this video get brighter or darker?"},
]}])
BRIGHTER.
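
To see why `fps` matters, here is the arithmetic it enables, as a sketch that assumes frames are sampled uniformly starting at t = 0. `frame_timestamps` and `clip_duration` are hypothetical helpers; the library's own timestamp logic may differ.

```python
def frame_timestamps(num_frames: int, fps: float) -> list[float]:
    """Timestamp in seconds of each frame, assuming uniform sampling from t = 0."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in range(num_frames)]


def clip_duration(num_frames: int, fps: float) -> float:
    """Total clip length in seconds under the same uniform-sampling assumption."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# The 8-frame, fps=2.0 clip from the example responses:
print(frame_timestamps(8, 2.0))  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
# The 24-frame clip at 12 fps mentioned under "Media you can feed it":
print(clip_duration(24, 12.0))   # 2.0
```

Without `fps`, a model only knows frame order; with it, each frame can be tied to a point in time.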

🧰 Tool-result images (the agentic-vision loop)

A tool returns an image inside a toolResult; the VLM reasons over it on the next turn - exactly the loop you want for screen-watchers and camera agents.

{"toolResult": {"toolUseId": "t1", "status": "success", "content": [
    {"text": "Here is the captured screen:"},
    {"image": {"format": "png", "source": {"bytes": blue_png}}},
]}}
Blue.
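
A small helper can keep this block shape in one place. The sketch below is hypothetical (`image_tool_result` is not a library function) and assumes the tool result is sent back in a user-role message, as is common for Converse-style message lists; check your own message flow before relying on that.

```python
def image_tool_result(tool_use_id: str, png_bytes: bytes, caption: str) -> dict:
    """Wrap a captured image and a caption in a toolResult message.

    Assumption: the result travels back in a user-role message.
    """
    return {
        "role": "user",
        "content": [
            {
                "toolResult": {
                    "toolUseId": tool_use_id,
                    "status": "success",
                    "content": [
                        {"text": caption},
                        {"image": {"format": "png", "source": {"bytes": png_bytes}}},
                    ],
                }
            }
        ],
    }


# Placeholder bytes for illustration only; a real tool would return encoded PNG data.
message = image_tool_result("t1", b"<png bytes here>", "Here is the captured screen:")
```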

📄 Document

{"document": {"name": "secret", "format": "txt",
              "source": {"bytes": b"...the passphrase is BANANA-42..."}}}
# "What is the passphrase?"
BANANA-42
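
The diagram at the top describes documents as being flattened to text. For a plain-text document, that step might look roughly like the sketch below. `flatten_document`, the supported-format set, and the header line are assumptions for illustration; this is not the library's actual implementation.

```python
def flatten_document(block: dict) -> str:
    """Turn a plain-text document block into a text snippet for the prompt.

    Hypothetical sketch: only text-like formats are handled here.
    """
    doc = block["document"]
    if doc["format"] not in {"txt", "md"}:
        raise ValueError(f"format {doc['format']!r} is not handled in this sketch")
    # Decode defensively so a stray byte does not abort the whole request.
    text = doc["source"]["bytes"].decode("utf-8", errors="replace")
    return f"[document: {doc['name']}.{doc['format']}]\n{text}"


block = {"document": {"name": "secret", "format": "txt",
                      "source": {"bytes": b"...the passphrase is BANANA-42..."}}}
flattened = flatten_document(block)
```

Once flattened, the document is ordinary text, which is why a text-only question such as "What is the passphrase?" can be answered from it.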

🔊 Audio

See Audio (in & out) - with playable real outputs.

Supported transformers modalities (the tool)

| Modality | Example tasks |
| --- | --- |
| text | text-generation, fill-mask, token/text-classification, feature-extraction, table-qa |
| image | image-classification, depth-estimation, image-feature-extraction, keypoint-matching |
| audio | automatic-speech-recognition, audio-classification, text-to-audio |
| video | video-classification |
| multimodal | image-text-to-text, visual/document-qa, object-detection, segmentation, zero-shot-*, any-to-any |

Run use_transformers(action="tasks") for the live, complete list on your install.