Home
🤗 transformers · 🧬 strands agents
Run any HuggingFace transformers model from a Strands agent - as a tool for all 24 tasks, or as the agent's own multimodal brain. Local, no API keys.
# /// script
# dependencies = ["strands-transformers[vision]"]
# ///
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel
png = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(png, "PNG")
agent = Agent(model=TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct"))
print(agent([{"image": {"format": "png", "source": {"bytes": png.getvalue()}}},
             {"text": "Color? One word."}]))
A 256M vision-language model, wired into the standard agent loop, seeing pixels - in a dozen lines. Run any snippet in these docs the same way.
Why this exists
Wiring HuggingFace into an agent usually means bespoke glue for every model -
the right AutoModel class, processor, device, dtype, and a custom decode for
each task. It rots the moment transformers ships a new task or you swap models.
strands-transformers deletes that work. It reads transformers' own task taxonomy
at runtime (SUPPORTED_TASKS),
so nothing is hardcoded per task: a new task or model upstream just works here,
no code change. You get one way in - through an agent.
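To make the runtime-taxonomy idea concrete, here is a minimal sketch of dispatch driven by a registry instead of per-task code. The `STAND_IN_TASKS` dict and the `describe_tasks` / `resolve_task` helpers are hypothetical and are not strands-transformers code; the dict only imitates the rough shape of transformers' SUPPORTED_TASKS (task name mapped to metadata).

```python
from typing import Any, Mapping

# Hypothetical stand-in imitating the rough shape of transformers' SUPPORTED_TASKS
# (task name -> metadata). With transformers installed, the real registry can be
# imported with `from transformers.pipelines import SUPPORTED_TASKS`.
STAND_IN_TASKS: dict[str, dict[str, Any]] = {
    "text-classification": {"type": "text"},
    "image-classification": {"type": "image"},
    "automatic-speech-recognition": {"type": "multimodal"},
}


def describe_tasks(registry: Mapping[str, Mapping[str, Any]]) -> dict[str, str]:
    """Map every task in the registry to its modality, with no per-task code."""
    return {name: str(meta.get("type", "unknown")) for name, meta in sorted(registry.items())}


def resolve_task(registry: Mapping[str, Mapping[str, Any]], task: str) -> Mapping[str, Any]:
    """Validate a task name against whatever the registry holds right now."""
    if task not in registry:
        known = ", ".join(sorted(registry))
        raise ValueError(f"unknown task {task!r}; known tasks: {known}")
    return registry[task]


if __name__ == "__main__":
    print(describe_tasks(STAND_IN_TASKS))
    # A task added to the registry later is picked up without touching the helpers.
    STAND_IN_TASKS["depth-estimation"] = {"type": "image"}
    print(resolve_task(STAND_IN_TASKS, "depth-estimation"))
```

Because the helpers only ever look at the mapping they are given, passing the real registry in place of the stand-in should need no other change, under the assumption that each entry carries a `type` key.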
Two entry points
- use_transformers - the tool
  One tool exposing every transformers task. Discover the taxonomy at runtime, run a pipeline, or call any class/method directly. New model or task upstream → works here with no code change.
- TransformerModel - the brain
  Plug a local HF model in as Agent(model=…). It speaks the agent content-block protocol - image, video, audio, document reach the model directly. VLMs see, audio models hear, Qwen2.5-Omni speaks back.
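The image block in the opening snippet suggests the general shape of a content-block message. Below is a minimal sketch of assembling one from plain dicts. The helper names `image_block`, `text_block`, and `user_content` are hypothetical, and only the image and text shapes are taken from this page; the shapes of video, audio, and document blocks are not shown here, so they are left out.

```python
from typing import Any

ContentBlock = dict[str, Any]


def image_block(data: bytes, fmt: str = "png") -> ContentBlock:
    """Wrap raw image bytes in the image block shape used in the opening snippet."""
    if not data:
        raise ValueError("image data must not be empty")
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def text_block(text: str) -> ContentBlock:
    """Wrap a prompt string in a text block."""
    return {"text": text}


def user_content(*blocks: ContentBlock) -> list[ContentBlock]:
    """Collect blocks, in order, into the list that is handed to the agent."""
    return list(blocks)


if __name__ == "__main__":
    # PNG signature only: a placeholder for illustration, not a decodable image.
    fake_png = b"\x89PNG\r\n\x1a\n"
    content = user_content(image_block(fake_png), text_block("Color? One word."))
    print([next(iter(block)) for block in content])
```

The resulting list has the same structure as the argument passed to `agent(...)` in the opening snippet, so real image bytes could be swapped in for the placeholder.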
See it - and hear it
Every result below is real model output, reproducible from the linked example.
| You send | Model | You get back |
|---|---|---|
| 🖼️ green image + "Color?" | SmolVLM | "Green." |
| 🎬 brightening frames | SmolVLM2-Video | "BRIGHTER." |
| 🧰 tool returns a screenshot | SmolVLM | "Blue." |
| a text document | Qwen3 | recovers BANANA-42 |
| a 440 Hz tone | Qwen2.5-Omni | "It's a pure tone." |
| 🦾 camera + "pick the cube" | MolmoAct2 | actions [1,30,6] |
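Inputs like the ones in this table are easy to synthesize. As an illustration, the sketch below builds a one-second 440 Hz tone as WAV bytes using only the standard library. The function name `sine_wav` and the 16 kHz mono 16-bit format are assumptions made for the example; how such bytes are wrapped in an audio block is not shown on this page.

```python
import io
import math
import struct
import wave


def sine_wav(freq_hz: float = 440.0, seconds: float = 1.0, rate: int = 16_000) -> bytes:
    """Return a mono 16-bit PCM WAV file containing a sine tone, as bytes."""
    n_frames = int(seconds * rate)
    amplitude = 0.5 * 32767  # half of full scale, to leave headroom
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_frames)
    )
    pcm = b"".join(struct.pack("<h", s) for s in samples)  # little-endian int16

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    data = sine_wav()
    print(len(data), "bytes of WAV audio")
```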
Hear the library explain itself - generated by text-to-audio, then
re-transcribed by Whisper to prove it's real speech:
Real agent outputs - detection · depth · panoptic segmentation:

🎬 video understanding · generated speech (the library narrating itself):

Run any snippet
Code blocks here are self-contained PEP 723
scripts - dependencies live in the header. Save one as demo.py and run it
with uv run demo.py. uv builds a throwaway env with the right extras and
runs it. No pip install, no venv to manage.
What you can build
- 🗣️ Voice assistant - speak to it, it speaks back (Qwen2.5-Omni), one local model.
- 🤖 Robot controller - camera + instruction → joint actions (MolmoAct, OpenVLA).
- 👁️ Screen-watcher - a tool returns a screenshot; the VLM reasons over it.
- Document Q&A - drop a doc block in the conversation, ask about it.
- 🎬 Video understander - pass frames, ask what changes over time.
- Any HF task - ASR, detection, segmentation, embeddings… via one tool.
Where next
- New here → Installation · Quickstart
- See & hear → Content blocks · Audio
- Robotics → Robotics / VLA
- Internals → Architecture · API