
🤗 transformers · 🧬 strands agents

Strands Transformers

Run any HuggingFace transformers model from a Strands agent - as a tool for all 24 tasks, or as the agent's own multimodal brain. Local, no API keys.


Every modality in, every modality out - one tool, one local brain, zero hardcoding

# /// script
# dependencies = ["strands-transformers[vision]"]
# ///
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

png = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(png, "PNG")
agent = Agent(model=TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct"))
print(agent([{"image": {"format": "png", "source": {"bytes": png.getvalue()}}},
             {"text": "Color? One word."}]))

$ uv run hello.py
Green.

A 256M vision-language model, wired into the standard agent loop, seeing pixels - in a dozen lines. Run any snippet in these docs the same way.

Why this exists

Wiring HuggingFace into an agent usually means bespoke glue for every model - the right AutoModel class, processor, device, dtype, and a custom decode step for each task. That glue rots the moment transformers ships a new task or you swap models.

strands-transformers deletes that work. It reads transformers' own task taxonomy at runtime (SUPPORTED_TASKS), so nothing is hardcoded per task: a new task or model upstream just works here, no code change. You get one way in - through an agent.
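
The difference between hardcoding per task and reading a taxonomy at runtime can be sketched in a few lines. This is not the library's implementation: `FAKE_SUPPORTED_TASKS` is a hypothetical stand-in for transformers' `SUPPORTED_TASKS` registry, the task names and callables in it are made up, and `list_tasks` and `run_task` are illustrative names, so the sketch runs with no install.

```python
from typing import Any, Callable

# Hypothetical stand-in for transformers' SUPPORTED_TASKS registry.
# Each made-up task maps to a plain callable so the sketch runs anywhere.
FAKE_SUPPORTED_TASKS: dict[str, Callable[[str], Any]] = {
    "text-classification": lambda text: "POSITIVE" if "good" in text else "NEGATIVE",
    "token-count": lambda text: len(text.split()),
}


def list_tasks(registry: dict[str, Callable[[str], Any]]) -> list[str]:
    """Discover task names from the registry instead of a hardcoded list."""
    return sorted(registry)


def run_task(registry: dict[str, Callable[[str], Any]], task: str, payload: str) -> Any:
    """Dispatch by task name; unknown names fail with the discovered list."""
    if task not in registry:
        raise ValueError(f"unknown task {task!r}; available: {list_tasks(registry)}")
    return registry[task](payload)


# A task added "upstream" shows up with no change to list_tasks or run_task.
FAKE_SUPPORTED_TASKS["shout"] = lambda text: text.upper()

print(list_tasks(FAKE_SUPPORTED_TASKS))
print(run_task(FAKE_SUPPORTED_TASKS, "shout", "hello"))
```

Because both functions only ever consult the registry, adding an entry is the whole integration step; nothing else needs to know the task exists.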

Two entry points

use_transformers - the tool

One tool exposing every transformers task. Discover the taxonomy at runtime, run a pipeline, or call any class/method directly. New model or task upstream ⇒ works here with no code change.

TransformerModel - the brain

Plug a local HF model in as Agent(model=…). It speaks the agent content-block protocol, so image, video, audio, and document blocks reach the model directly. VLMs see, audio models hear, Qwen2.5-Omni speaks back.
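
The content-block shape used in the quickstart can be wrapped in small helpers. A minimal sketch: `text_block`, `image_block`, and `document_block` are hypothetical helper names, not part of strands or strands-transformers. The image and text shapes mirror the quickstart above; the document shape, including its `name` field, is an assumption modeled on the image block.

```python
def text_block(text: str) -> dict:
    """Build a text content block, as in the quickstart."""
    return {"text": text}


def image_block(data: bytes, fmt: str = "png") -> dict:
    """Build an image content block from raw encoded bytes, as in the quickstart."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def document_block(data: bytes, name: str, fmt: str = "txt") -> dict:
    """Build a document content block (assumed shape, modeled on image_block)."""
    return {"document": {"format": fmt, "name": name, "source": {"bytes": data}}}


# Placeholder bytes: only the PNG signature, not a decodable image.
fake_png = b"\x89PNG\r\n\x1a\n"

# One user turn made of an image block followed by a text block.
message = [image_block(fake_png), text_block("Color? One word.")]
print([next(iter(block)) for block in message])
```

A list like `message` is what the quickstart passes to `agent(...)`; with real PNG bytes in place of the placeholder, the call would look the same.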


See it - and hear it

Every result below is real model output, reproducible from the linked example.

| You send | Model | You get back |
| --- | --- | --- |
| 🖼️ green image + "Color?" | SmolVLM | `"Green."` |
| 🎬 brightening frames | SmolVLM2-Video | `"BRIGHTER."` |
| 🧰 tool returns a screenshot | SmolVLM | `"Blue."` |
| 📄 a text document | Qwen3 | recovers `BANANA-42` |
| 🔊 a 440 Hz tone | Qwen2.5-Omni | `"It's a pure tone."` |
| 🦾 camera + "pick the cube" | MolmoAct2 | actions `[1,30,6]` |
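
The 440 Hz tone in the audio row is easy to reproduce as input. A minimal sketch using only the standard library: the function name `sine_wav`, the 16 kHz sample rate, and the one-second duration are choices made for this example, not values taken from the linked example.

```python
import io
import math
import struct
import wave


def sine_wav(freq_hz: float = 440.0, seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV file containing a sine tone."""
    n_samples = int(seconds * rate)
    amplitude = 0.5 * 32767  # half of full scale, to leave headroom
    frames = b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


tone = sine_wav()
print(f"{len(tone)} bytes of WAV data")
```

The returned bytes are a complete WAV file held in memory, so they can be written to disk or placed in an audio content block without a temporary file. The exact shape of that block is not shown here.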

[Audio: the library explaining itself, generated by text-to-audio and then re-transcribed by Whisper to confirm it is real speech.]

[Images: real agent outputs for detection, depth, and panoptic segmentation.]

[Media: 🎬 video understanding · 🔊 generated speech (the library narrating itself).]

Run any snippet

Code blocks here are self-contained PEP 723 scripts: dependencies live in the header. Save one as demo.py and run it:

$ uv run demo.py

uv builds a throwaway env with the right extras and runs it. No pip install, no venv to manage.

What you can build

🗣️ Voice assistant - speak to it, it speaks back (Qwen2.5-Omni), one local model.
🤖 Robot controller - camera + instruction → joint actions (MolmoAct, OpenVLA).
👁️ Screen-watcher - a tool returns a screenshot; the VLM reasons over it.
📄 Document Q&A - drop a doc block in the conversation, ask about it.
🎬 Video understander - pass frames, ask what changes over time.
🔌 Any HF task - ASR, detection, segmentation, embeddings… via one tool.

Where next

New here → Installation · Quickstart
See & hear → Content blocks · Audio
Robotics → Robotics / VLA
Internals → Architecture · API