🤗 transformers · 🧬 strands agents

Strands Transformers

Run any HuggingFace transformers model from a Strands agent - as a tool for all 24 tasks, or as the agent's own multimodal brain. Local, no API keys.

Every modality in, every modality out - one tool, one local brain, zero hardcoding

# /// script
# dependencies = ["strands-transformers[vision]"]
# ///
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

png = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(png, "PNG")
agent = Agent(model=TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct"))
print(agent([{"image": {"format": "png", "source": {"bytes": png.getvalue()}}},
             {"text": "Color? One word."}]))

$ uv run hello.py
Green.

A 256M vision-language model, wired into the standard agent loop, seeing pixels - in a dozen lines. Run any snippet in these docs the same way.

Why this exists

Wiring HuggingFace into an agent usually means bespoke glue for every model - the right AutoModel class, processor, device, dtype, and a custom decode step for each task. That glue rots the moment transformers ships a new task or you swap models.

strands-transformers deletes that work. It reads transformers' own task taxonomy at runtime (SUPPORTED_TASKS), so nothing is hardcoded per task: a new task or model upstream just works here, no code change. You get one way in - through an agent.
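To make "reads the taxonomy at runtime" concrete, here is a minimal sketch of registry-driven discovery. It is not the library's implementation. It assumes transformers.pipelines.SUPPORTED_TASKS is a dict mapping task names to metadata dicts that carry a "type" entry; load_task_registry and tasks_by_modality are hypothetical helper names, and the small fallback table exists only so the sketch runs without transformers installed.

```python
from collections import defaultdict


def load_task_registry() -> dict:
    """Return transformers' task registry, or a tiny stand-in if it is unavailable."""
    try:
        # Real registry: task name -> metadata dict (assumed shape, see lead-in).
        from transformers.pipelines import SUPPORTED_TASKS
        return dict(SUPPORTED_TASKS)
    except (ImportError, RuntimeError):
        # Illustrative stand-in only, so the sketch still runs without transformers.
        return {
            "text-classification": {"type": "text"},
            "image-classification": {"type": "image"},
            "audio-classification": {"type": "audio"},
        }


def tasks_by_modality(registry: dict) -> dict:
    """Group task names by their declared "type"; nothing is hardcoded per task."""
    grouped = defaultdict(list)
    for task, meta in sorted(registry.items()):
        modality = meta.get("type", "unknown") if isinstance(meta, dict) else "unknown"
        grouped[modality].append(task)
    return dict(grouped)


if __name__ == "__main__":
    # Whatever tasks the installed transformers version registers show up here.
    for modality, tasks in tasks_by_modality(load_task_registry()).items():
        print(f"{modality}: {', '.join(tasks)}")
```

Because the task list is read rather than written down, upgrading transformers changes the output without touching this code, which is the property the paragraph above describes.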

Two entry points

  • use_transformers - the tool

    One tool exposing every transformers task. Discover the taxonomy at runtime, run a pipeline, or call any class/method directly. New model or task upstream ⇒ works here with no code change.

    See: The tool

  • TransformerModel - the brain

    Plug a local HF model in as Agent(model=…). It speaks the agent content-block protocol - image, video, audio, and document blocks reach the model directly. VLMs see, audio models hear, Qwen2.5-Omni speaks back.

    See: The agent brain
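A content block is just a small dict. The sketch below builds a mixed prompt from such blocks. The image shape is copied from the hello.py example above; the document shape (format, name, source bytes) is an assumption modelled on it, so check it against the library's reference before relying on it. All helper names here are hypothetical, not part of strands-transformers.

```python
def text_block(text: str) -> dict:
    """Plain text block, as used in the hello.py example."""
    return {"text": text}


def image_block(data: bytes, fmt: str = "png") -> dict:
    """Image block with the same shape as in the hello.py example."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def document_block(data: bytes, name: str, fmt: str = "txt") -> dict:
    """Document block. ASSUMED shape: mirrors the image block, plus a name."""
    return {"document": {"format": fmt, "name": name, "source": {"bytes": data}}}


# A prompt is an ordered list of blocks; media usually comes before the question.
prompt = [
    document_block(b"Meeting moved to Thursday.", name="notes"),
    text_block("When is the meeting?"),
]

if __name__ == "__main__":
    for block in prompt:
        # Each block has exactly one top-level key naming its modality.
        print(next(iter(block)))
```

A list like prompt would be passed to the agent the same way the image-plus-text list is passed in hello.py.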

See it - and hear it

Every result below is real model output, reproducible from the linked example.

| You send | Model | You get back |
| --- | --- | --- |
| 🖼️ green image + "Color?" | SmolVLM | "Green." |
| 🎬 brightening frames | SmolVLM2-Video | "BRIGHTER." |
| 🧰 tool returns a screenshot | SmolVLM | "Blue." |
| 📄 a text document | Qwen3 | recovers BANANA-42 |
| 🔊 a 440 Hz tone | Qwen2.5-Omni | "It's a pure tone." |
| 🦾 camera + "pick the cube" | MolmoAct2 | actions [1,30,6] |

[Audio: the library explaining itself - generated by text-to-audio, then re-transcribed by Whisper to prove it's real speech]

[Images: real agent outputs - detection · depth · panoptic segmentation]

[Media: 🎬 video understanding · 🔊 generated speech (the library narrating itself)]

Run any snippet

Code blocks here are self-contained PEP 723 scripts - dependencies live in the comment header at the top of the file. Save one as demo.py, then run it:

$ uv run demo.py

uv builds a throwaway env with the right extras and runs it. No pip install, no venv to manage.

What you can build

  • 🗣️ Voice assistant - speak to it, it speaks back (Qwen2.5-Omni), one local model.
  • 🤖 Robot controller - camera + instruction → joint actions (MolmoAct, OpenVLA).
  • 👁️ Screen-watcher - a tool returns a screenshot; the VLM reasons over it.
  • 📄 Document Q&A - drop a doc block in the conversation, ask about it.
  • 🎬 Video understander - pass frames, ask what changes over time.
  • 🔌 Any HF task - ASR, detection, segmentation, embeddings… via one tool.

Where next