Strands Sapiens

Give your agent a body. Pixel-perfect human understanding, as tools.

Input → Segmentation → Normals
Real output from sapiens_seg and sapiens_normal on a single image (0.4b model, NVIDIA Thor)

Strands Sapiens exposes Meta's Sapiens2 - a family of high-resolution transformers pretrained on 1B human images - as idiomatic Strands Agents tools.

One Python import, every human-centric vision head: body-part segmentation, surface normals, intrinsic albedo, 3D pointmaps, 308-keypoint pose, plus raw pretrained backbone features.


What it gives you

Tool suite overview
  • 🧩 29-class body-part segmentation

    sapiens_seg(input_path="people/", output_dir="out/", model_size="0.4b")
    

    → Guide

  • 🧭 Surface normals & albedo

    sapiens_normal(input_path="person.jpg", output_dir="out/")
    sapiens_albedo(input_path="person.jpg", output_dir="out/")
    

    → Normals · Albedo

  • 🌌 3D pointmap (per-pixel 3D)

    sapiens_pointmap(input_path="person.jpg", output_dir="out/")
    

    → Pointmap guide

  • 🦴 308-keypoint 2D pose

    Face 274 + body + hands + feet.

    sapiens_pose(input_path="person.jpg", output_dir="out/")
    

    → Pose guide

  • 🧠 Pretrain backbone features

    Drop-in dense features for RAG / downstream heads.

    sapiens_backbone(image_path="person.jpg", model_size="0.1b")
    

    → Backbone guide

  • 🔍 Checkpoint / env discovery

    sapiens_info()   # → what's available locally
    

    → Checkpoints

  • 🎬 Video processing

    Frame-by-frame inference on any video file.

    sapiens_video(video_path="dance.mp4", output_dir="out/", task="seg")
    

    → Video guide


Why

Sapiens2 is the current state of the art for human perception at native high resolution (up to 4096×3072). It's one of the highest-signal models ever open-sourced for human-centric understanding - but it ships as a research codebase with CLI scripts, config gymnastics, and manual checkpoint wiring.

Strands Sapiens turns each head into a one-line, agent-callable tool with structured responses, defensive fallbacks, and compatibility across upstream's breaking API changes.


60-second quickstart

# 1) CUDA PyTorch (platform-specific; e.g. CUDA 12.4)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# 2) Sapiens2
pip install git+https://github.com/facebookresearch/sapiens2.git

# 3) This wrapper
pip install git+ssh://git@github.com/cagataycali/strands-sapiens.git

# 4) Drop a checkpoint into ~/sapiens2_host/seg/sapiens2_0.4b_seg.safetensors
export SAPIENS_CHECKPOINT_ROOT=~/sapiens2_host

# 5) Go
python -c "from strands_sapiens import sapiens_info; print(sapiens_info())"
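
If step 5 reports nothing available, it can help to check where a checkpoint is expected to live. The sketch below is not part of the wrapper: `expected_checkpoint` is a hypothetical helper, and the `<root>/<task>/sapiens2_<size>_<task>.safetensors` layout is an assumption inferred from the single example path in step 4. Other tasks may be laid out differently.

```python
import os
from pathlib import Path
from typing import Optional


def expected_checkpoint(task: str, model_size: str, root: Optional[str] = None) -> Path:
    """Build <root>/<task>/sapiens2_<size>_<task>.safetensors (assumed layout)."""
    # Fall back to the env var from step 4, then to the quickstart's example root.
    base = root or os.environ.get("SAPIENS_CHECKPOINT_ROOT", "~/sapiens2_host")
    return Path(base).expanduser() / task / f"sapiens2_{model_size}_{task}.safetensors"


path = expected_checkpoint("seg", "0.4b")
print(path, "found" if path.is_file() else "missing")
```

`sapiens_info()` remains the authoritative check; this only shows how the example path decomposes.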

🚀 Full installation → 📖 API reference →


Use from a Strands agent

from strands import Agent
from strands_sapiens import TOOLS   # list of @tool-decorated functions

agent = Agent(tools=TOOLS)
agent("Segment every person in /data/photos and save to /data/out")
agent("Run 308-kpt pose on /data/photos/jump.jpg")
agent("What Sapiens2 checkpoints do I have available locally?")

Every tool returns a structured dict:

{
  "status":   "success" | "error",
  "message":  "...",
  "outputs":  [...],      # per-image entries
  "checkpoint": "...",
  # task-specific keys
}
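
When calling a tool directly rather than through an agent, the response shape above suggests checking `status` before touching `outputs`. A minimal sketch, assuming only the four documented keys: `summarize_result` is a hypothetical helper, and the per-image entry in `example` is a hand-written stand-in whose `"input"` key is illustrative, not the wrapper's actual schema.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> str:
    """Turn a tool response dict into a one-line summary, raising on errors."""
    if result.get("status") != "success":
        # "message" carries the human-readable explanation.
        raise RuntimeError(f"Sapiens tool failed: {result.get('message', 'no message')}")
    outputs = result.get("outputs", [])
    checkpoint = result.get("checkpoint", "unknown")
    return f"{len(outputs)} output(s) from checkpoint {checkpoint}"


# Stand-in for a real tool response, so the sketch runs without a GPU.
example = {
    "status": "success",
    "message": "ok",
    "outputs": [{"input": "person.jpg"}],  # hypothetical per-image entry
    "checkpoint": "sapiens2_0.4b_seg.safetensors",
}
print(summarize_result(example))
```

In real use, `example` would be replaced by the return value of a call such as `sapiens_seg(...)`.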

Verified environment

Tested on NVIDIA Thor (JetPack 6, aarch64) with CUDA PyTorch 2.7+ and the sapiens2_0.4b_seg / sapiens2_0.1b_pretrain checkpoints.

Requires Python 3.10+ (Thor's default 3.10 works; newer versions are also fine).
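
To compare a machine against the setup above before installing anything else, a small preflight check along these lines may be useful. This is an illustrative sketch, not something the wrapper ships: `preflight` is a hypothetical helper, and it uses only the standard library plus `torch.__version__` and `torch.cuda.is_available()` when PyTorch is present.

```python
import importlib.util
import sys


def preflight() -> dict:
    """Report how this interpreter compares with the tested environment."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check still runs without PyTorch

            report["torch_version"] = torch.__version__
            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install counts as "not usable" rather than crashing the check.
            report["torch_installed"] = False
    return report


print(preflight())
```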


Built on top of Meta's Sapiens2, powered by Strands Agents.