Pose (308 keypoints)
Top-down 2D pose estimation with 308 keypoints: 274 face points + body + hands + feet.
Uses the DETR-ResNet-101-DC5 person detector for bounding-box proposals, then runs the Sapiens2 pose head on each crop.
Signature
sapiens_pose(
    input_path: str,
    output_dir: str,
    model_size: str = "0.4b",
    device: str = "cuda:0",
    kpt_thres: float = 0.3,
    line_thickness: int = 2,
    radius: int = 3,
) -> dict
Requirements
You need two checkpoints for this tool:
- $SAPIENS_CHECKPOINT_ROOT/pose/sapiens2_<size>_pose.safetensors
- $SAPIENS_CHECKPOINT_ROOT/detector/detr-resnet-101-dc5/ (HuggingFace model dir)
Run sapiens_info() to confirm both are present:
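from strands_sapiens import sapiens_info  # assumed to be exported alongside sapiens_pose

result = sapiens_info()
info = result["content"][1]["json"]  # structured payload of the tool response
assert "pose" in info["available"]   # pose checkpoint found
assert info["detector_present"]      # detector directory found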
Download the detector:
huggingface-cli download facebook/detr-resnet-101-dc5 \
    --local-dir $SAPIENS_CHECKPOINT_ROOT/detector/detr-resnet-101-dc5
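After downloading, the two locations can also be checked directly on disk. This is a minimal sketch, assuming the layout listed above and assuming that <size> is the model_size string used verbatim (for example "0.4b"). The helpers expected_checkpoints and missing_checkpoints are hypothetical and not part of strands_sapiens.

```python
from pathlib import Path


def expected_checkpoints(root: str, model_size: str = "0.4b") -> dict:
    """Build the two paths this page lists for a given model size.

    Assumption: <size> in the file name is the model_size string as-is.
    """
    base = Path(root)
    return {
        "pose": base / "pose" / f"sapiens2_{model_size}_pose.safetensors",
        "detector": base / "detector" / "detr-resnet-101-dc5",
    }


def missing_checkpoints(root: str, model_size: str = "0.4b") -> list:
    """Return the names of expected checkpoints that are not on disk."""
    paths = expected_checkpoints(root, model_size)
    return [name for name, path in paths.items() if not path.exists()]
```

Calling missing_checkpoints(os.environ["SAPIENS_CHECKPOINT_ROOT"]) returns an empty list when both entries exist. It only checks that the paths exist, not that the files are valid.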
Example
from strands_sapiens import sapiens_pose

sapiens_pose(
    input_path="dance.jpg",
    output_dir="out/",
    model_size="0.4b",
    kpt_thres=0.3,
)
Output per image:
- out/dance.jpg - skeleton + keypoints overlay
- out/dance.json - structured detections
Example JSON shape (may vary slightly by upstream version):
{
  "instances": [
    {
      "bbox": [x1, y1, x2, y2],
      "score": 0.97,
      "keypoints": [[x, y], ...],       // 308 points
      "keypoint_scores": [0.91, ...]    // 308 confidences
    }
  ]
}
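For post-processing, the per-image JSON can be loaded and its keypoints filtered by confidence. This is a minimal sketch, assuming the field names in the example shape above, which may differ across upstream versions. The helpers load_instances and confident_keypoints are hypothetical, not part of strands_sapiens.

```python
import json


def load_instances(json_path: str) -> list:
    """Read the per-image JSON and return its list of detected people."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f).get("instances", [])


def confident_keypoints(instance: dict, thres: float = 0.3) -> list:
    """Return (index, x, y, score) for keypoints scoring at or above thres.

    Assumes "keypoints" and "keypoint_scores" are parallel lists.
    """
    pairs = zip(instance["keypoints"], instance["keypoint_scores"])
    return [
        (i, kp[0], kp[1], score)
        for i, (kp, score) in enumerate(pairs)
        if score >= thres
    ]
```

Keeping the original index in each tuple lets downstream code map a surviving point back to its position among the 308 keypoints.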
Compatibility strategy
sapiens_pose tries three integration paths in order, so it keeps working across upstream Sapiens2 refactors:
- sapiens.pose.inference.Inferencer (high-level API).
- sapiens.pose.models.init_pose_model + PoseVisualizer.
- If neither works, returns an error pointing you at the upstream shell script with the correct paths.
You can see which path ran from the response "api" field.
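The ordered fallback can be illustrated with a generic "first importable candidate wins" pattern. This is a sketch of the idea only, not the tool's actual code. The module and attribute names come from the list above, while first_available and the labels "inferencer" and "init_pose_model" are hypothetical and need not match the real "api" values.

```python
import importlib


def first_available(candidates):
    """Return the label of the first (label, module, attribute) that resolves.

    Returns None when no candidate can be imported, which is where a
    caller would fall back to an error message.
    """
    for label, module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this path is not installed, so try the next one
        if hasattr(module, attr):
            return label
    return None


# Illustrative labels; module/attribute names are taken from this page.
SAPIENS_CANDIDATES = [
    ("inferencer", "sapiens.pose.inference", "Inferencer"),
    ("init_pose_model", "sapiens.pose.models", "init_pose_model"),
]
```

Calling first_available(SAPIENS_CANDIDATES) returns None when neither module is importable, which corresponds to the third path: reporting an error rather than raising on import.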
Tips
- kpt_thres: lower it (0.1–0.2) if you're post-processing; raise to 0.5+ for crisp overlays.
- Face-only use: the 274 face keypoints form a dense landmark grid - useful for head-pose tracking, gaze, expression.
- Performance: the person detector dominates latency for small crops; for video, consider running detection once and reusing its boxes for pose across nearby frames.
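One way to read the performance tip is to refresh person boxes only every few frames and reuse them in between. This is a minimal sketch under that assumption. The detect and estimate_pose callables are hypothetical stand-ins for a detector and a pose model called directly, since the sapiens_pose signature above does not accept precomputed boxes.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def pose_with_cached_boxes(
    frames: Iterable,
    detect: Callable[[object], List[Box]],
    estimate_pose: Callable[[object, List[Box]], object],
    detect_every: int = 5,
) -> Iterator:
    """Yield pose results, refreshing person boxes every detect_every frames."""
    if detect_every < 1:
        raise ValueError("detect_every must be >= 1")
    boxes: List[Box] = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect(frame)  # expensive step, run sparsely
        yield estimate_pose(frame, boxes)  # cheap step, run on every frame
```

Reusing boxes is only reasonable when subjects move little between refreshes; a smaller detect_every trades speed for tighter crops.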
Related
- Segmentation: combine seg + pose for per-limb attention.