---
title: Sapiens2 CPU
emoji: 🧍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: other
---
# Sapiens2 CPU
Meta's `facebook/sapiens2-*` models running on the free HF CPU tier. 15 variants are exposed: seg, normal, pointmap, and pose across 0.4b, 0.8b, and 1b, plus seg-5b, normal-5b, and pointmap-5b as INT8 ONNX. Curl-callable with a Bearer token.
## Variants and inference time on the included 6000×4000 demo image
| Task | Notes | 0.4b | 0.8b | 1b | 5b (INT8 ONNX) |
|---|---|---|---|---|---|
| seg | DOME 29-class body parts | 57 s | 74 s | 208 s | 189 s |
| normal | per-pixel surface normals | 72 s | 84 s | 206 s | 359 s |
| pointmap | per-pixel XYZ in meters | 78 s | 99 s | 274 s | 386 s |
| pose | DETR detect, 308 keypoints | 47 s | 68 s | 232 s | not shipped |
Verified 15/15 via Gradio API on 2026-05-12. Times include first-call downloads.
0.4b through 1b run as fp32 PyTorch. 5B runs as INT8 ONNX (5 to 6 GB on disk; fp32 5B would need ~20 GB RAM, more than the free tier provides). Dense 0.4b/0.8b share an LRU(2) cache. Loading any 1B variant hard-clears all model caches (dense + pose + ORT) since 16 GB cpu-basic cannot fit two 1B-class models simultaneously. Pose has its own slot and DETR (facebook/detr-resnet-50) is sticky-loaded once.
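The cache policy above can be sketched in pure Python. This is a hypothetical illustration of the described behavior, not `app.py`'s actual code — `DENSE_CACHE`, `POSE_SLOT`, `ORT_SESSIONS`, and `load_dense` are made-up names:

```python
from collections import OrderedDict

DENSE_CACHE = OrderedDict()   # LRU(2) for 0.4b/0.8b dense models (sketch)
POSE_SLOT = {}                # dedicated slot for the pose model
ORT_SESSIONS = {}             # ONNX Runtime sessions for 5B variants

def load_dense(task: str, size: str, loader):
    """Return a cached dense model. Any 1b load hard-clears every cache
    first, since 16 GB cannot hold two 1B-class models; sub-1B models
    share an LRU cache capped at two entries."""
    if size == "1b":
        DENSE_CACHE.clear()
        POSE_SLOT.clear()
        ORT_SESSIONS.clear()
    key = (task, size)
    if key in DENSE_CACHE:
        DENSE_CACHE.move_to_end(key)      # mark as most recently used
        return DENSE_CACHE[key]
    model = loader(task, size)
    DENSE_CACHE[key] = model
    if size != "1b":
        while len(DENSE_CACHE) > 2:       # evict least recently used
            DENSE_CACHE.popitem(last=False)
    return model
```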
5B chain limitation: calling a 5B variant right after another 5B variant on the same Space instance OOMs. ONNX Runtime's C++ session shutdown is not synchronous with the Python _ORT_SESSIONS.clear() call, so loading the next 5B session before the previous one's worker threads exit peaks RAM above 16 GB. If you need to benchmark multiple 5B variants, factory-restart the Space (Settings → Factory restart) between calls, or run one variant per cold Space.
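A process-level guard against the 5B chain OOM might look like the following sketch (`check_5b_chain` and `_last_5b` are hypothetical names; the Space may or may not enforce this in code):

```python
_last_5b: list = []   # remembers whether a 5B session already ran in this process

def check_5b_chain(task: str, size: str) -> None:
    """Refuse a second 5B load in the same process: ORT's C++ session
    teardown lags the Python clear(), so back-to-back 5B sessions would
    overlap in RAM and push past 16 GB."""
    if size != "5b":
        return
    if _last_5b:
        raise RuntimeError(
            f"5B variant {_last_5b[0]!r} already ran in this instance; "
            f"factory-restart the Space before calling {task}-5b to avoid the OOM."
        )
    _last_5b.append(task)
```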
The model expects a fixed 1024×768 input tensor (NCHW with H=1024, W=768, a portrait canvas in Meta's convention). Any input is resized with its aspect ratio preserved, then padded to that canvas.
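The resize-then-pad geometry can be computed as below. This is a minimal sketch assuming centered padding and round-to-nearest sizing; `app.py`'s exact rounding and pad placement may differ:

```python
def fit_to_canvas(w: int, h: int, canvas_w: int = 768, canvas_h: int = 1024):
    """Aspect-preserving fit of a (w, h) image into the fixed 768x1024
    portrait canvas. Returns the resized size plus left/top pad offsets
    (padding assumed centered -- an illustrative choice)."""
    scale = min(canvas_w / w, canvas_h / h)   # shrink until both sides fit
    new_w, new_h = round(w * scale), round(h * scale)
    pad_left = (canvas_w - new_w) // 2
    pad_top = (canvas_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

For the 6000×4000 demo image this gives a 768×512 resize with 256 px of vertical padding on each side.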
## CPU-friendly ONNX exports
Companion repo: WeReCooking/sapiens2-onnx (public). Files live in per-task folders seg/, normal/, pointmap/, pose/. Each variant is <task>/<task>_<size>_<precision>.onnx plus a .onnx.data external sidecar. 15 ONNX artifacts shipped: 12 covering 0.4b/0.8b/1b (fp16 for seg-0.4b, fp32 for the rest), and 3 new 5B int8 files (seg, normal, pointmap). Cosine similarity vs PyTorch fp32 is 0.999 or better on all shipped variants.
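Fetching one variant from the companion repo can be sketched with `huggingface_hub` as below. The path helper follows the layout stated above; the precision strings (`fp16`, `fp32`, `int8`) and the sidecar download step are assumptions about the repo's contents:

```python
def variant_filename(task: str, size: str, precision: str) -> str:
    """Repo-relative path per the <task>/<task>_<size>_<precision>.onnx layout."""
    return f"{task}/{task}_{size}_{precision}.onnx"

def fetch_variant(task: str, size: str, precision: str) -> str:
    """Download one ONNX variant plus its .onnx.data external sidecar;
    ONNX Runtime expects the sidecar to sit next to the graph file."""
    from huggingface_hub import hf_hub_download  # lazy: path helper stays dep-free
    name = variant_filename(task, size, precision)
    hf_hub_download("WeReCooking/sapiens2-onnx", name + ".data")
    return hf_hub_download("WeReCooking/sapiens2-onnx", name)
```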
Turnkey CLI built into `app.py` (no sapiens2 / PyTorch dep needed; install `requirements.txt`):

```bash
export HF_TOKEN=hf_xxx
python app.py onnx seg 0.4b photo.jpg --output seg_overlay.png
python app.py onnx normal 1b photo.jpg --output normals.png
python app.py onnx pointmap 0.8b photo.jpg --output depth.png
python app.py onnx pose 0.4b photo.jpg --output pose.png
python app.py onnx seg 5b photo.jpg --output seg_5b.png
```
## Curl tests
```bash
TOKEN="hf_xxx"
SPACE="https://werecooking-sapiens2-cpu.hf.space"
IMG="https://huggingface.co/spaces/facebook/sapiens2-seg/resolve/main/assets/images/pexels-alex-green-5699868.jpg"

# Step 1: queue the job; the response contains an event_id
EVT=$(curl -s -X POST "$SPACE/gradio_api/call/predict" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d "{\"data\":[{\"path\":\"$IMG\",\"meta\":{\"_type\":\"gradio.FileData\"}},\"seg\",\"0.4b\"]}" \
  | python -c "import sys,json;print(json.load(sys.stdin)['event_id'])")

# Step 2: stream the result (SSE)
curl -sN "$SPACE/gradio_api/call/predict/$EVT" -H "Authorization: Bearer $TOKEN"
```
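The same two-step flow works from the Python standard library. A sketch, assuming `build_predict_payload` and `call_space` as illustrative names; the second GET returns raw SSE text that still needs parsing:

```python
import json
import urllib.request

def build_predict_payload(img_url: str, task: str, size: str) -> dict:
    """Same body the curl example POSTs to /gradio_api/call/predict."""
    return {"data": [{"path": img_url, "meta": {"_type": "gradio.FileData"}},
                     task, size]}

def call_space(space: str, token: str, img_url: str, task: str, size: str) -> str:
    """POST returns an event_id, then the result streams from
    GET /gradio_api/call/predict/<event_id> as server-sent events."""
    req = urllib.request.Request(
        f"{space}/gradio_api/call/predict",
        data=json.dumps(build_predict_payload(img_url, task, size)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        event_id = json.load(r)["event_id"]
    req = urllib.request.Request(
        f"{space}/gradio_api/call/predict/{event_id}",
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as r:
        return r.read().decode()   # raw SSE text containing the result event
```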
## Logs (SSE)
```bash
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/build"
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/run"
```
## 5B INT8 ONNX conversion recipe
The dense 5B variants ship as INT8 ONNX. To re-run the pipeline:
- Export fp16 ONNX using lazy fp16 init. Call `torch.set_default_dtype(torch.float16)` before `init_model(cfg, None, device="cpu")`, then stream the safetensors file tensor by tensor into the empty fp16 model. This avoids the ~22 GB fp32 init that OOMs on a 32 GB box. Export with `opset_version=18` and no `dynamic_axes`. Force `sys.stdout.reconfigure(encoding="utf-8")` so torch.onnx's success print does not crash on Windows cp1252.
- Stream-cast fp16 to fp32 on disk via `onnx.external_data_helper.load_external_data_for_model` plus per-tensor `numpy_helper`. Peak RAM stays close to a single tensor (~250 MB). Drop Cast(fp16 → fp32) nodes with a transitive rename closure so consumers point at the original input.
- Run `quantize.shape_inference.quant_pre_process(skip_onnx_shape=True, skip_optimization=True)`. This routes through ORT symbolic shape inference, which understands sapiens2 windowed attention. Vanilla `onnx.shape_inference` errors with `(6144) vs (512)` on the pointmap and normal heads.
- Quantize with `quantize_dynamic(weight_type=QuantType.QInt8, per_channel=True, op_types_to_quantize=["MatMul"], use_external_data_format=True)`. This lowers to `MatMulIntegerToFloat`, which accepts fp32 input and has no 2D-only filter (unlike `MatMulNBitsQuantizer`, which silently skips 3D packed-QKV weights).
Pose-5b is not shipped. It uses a different forward signature (single person bbox cropped tensor) and the int8 quantize attempt did not complete on the available hardware.
## Files
- `app.py` — everything: Gradio Space UI, PyTorch dispatch for 0.4b/0.8b/1b, ORT for 5B, inlined keypoint visualization, plus the `python app.py onnx ...` CLI
- `requirements.txt` — Python deps, including `sapiens @ git+https://github.com/facebookresearch/sapiens2.git`
- `packages.txt` — apt deps (`libgl1`, `libglib2.0-0`) installed by the Gradio SDK at build time