---
title: Sapiens2 CPU
emoji: 🧍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: other
---
# Sapiens2 CPU
Meta's `facebook/sapiens2-*` models running on a free HF CPU Space. 15 variants are exposed: seg, normal, pointmap, and pose at 0.4b, 0.8b, and 1b, plus seg-5b, normal-5b, and pointmap-5b as INT8 ONNX. Everything is curl-callable with a Bearer token.
## Variants and inference time on the included 6000×4000 demo image
| Task | Notes | 0.4b | 0.8b | 1b | 5b (INT8 ONNX) |
|---|---|---|---|---|---|
| seg | DOME 29-class body parts | 57 s | 74 s | 208 s | 189 s |
| normal | per-pixel surface normals | 72 s | 84 s | 206 s | 359 s |
| pointmap | per-pixel XYZ in meters | 78 s | 99 s | 274 s | 386 s |
| pose | DETR detect, 308 keypoints | 47 s | 68 s | 232 s | not shipped |
All 15 variants verified via the Gradio API on 2026-05-12. Times include first-call model downloads.
0.4b through 1b run as fp32 PyTorch. 5B runs as INT8 ONNX (5 to 6 GB on disk; fp32 5B would need ~20 GB RAM, more than the free tier provides). Dense 0.4b/0.8b models share an LRU(2) cache. Loading any 1B variant hard-clears all model caches (dense + pose + ORT), since 16 GB cpu-basic cannot fit two 1B-class models simultaneously. Pose has its own slot, and DETR (`facebook/detr-resnet-50`) is sticky-loaded once.
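A minimal sketch of that policy, for orientation: `_ORT_SESSIONS` is the name `app.py` actually uses (it reappears in the OOM note below); `_DENSE_CACHE`, `_POSE_SLOT`, and the loader stub are hypothetical stand-ins.

```python
# Minimal sketch of the cache policy above. _ORT_SESSIONS matches app.py;
# _DENSE_CACHE, _POSE_SLOT, and load_model are hypothetical stand-ins.
from collections import OrderedDict

_DENSE_CACHE = OrderedDict()  # LRU(2): dense 0.4b / 0.8b models
_POSE_SLOT = {}               # pose model + sticky DETR detector
_ORT_SESSIONS = {}            # INT8 ONNX Runtime sessions (5B)

def load_model(task: str, size: str):
    """Stand-in for the real loader (PyTorch up to 1b, ORT for 5b)."""
    return object()

def get_dense_model(task: str, size: str):
    key = (task, size)
    if size == "1b":
        # 16 GB cpu-basic cannot hold two 1B-class models: drop everything.
        _DENSE_CACHE.clear()
        _POSE_SLOT.clear()
        _ORT_SESSIONS.clear()
        return load_model(task, size)  # 1b is not cached alongside others
    if key in _DENSE_CACHE:
        _DENSE_CACHE.move_to_end(key)  # refresh LRU recency
        return _DENSE_CACHE[key]
    model = load_model(task, size)
    _DENSE_CACHE[key] = model
    while len(_DENSE_CACHE) > 2:       # LRU(2): evict the oldest entry
        _DENSE_CACHE.popitem(last=False)
    return model
```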
**5B chain limitation:** calling a 5B variant right after another 5B variant on the same Space instance OOMs. ONNX Runtime's C++ session shutdown is not synchronous with the Python `_ORT_SESSIONS.clear()` call, so loading the next 5B session before the previous one's worker threads exit peaks RAM above 16 GB. If you need to benchmark multiple 5B variants, factory-restart the Space (Settings → Factory restart) between calls, or run one variant per cold Space.
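If you control the Space, the factory restart can be scripted with `huggingface_hub` rather than clicked through Settings (needs a write token):

```python
# Programmatic equivalent of Settings -> Factory restart.
from huggingface_hub import HfApi

HfApi(token="hf_xxx").restart_space(
    "WeReCooking/sapiens2-cpu", factory_reboot=True
)
```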
The model expects a fixed 1024×768 input tensor (NCHW with H=1024 and W=768, a portrait canvas in Meta's convention). Any input is resized with aspect ratio preserved, then padded to that canvas (sketch below).
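A sketch of that preprocessing, assuming black padding and omitting the normalization constants `app.py` applies:

```python
# Aspect-preserving resize onto a fixed 1024x768 (HxW) portrait canvas.
# Padding color and normalization are assumptions; app.py may differ.
import numpy as np
from PIL import Image

def to_canvas(img: Image.Image, H: int = 1024, W: int = 768) -> np.ndarray:
    scale = min(W / img.width, H / img.height)        # fit inside the canvas
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.convert("RGB").resize((new_w, new_h), Image.BILINEAR)
    canvas = np.zeros((H, W, 3), dtype=np.uint8)      # pad with black
    canvas[:new_h, :new_w] = np.asarray(resized)
    # HWC uint8 -> NCHW float32; normalization constants omitted here.
    return canvas.transpose(2, 0, 1)[None].astype(np.float32)
```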
## CPU-friendly ONNX exports
Companion repo: [`WeReCooking/sapiens2-onnx`](https://huggingface.co/WeReCooking/sapiens2-onnx) (public). Files live in per-task folders `seg/`, `normal/`, `pointmap/`, `pose/`. Each variant is `<task>/<task>_<size>_<precision>.onnx` plus a `.onnx.data` external sidecar. 15 ONNX artifacts shipped: 12 covering 0.4b/0.8b/1b (fp16 for seg-0.4b, fp32 for the rest), and 3 new 5B int8 files (seg, normal, pointmap). Cosine similarity vs PyTorch fp32 is 0.999 or better on all shipped variants.
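One variant can also be pulled directly with `huggingface_hub`; the filename below just instantiates the pattern above (fp16 for seg-0.4b), so check it against the repo listing:

```python
# Download one variant plus its external-data sidecar.
from huggingface_hub import hf_hub_download

sidecar = hf_hub_download("WeReCooking/sapiens2-onnx",
                          "seg/seg_0.4b_fp16.onnx.data")
model = hf_hub_download("WeReCooking/sapiens2-onnx",
                        "seg/seg_0.4b_fp16.onnx")
# Both files land in the same snapshot directory, so onnxruntime resolves
# the .onnx.data sidecar automatically when given `model`.
```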
Turnkey CLI built into `app.py` (no sapiens2 / PyTorch dep needed; install `requirements.txt`):
```bash
export HF_TOKEN=hf_xxx
python app.py onnx seg 0.4b photo.jpg --output seg_overlay.png
python app.py onnx normal 1b photo.jpg --output normals.png
python app.py onnx pointmap 0.8b photo.jpg --output depth.png
python app.py onnx pose 0.4b photo.jpg --output pose.png
python app.py onnx seg 5b photo.jpg --output seg_5b.png
```
## Curl tests
```bash
TOKEN="hf_xxx"
SPACE="https://werecooking-sapiens2-cpu.hf.space"
IMG="https://huggingface.co/spaces/facebook/sapiens2-seg/resolve/main/assets/images/pexels-alex-green-5699868.jpg"
EVT=$(curl -s -X POST "$SPACE/gradio_api/call/predict" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d "{\"data\":[{\"path\":\"$IMG\",\"meta\":{\"_type\":\"gradio.FileData\"}},\"seg\",\"0.4b\"]}" \
| python -c "import sys,json;print(json.load(sys.stdin)['event_id'])")
curl -sN "$SPACE/gradio_api/call/predict/$EVT" -H "Authorization: Bearer $TOKEN"
```
## Logs (SSE)
```bash
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/build"
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/run"
```
## 5B INT8 ONNX conversion recipe
The dense 5B variants ship as INT8 ONNX. To re-run the pipeline:
1. Export fp16 ONNX using lazy fp16 init (first sketch below). Call `torch.set_default_dtype(torch.float16)` before `init_model(cfg, None, device="cpu")`, then stream the safetensors file tensor by tensor into the empty fp16 model. This avoids the ~22 GB fp32 init that OOMs on a 32 GB box. Export with `opset_version=18` and no `dynamic_axes`. Force `sys.stdout.reconfigure(encoding="utf-8")` so torch.onnx's success print does not crash on Windows cp1252.
2. Stream-cast fp16 to fp32 on disk via `onnx.external_data_helper.load_external_data_for_model` plus per-tensor `numpy_helper`. Peak RAM stays close to a single tensor (~250 MB). Drop Cast(fp16/fp32) nodes with a transitive rename closure so consumers point at the original input.
3. Run `quantize.shape_inference.quant_pre_process(skip_onnx_shape=True, skip_optimization=True)`. This routes through ORT's symbolic shape inference, which understands sapiens2's windowed attention; vanilla `onnx.shape_inference` errors with `(6144) vs (512)` on the pointmap and normal heads.
4. `quantize_dynamic(weight_type=QuantType.QInt8, per_channel=True, op_types_to_quantize=["MatMul"], use_external_data_format=True)` (second sketch below). This lowers to `MatMulIntegerToFloat`, which accepts fp32 input and has no 2D-only filter (unlike `MatMulNBitsQuantizer`, which silently skips 3D packed-QKV weights).
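A sketch of step 1. `init_model` and `cfg` come from the sapiens2 codebase (import path omitted here), and the checkpoint/output filenames are hypothetical:

```python
# Sketch of the lazy fp16 export (step 1). init_model / cfg are from the
# sapiens2 repo; checkpoint and output filenames are hypothetical.
import sys
import torch
from safetensors import safe_open

sys.stdout.reconfigure(encoding="utf-8")     # torch.onnx's success print
                                             # crashes cp1252 consoles
torch.set_default_dtype(torch.float16)       # model initializes directly
model = init_model(cfg, None, device="cpu")  # as fp16: ~half the fp32 RAM

# Stream the checkpoint tensor by tensor into the empty fp16 model;
# state_dict() tensors alias the parameters, so copy_ writes in place.
state = model.state_dict()
with safe_open("sapiens2_5b_seg.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        state[name].copy_(f.get_tensor(name).half())

dummy = torch.zeros(1, 3, 1024, 768, dtype=torch.float16)
torch.onnx.export(model.eval(), dummy, "seg_5b_fp16.onnx",
                  opset_version=18)          # static shapes: no dynamic_axes
```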
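And a sketch of steps 3 and 4; the filenames are again hypothetical, but both calls are stock `onnxruntime.quantization` APIs:

```python
# Sketch of steps 3-4 (filenames carried over from step 2, hypothetical).
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.shape_inference import quant_pre_process

# Step 3: ORT symbolic shape inference only. skip_onnx_shape=True avoids
# vanilla onnx.shape_inference, which rejects the windowed-attention heads
# with a "(6144) vs (512)" mismatch.
quant_pre_process(
    "seg_5b_fp32.onnx",
    "seg_5b_fp32_preproc.onnx",
    skip_onnx_shape=True,
    skip_optimization=True,
    save_as_external_data=True,  # graph is far over the 2 GB protobuf limit
)

# Step 4: dynamic INT8 on MatMul only, per-channel weights.
quantize_dynamic(
    "seg_5b_fp32_preproc.onnx",
    "seg_5b_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
    op_types_to_quantize=["MatMul"],
    use_external_data_format=True,
)
```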
Pose-5b is not shipped: it uses a different forward signature (a single bbox-cropped person tensor), and the INT8 quantization attempt did not complete on the available hardware.
## Files
* `app.py`: the whole app in one file: Gradio Space UI, PyTorch dispatch for 0.4b/0.8b/1b, ORT for 5B, inlined keypoint visualization, and the `python app.py onnx ...` CLI
* `requirements.txt`: Python deps, including `sapiens @ git+https://github.com/facebookresearch/sapiens2.git`
* `packages.txt`: apt deps (`libgl1`, `libglib2.0-0`) installed by the Gradio SDK at build time