---
title: Sapiens2 CPU
emoji: 🧍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: other
---
# Sapiens2 CPU
Meta's `facebook/sapiens2-*` models running on a free HF CPU Space. 15 variants are exposed: seg, normal, pointmap, and pose at 0.4b, 0.8b, and 1b, plus seg-5b, normal-5b, and pointmap-5b as INT8 ONNX. Everything is curl-callable with a Bearer token.
## Variants and inference time on the included 6000×4000 demo image
| Task | Notes | 0.4b | 0.8b | 1b | 5b (INT8 ONNX) |
|---|---|---|---|---|---|
| seg | DOME 29-class body parts | 57 s | 74 s | 208 s | 189 s |
| normal | per-pixel surface normals | 72 s | 84 s | 206 s | 359 s |
| pointmap | per-pixel XYZ in meters | 78 s | 99 s | 274 s | 386 s |
| pose | DETR detect, 308 keypoints | 47 s | 68 s | 232 s | not shipped |
All 15 variants verified via the Gradio API on 2026-05-12. Times include first-call model downloads.
0.4b through 1b run as fp32 PyTorch. 5B runs as INT8 ONNX (5 to 6 GB on disk; fp32 5B would need ~20 GB RAM, more than the free tier provides). Dense 0.4b/0.8b models share an LRU(2) cache. Loading any 1B variant hard-clears all model caches (dense + pose + ORT), since 16 GB cpu-basic cannot fit two 1B-class models simultaneously. Pose has its own slot, and DETR (`facebook/detr-resnet-50`) is sticky-loaded once.
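A minimal sketch of that policy, for orientation: `_ORT_SESSIONS` is the name `app.py` actually uses (it reappears in the OOM note below); `_DENSE_CACHE`, `_POSE_SLOT`, and the loader stub are hypothetical stand-ins.

```python
# Minimal sketch of the cache policy above. _ORT_SESSIONS matches app.py;
# _DENSE_CACHE, _POSE_SLOT, and load_model are hypothetical stand-ins.
from collections import OrderedDict

_DENSE_CACHE = OrderedDict()  # LRU(2): dense 0.4b / 0.8b models
_POSE_SLOT = {}               # pose model + sticky DETR detector
_ORT_SESSIONS = {}            # INT8 ONNX Runtime sessions (5B)

def load_model(task: str, size: str):
    """Stand-in for the real loader (PyTorch up to 1b, ORT for 5b)."""
    return object()

def get_dense_model(task: str, size: str):
    key = (task, size)
    if size == "1b":
        # 16 GB cpu-basic cannot hold two 1B-class models: drop everything.
        _DENSE_CACHE.clear()
        _POSE_SLOT.clear()
        _ORT_SESSIONS.clear()
        return load_model(task, size)  # 1b is not cached alongside others
    if key in _DENSE_CACHE:
        _DENSE_CACHE.move_to_end(key)  # refresh LRU recency
        return _DENSE_CACHE[key]
    model = load_model(task, size)
    _DENSE_CACHE[key] = model
    while len(_DENSE_CACHE) > 2:       # LRU(2): evict the oldest entry
        _DENSE_CACHE.popitem(last=False)
    return model
```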
**5B chain limitation:** calling a 5B variant right after another 5B variant on the same Space instance OOMs. ONNX Runtime's C++ session shutdown is not synchronous with the Python `_ORT_SESSIONS.clear()` call, so loading the next 5B session before the previous one's worker threads exit peaks RAM above 16 GB. If you need to benchmark multiple 5B variants, factory-restart the Space (Settings → Factory restart) between calls, or run one variant per cold Space.
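If you control the Space, the factory restart can be scripted with `huggingface_hub` rather than clicked through Settings (needs a write token):

```python
# Programmatic equivalent of Settings -> Factory restart.
from huggingface_hub import HfApi

HfApi(token="hf_xxx").restart_space(
    "WeReCooking/sapiens2-cpu", factory_reboot=True
)
```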
The model expects a fixed 1024×768 input tensor (NCHW with H=1024 and W=768, a portrait canvas in Meta's convention). Any input is resized with aspect ratio preserved, then padded to that canvas (sketch below).
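A sketch of that preprocessing, assuming black padding and omitting the normalization constants `app.py` applies:

```python
# Aspect-preserving resize onto a fixed 1024x768 (HxW) portrait canvas.
# Padding color and normalization are assumptions; app.py may differ.
import numpy as np
from PIL import Image

def to_canvas(img: Image.Image, H: int = 1024, W: int = 768) -> np.ndarray:
    scale = min(W / img.width, H / img.height)        # fit inside the canvas
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.convert("RGB").resize((new_w, new_h), Image.BILINEAR)
    canvas = np.zeros((H, W, 3), dtype=np.uint8)      # pad with black
    canvas[:new_h, :new_w] = np.asarray(resized)
    # HWC uint8 -> NCHW float32; normalization constants omitted here.
    return canvas.transpose(2, 0, 1)[None].astype(np.float32)
```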
## CPU-friendly ONNX exports
Companion repo: [`WeReCooking/sapiens2-onnx`](https://huggingface.co/WeReCooking/sapiens2-onnx) (public). Files live in per-task folders `seg/`, `normal/`, `pointmap/`, `pose/`. Each variant is `<task>/<task>_<size>_<precision>.onnx` plus a `.onnx.data` external sidecar. 15 ONNX artifacts shipped: 12 covering 0.4b/0.8b/1b (fp16 for seg-0.4b, fp32 for the rest), and 3 new 5B int8 files (seg, normal, pointmap). Cosine similarity vs PyTorch fp32 is 0.999 or better on all shipped variants.
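One variant can also be pulled directly with `huggingface_hub`; the filename below just instantiates the pattern above (fp16 for seg-0.4b), so check it against the repo listing:

```python
# Download one variant plus its external-data sidecar.
from huggingface_hub import hf_hub_download

sidecar = hf_hub_download("WeReCooking/sapiens2-onnx",
                          "seg/seg_0.4b_fp16.onnx.data")
model = hf_hub_download("WeReCooking/sapiens2-onnx",
                        "seg/seg_0.4b_fp16.onnx")
# Both files land in the same snapshot directory, so onnxruntime resolves
# the .onnx.data sidecar automatically when given `model`.
```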
Turnkey CLI built into `app.py` (no sapiens2 / PyTorch dep needed; install `requirements.txt`):
```bash
export HF_TOKEN=hf_xxx
python app.py onnx seg 0.4b photo.jpg --output seg_overlay.png
python app.py onnx normal 1b photo.jpg --output normals.png
python app.py onnx pointmap 0.8b photo.jpg --output depth.png
python app.py onnx pose 0.4b photo.jpg --output pose.png
python app.py onnx seg 5b photo.jpg --output seg_5b.png
```
## Curl tests
```bash
TOKEN="hf_xxx"
SPACE="https://werecooking-sapiens2-cpu.hf.space"
IMG="https://huggingface.co/spaces/facebook/sapiens2-seg/resolve/main/assets/images/pexels-alex-green-5699868.jpg"
EVT=$(curl -s -X POST "$SPACE/gradio_api/call/predict" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d "{\"data\":[{\"path\":\"$IMG\",\"meta\":{\"_type\":\"gradio.FileData\"}},\"seg\",\"0.4b\"]}" \
| python -c "import sys,json;print(json.load(sys.stdin)['event_id'])")
curl -sN "$SPACE/gradio_api/call/predict/$EVT" -H "Authorization: Bearer $TOKEN"
```
## Logs (SSE)
```bash
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/build"
curl -N -H "Authorization: Bearer $TOKEN" "https://huggingface.co/api/spaces/WeReCooking/sapiens2-cpu/logs/run"
```
## 5B INT8 ONNX conversion recipe
The dense 5B variants ship as INT8 ONNX. To re-run the pipeline:
1. Export fp16 ONNX using lazy fp16 init (first sketch below). Call `torch.set_default_dtype(torch.float16)` before `init_model(cfg, None, device="cpu")`, then stream the safetensors file tensor by tensor into the empty fp16 model. This avoids the ~22 GB fp32 init that OOMs on a 32 GB box. Export with `opset_version=18` and no `dynamic_axes`. Force `sys.stdout.reconfigure(encoding="utf-8")` so torch.onnx's success print does not crash on Windows cp1252.
2. Stream-cast fp16 to fp32 on disk via `onnx.external_data_helper.load_external_data_for_model` plus per-tensor `numpy_helper`. Peak RAM stays close to a single tensor (~250 MB). Drop Cast(fp16/fp32) nodes with a transitive rename closure so consumers point at the original input.
3. Run `quantize.shape_inference.quant_pre_process(skip_onnx_shape=True, skip_optimization=True)`. This routes through ORT's symbolic shape inference, which understands sapiens2's windowed attention; vanilla `onnx.shape_inference` errors with `(6144) vs (512)` on the pointmap and normal heads.
4. `quantize_dynamic(weight_type=QuantType.QInt8, per_channel=True, op_types_to_quantize=["MatMul"], use_external_data_format=True)` (second sketch below). This lowers to `MatMulIntegerToFloat`, which accepts fp32 input and has no 2D-only filter (unlike `MatMulNBitsQuantizer`, which silently skips 3D packed-QKV weights).
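A sketch of step 1. `init_model` and `cfg` come from the sapiens2 codebase (import path omitted here), and the checkpoint/output filenames are hypothetical:

```python
# Sketch of the lazy fp16 export (step 1). init_model / cfg are from the
# sapiens2 repo; checkpoint and output filenames are hypothetical.
import sys
import torch
from safetensors import safe_open

sys.stdout.reconfigure(encoding="utf-8")     # torch.onnx's success print
                                             # crashes cp1252 consoles
torch.set_default_dtype(torch.float16)       # model initializes directly
model = init_model(cfg, None, device="cpu")  # as fp16: ~half the fp32 RAM

# Stream the checkpoint tensor by tensor into the empty fp16 model;
# state_dict() tensors alias the parameters, so copy_ writes in place.
state = model.state_dict()
with safe_open("sapiens2_5b_seg.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        state[name].copy_(f.get_tensor(name).half())

dummy = torch.zeros(1, 3, 1024, 768, dtype=torch.float16)
torch.onnx.export(model.eval(), dummy, "seg_5b_fp16.onnx",
                  opset_version=18)          # static shapes: no dynamic_axes
```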
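And a sketch of steps 3 and 4; the filenames are again hypothetical, but both calls are stock `onnxruntime.quantization` APIs:

```python
# Sketch of steps 3-4 (filenames carried over from step 2, hypothetical).
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.shape_inference import quant_pre_process

# Step 3: ORT symbolic shape inference only. skip_onnx_shape=True avoids
# vanilla onnx.shape_inference, which rejects the windowed-attention heads
# with a "(6144) vs (512)" mismatch.
quant_pre_process(
    "seg_5b_fp32.onnx",
    "seg_5b_fp32_preproc.onnx",
    skip_onnx_shape=True,
    skip_optimization=True,
    save_as_external_data=True,  # graph is far over the 2 GB protobuf limit
)

# Step 4: dynamic INT8 on MatMul only, per-channel weights.
quantize_dynamic(
    "seg_5b_fp32_preproc.onnx",
    "seg_5b_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
    op_types_to_quantize=["MatMul"],
    use_external_data_format=True,
)
```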
Pose-5b is not shipped: it uses a different forward signature (a single bbox-cropped person tensor), and the INT8 quantization attempt did not complete on the available hardware.
## Files
* `app.py`: the whole app in one file: Gradio Space UI, PyTorch dispatch for 0.4b/0.8b/1b, ORT for 5B, inlined keypoint visualization, and the `python app.py onnx ...` CLI
* `requirements.txt`: Python deps, including `sapiens @ git+https://github.com/facebookresearch/sapiens2.git`
* `packages.txt`: apt deps (`libgl1`, `libglib2.0-0`) installed by the Gradio SDK at build time