	---
	tags:
	- pytorch
	- safetensors
	- pose-estimation
	- head-pose
	- landmark-to-pose
	- distillation
	- py-feat
	library_name: py-feat
	pipeline_tag: image-feature-extraction
	license: mit
	---

	# Py-Feat Pose-MLP v2 — Landmark-to-6DoF Head Pose

	A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace
	layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head
	pose calibrated to img2pose's coordinate frame. Designed for `py-feat`
	pipelines that use a face detector without a built-in pose head (e.g.
	RetinaFace in `py-feat ≥ 0.7`).

	## Model Description

	`py-feat`'s v0.6 production pipeline used `img2pose` as its face detector.
	img2pose jointly performs face localization and 6DoF head pose regression,
	so pose came "for free" from the detector. In v0.7 the default face
	detector became `RetinaFace`, which has a much higher WIDER FACE Hard AP
	but only detects faces. To preserve the `Fex` schema (`pitch`, `roll`,
	`yaw`, `x`, `y`, `z` columns), `py-feat` distills img2pose's pose
	regression into a small MLP that operates entirely on already-computed
	landmarks.

	The MLP is bbox-free: it normalizes incoming landmarks by their centroid
	and inter-eye distance, so the same model works regardless of whether
	the upstream detector produced loose (img2pose) or tight (RetinaFace)
	face crops.
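
A minimal sketch of this normalization, assuming the inter-eye distance is measured between the centers of the two eye contours (indices 36-41 and 42-47 in the dlib-68 layout). The name `normalize_landmarks_sketch` and the `eps` guard are illustrative only; the actual `feat.utils.face_pose_mlp.normalize_landmarks` may differ in detail.

```python
import numpy as np

# dlib-68 eye contour indices (0-based): right eye 36-41, left eye 42-47.
RIGHT_EYE = slice(36, 42)
LEFT_EYE = slice(42, 48)


def normalize_landmarks_sketch(landmarks: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Center landmarks on their centroid and scale by inter-eye distance.

    landmarks: array of shape [N, 68, 2] in pixel coordinates.
    Returns an array of shape [N, 136], the flattened MLP input.
    """
    centroid = landmarks.mean(axis=1, keepdims=True)            # [N, 1, 2]
    right_eye = landmarks[:, RIGHT_EYE].mean(axis=1)            # [N, 2]
    left_eye = landmarks[:, LEFT_EYE].mean(axis=1)              # [N, 2]
    inter_eye = np.linalg.norm(left_eye - right_eye, axis=1)    # [N]
    scale = np.maximum(inter_eye, eps)[:, None, None]           # [N, 1, 1]
    normalized = (landmarks - centroid) / scale
    return normalized.reshape(len(landmarks), -1)


# The same face at a different position and size maps to the same input.
rng = np.random.default_rng(0)
face = rng.uniform(0.0, 100.0, size=(1, 68, 2))
shifted_and_scaled = face * 3.0 + np.array([50.0, -20.0])
same = np.allclose(
    normalize_landmarks_sketch(face),
    normalize_landmarks_sketch(shifted_and_scaled),
)
print(same)
```

Because both the offset and the scale are derived from the landmarks themselves, no bounding box enters the computation, which is why crop tightness does not matter.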

	## Model Details

	- Model type: Multi-layer perceptron (MLP)
	- Architecture: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15)
	→ Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) →
	LayerNorm → GELU → Dropout → Linear(128→6)`
	- Parameter count: 236,934 (~0.9 MB safetensors)
	- Input: 68 2D landmarks, normalized by landmark centroid and
	inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
	- Output: 6 values — `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits
	z-scored values; the loader de-normalizes using `mean`/`std` stored in
	the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to
	img2pose's coordinate frame.
	- Framework: PyTorch (safetensors weight file, no pickle).
	- Inference cost: ~10 µs / face on CPU (batched), negligible vs.
	the upstream face/landmark stages.
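
The layer stack listed above can be sketched in PyTorch as follows. This is an illustration, not the packaged module: `build_pose_mlp_sketch` is a hypothetical name, the parameter names will not necessarily match the keys in `pose_mlp_v2.safetensors`, and the sketch assumes all three dropout layers share the 0.15 rate given for the first one.

```python
import torch
from torch import nn


def build_pose_mlp_sketch(dropout: float = 0.15) -> nn.Sequential:
    """Rebuild the layer stack listed in the card (weights are random here)."""

    def block(n_in: int, n_out: int) -> list:
        # Linear -> LayerNorm -> GELU -> Dropout, as in the architecture line.
        return [nn.Linear(n_in, n_out), nn.LayerNorm(n_out), nn.GELU(), nn.Dropout(dropout)]

    return nn.Sequential(
        *block(136, 512),
        *block(512, 256),
        *block(256, 128),
        nn.Linear(128, 6),
    )


model = build_pose_mlp_sketch().eval()  # eval() disables dropout
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 236934

with torch.no_grad():
    out = model(torch.zeros(4, 136))  # batch of 4 flattened, normalized landmark sets
print(tuple(out.shape))  # (4, 6)
```

The count reproduces the figure in the list: three Linear layers with their LayerNorm scale and bias terms plus the final 128→6 projection sum to 236,934 parameters, or roughly 0.9 MB at 4 bytes each.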

	## Training Details

	- Teacher: `img2pose` (Albiero et al., 2021). The MLP is trained to
	match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
	- Training corpus: CelebV-HQ — `n_clips = 35,445`,
	`n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with
	`FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher
	signal on degenerate poses).
	- Loss: MSE on z-scored 6D output.
	- Optimizer: Adam, `lr=1e-3`, `batch_size=1024`.
	- Epochs: 40 (best val loss at last epoch — see `pose_mlp_v2.json`
	for per-epoch history).
	- Hardware: single GPU (training takes ~2 hr).
	- Seed: 42.
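
The "MSE on z-scored 6D output" objective can be sketched as below. This assumes the per-dimension `mean`/`std` are computed over the teacher's training targets; the helper names and the synthetic poses are illustrative and are not taken from the training code.

```python
import numpy as np


def fit_zscore(targets: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension mean and std of teacher poses, shape [6] each."""
    return targets.mean(axis=0), targets.std(axis=0)


def zscore(targets: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (targets - mean) / std


def denormalize(z: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Inverse of zscore: map network outputs back to pose units."""
    return z * std + mean


def mse(pred_z: np.ndarray, target_z: np.ndarray) -> float:
    return float(np.mean((pred_z - target_z) ** 2))


# Synthetic stand-in for teacher poses [Pitch, Roll, Yaw, X, Y, Z]; not real data.
rng = np.random.default_rng(42)
teacher = rng.normal(
    loc=[0.0, 0.0, 0.0, 0.0, 0.0, 10.0],
    scale=[0.2, 0.1, 0.3, 1.0, 1.0, 2.0],
    size=(1000, 6),
)

mean, std = fit_zscore(teacher)
target_z = zscore(teacher, mean, std)
loss_if_perfect = mse(target_z, target_z)     # 0.0 when prediction equals target
recovered = denormalize(target_z, mean, std)  # back to teacher units
print(loss_if_perfect, np.allclose(recovered, teacher))
```

Z-scoring puts the angles and the translations on a comparable scale, so no single output dimension dominates the loss; the stored `mean`/`std` are then all that is needed to undo the scaling at inference time.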

	### Held-out validation MAE on CelebV-HQ (clip-disjoint split)

	| Axis | MAE (°) |
	|---|---|
	| Pitch | 2.66 |
	| Roll | 2.34 |
	| Yaw | 1.58 |

	For reference, img2pose's reported MAE on the AFLW2000-3D and BIWI test
	sets is ~4° on average. The MLP cannot exceed its teacher: the values
	above measure the gap between the MLP's and the teacher's predictions,
	not error against a ground-truth motion-capture rig.
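
A student-versus-teacher MAE of this kind can be computed as in the sketch below. It assumes both sets of angles are in radians with columns ordered (Pitch, Roll, Yaw), and it ignores angle wrap-around, a simplification for poses far from ±180°. The function name and the toy numbers are illustrative, not evaluation output.

```python
import numpy as np


def angular_mae_degrees(student_rad: np.ndarray, teacher_rad: np.ndarray) -> np.ndarray:
    """Per-axis mean absolute error in degrees between two [N, 3] angle arrays."""
    return np.degrees(np.abs(student_rad - teacher_rad)).mean(axis=0)


# Toy values, columns ordered (Pitch, Roll, Yaw); not real predictions.
teacher = np.radians([[10.0, 0.0, -20.0],
                      [-5.0, 3.0, 15.0]])
student = np.radians([[12.0, 1.0, -19.0],
                      [-6.0, 1.0, 18.0]])

mae = angular_mae_degrees(student, teacher)
print(np.round(mae, 2))
```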

	### v1 → v2 changelog

	| Aspect | v1 | v2 |
	|---|---|---|
	| Hidden | 256→128→64 | 512→256→128 |
	| Activation | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
	| Dropout | 0.10 | 0.15 |
	| Training frames | 569,678 | 2,783,134 |
	| Epochs | 30 | 40 |
	| Best val loss | 0.0809 | 0.0777 |
	| Roll MAE (°) | 2.530 | 2.335 |

	## Intended Use

	- Primary: Drop-in replacement for img2pose's pose head when using
	`py-feat` with a face detector that doesn't predict pose
	(`face_model='retinaface'` in `feat.Detector`, MediaPipe in
	`feat.MPDetector`).
	- Secondary: Any pipeline that produces 68 dlib-style face landmarks
	and wants img2pose-compatible head pose without re-running img2pose.

	### Out of scope

	- Eye / gaze direction: use `L2CS-Net` for gaze.
	- MediaPipe-478 landmarks: translate to the 68-point dlib layout first.
	- Pose inference from partial landmark sets (fewer than 68 points).

	## Usage

	The MLP is loaded automatically by `feat.Detector` when
	`face_model != 'img2pose'`. To call it directly:

	```python
	import torch
	from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

	# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
	# Random placeholder values so the snippet runs; substitute real detector output.
	landmarks = torch.rand(68, 2) * 256.0  # [68, 2]
	landmarks = landmarks.unsqueeze(0)  # add batch dimension -> [1, 68, 2]

	pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
	print(pose)
	```
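
Since the angles come back in radians while the validation table reports degrees, a small helper for labeling one output row can be convenient. The sketch below is an illustration: `pose_row_to_dict` is a hypothetical name, not part of `py-feat`, and the example row is hand-written rather than model output.

```python
import math

# Output order stated in the card, using the lowercase Fex column names.
POSE_COLUMNS = ("pitch", "roll", "yaw", "x", "y", "z")


def pose_row_to_dict(row, degrees: bool = True) -> dict:
    """Label one 6-value pose row; optionally convert the three angles to degrees."""
    values = [float(v) for v in row]
    if len(values) != len(POSE_COLUMNS):
        raise ValueError(f"expected 6 pose values, got {len(values)}")
    if degrees:
        values[:3] = [math.degrees(v) for v in values[:3]]
    return dict(zip(POSE_COLUMNS, values))


# Hand-written example row (angles in radians), not model output.
example = [math.pi / 12, 0.0, -math.pi / 6, 0.1, -0.2, 5.0]
labeled = pose_row_to_dict(example)
print(labeled)
```

With a tensor returned by `pose_from_landmarks_mlp`, the first face's row could be passed as `pose[0].tolist()`.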

	Weights resolve from (in order):
	1. `FEAT_POSE_MLP_PATH` environment variable
	2. `models/pose_mlp_v2.safetensors` in the repo
	3. This HuggingFace repo (`py-feat/pose_mlp_v2`)
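
The resolution order above could look roughly like the sketch below. It only reports which source would be used and downloads nothing. The function name, the `repo_root` argument, and the choice not to check that the environment-variable path exists are assumptions for illustration, not the loader's actual behavior.

```python
import os
from pathlib import Path

ENV_VAR = "FEAT_POSE_MLP_PATH"
LOCAL_WEIGHTS = Path("models") / "pose_mlp_v2.safetensors"
HUB_REPO = "py-feat/pose_mlp_v2"


def resolve_pose_mlp_source(repo_root: Path, env=None) -> tuple[str, str]:
    """Return (kind, location) for the first source that applies."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    if override:
        # 1. An explicit path from the environment wins.
        return "env", override
    local = repo_root / LOCAL_WEIGHTS
    if local.is_file():
        # 2. Weights checked out alongside the code.
        return "local", str(local)
    # 3. Fall back to the Hub repo; a real loader would download here.
    return "hub", HUB_REPO


print(resolve_pose_mlp_source(Path("."), env={}))
```

Putting the environment variable first lets a pinned or offline copy of the weights override everything else without code changes.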

	## Limitations

	- The MLP cannot improve on img2pose's accuracy; it only matches it
	more efficiently with bbox-free input. Use img2pose directly if you
	need img2pose's exact behavior (a distillation gap remains: about
	1.6° to 2.7° MAE per axis on the held-out validation split).
	- Trained on CelebV-HQ. Performance on non-frontal, occluded, or
	heavily rotated faces (>75°) is degraded by both the teacher and the
	data filter.
	- Output coordinates are in img2pose's frame, not a standard benchmark
	frame such as BIWI's. Pose values are consistent across the `py-feat`
	pipeline but may need recalibration to compare with other tools.

	## Citation

	If you use `py-feat` and this pose-MLP, please cite both `py-feat` and
	img2pose:

	```bibtex
	@article{cheong2023pyfeat,
	title={Py-Feat: Python Facial Expression Analysis Toolbox},
	author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
	journal={Affective Science},
	volume={4},
	pages={781--796},
	year={2023}
	}

	@inproceedings{albiero2021img2pose,
	title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
	author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	pages={7617--7627},
	year={2021}
	}

	@inproceedings{zhu2022celebvhq,
	title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
	author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
	booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
	year={2022}
	}
	```

	## License

	MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the
	training corpus (CelebV-HQ) is released for non-commercial research
	use — please honor each upstream license if you re-train or
	re-distribute.

	## Files

	- `pose_mlp_v2.safetensors` — model weights (1 MB)
	- `pose_mlp_v2.json` — architecture, output-normalization stats, training
	history, validation MAE per epoch
	- `README.md` — this card

	## Acknowledgments

	Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA),
	trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and
	maintained by [Cosanlab](https://cosanlab.com) at Dartmouth.