| tags: | |
| - pytorch | |
| - safetensors | |
| - pose-estimation | |
| - head-pose | |
| - landmark-to-pose | |
| - distillation | |
| - py-feat | |
| library_name: py-feat | |
| pipeline_tag: image-feature-extraction | |
| license: mit | |
| # Py-Feat Pose-MLP v2 — Landmark-to-6DoF Head Pose | |
| A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace | |
| layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head | |
| pose calibrated to img2pose's coordinate frame. Designed for `py-feat` | |
| pipelines that use a face detector without a built-in pose head (e.g. | |
| RetinaFace in `py-feat ≥ 0.7`). | |
| ## Model Description | |
| `py-feat`'s v0.6 production pipeline used `img2pose` as its face detector, | |
| which multi-tasks face localization with 6DoF head pose regression — so | |
| pose came "for free" from the detector. In v0.7 the default face detector | |
| became `RetinaFace` (much higher WIDER FACE Hard AP), which only detects | |
| faces. To preserve the `Fex` schema (`pitch`, `roll`, `yaw`, `x`, `y`, | |
| `z` columns), `py-feat` distills img2pose's pose regression into a small | |
| MLP that operates entirely on already-computed landmarks. | |
| The MLP is bbox-free: it normalizes incoming landmarks by their centroid | |
| and inter-eye distance, so the same model works regardless of whether | |
| the upstream detector produced loose (img2pose) or tight (RetinaFace) | |
| face crops. | |
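The bbox-free normalization described above can be sketched in a few lines. This is an illustrative re-implementation, not the body of `feat.utils.face_pose_mlp.normalize_landmarks`: the function name `normalize_landmarks_sketch`, the use of eye-contour centers (dlib-68 indices 36-41 and 42-47) for the inter-eye distance, and the interleaved `x, y` flattening order are all assumptions.

```python
import math
from statistics import fmean

# dlib-68 layout: indices 36-41 and 42-47 outline the two eyes.
EYE_A = range(36, 42)
EYE_B = range(42, 48)


def _mean_point(points):
    """Return the (x, y) centroid of a sequence of points."""
    return fmean(p[0] for p in points), fmean(p[1] for p in points)


def normalize_landmarks_sketch(landmarks):
    """Center 68 (x, y) landmarks on their centroid and divide by the
    inter-eye distance; returns a flat list of 136 floats."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    cx, cy = _mean_point(landmarks)
    ax, ay = _mean_point([landmarks[i] for i in EYE_A])
    bx, by = _mean_point([landmarks[i] for i in EYE_B])
    scale = math.hypot(ax - bx, ay - by)
    if scale == 0.0:
        raise ValueError("degenerate landmarks: zero inter-eye distance")
    flat = []
    for x, y in landmarks:
        flat.extend(((x - cx) / scale, (y - cy) / scale))
    return flat


# Synthetic landmarks (not a real face), plus a shifted and rescaled copy
# standing in for the same face seen through a different crop.
base = [(3.0 * (i % 17), 5.0 * (i // 17) + 0.1 * i) for i in range(68)]
moved = [(2.5 * x + 100.0, 2.5 * y - 40.0) for x, y in base]

norm_base = normalize_landmarks_sketch(base)
norm_moved = normalize_landmarks_sketch(moved)
max_diff = max(abs(a - b) for a, b in zip(norm_base, norm_moved))
print(f"max difference after normalization: {max_diff:.2e}")
```

Because translation and uniform scaling cancel out, the two inputs map to the same 136-dimensional vector up to floating-point error, which is why crop tightness does not matter to the MLP.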
| ## Model Details | |
| - **Model type**: Multi-layer perceptron (MLP) | |
| - **Architecture**: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15) | |
| → Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) → | |
| LayerNorm → GELU → Dropout → Linear(128→6)` | |
| - **Parameter count**: 236,934 (~0.9 MB safetensors) | |
| - **Input**: 68 2D landmarks, normalized by landmark centroid and | |
| inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`). | |
| - **Output**: 6 values — `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits | |
| z-scored values; the loader de-normalizes using `mean`/`std` stored in | |
| the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to | |
| img2pose's coordinate frame. | |
| - **Framework**: PyTorch (safetensors weight file, no pickle). | |
| - **Inference cost**: ~10 µs / face on CPU (batched), negligible vs. | |
| the upstream face/landmark stages. | |
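The de-normalization step mentioned in the output description amounts to undoing a z-score per axis. The sketch below assumes `mean` and `std` are available as six-element lists in `[Pitch, Roll, Yaw, X, Y, Z]` order; the exact layout of `pose_mlp_v2.json` is not shown in this card, and the statistics used here are placeholders, not the shipped values. The helper names are hypothetical.

```python
import math

AXES = ("Pitch", "Roll", "Yaw", "X", "Y", "Z")


def denormalize_pose(z_scored, mean, std):
    """Undo z-scoring per axis: value = z * std + mean."""
    if not (len(z_scored) == len(mean) == len(std) == len(AXES)):
        raise ValueError("expected six values per argument")
    return [z * s + m for z, m, s in zip(z_scored, mean, std)]


def angles_in_degrees(pose):
    """Convert the three rotation entries (radians) to degrees."""
    return {name: math.degrees(v) for name, v in zip(AXES[:3], pose[:3])}


# Placeholder statistics for illustration only; the real ones are stored
# in the sidecar pose_mlp_v2.json.
example_mean = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
example_std = [0.2, 0.1, 0.3, 1.0, 1.0, 2.0]
raw_output = [0.5, -1.0, 0.0, 0.0, 0.0, 1.0]  # z-scored MLP output

pose = denormalize_pose(raw_output, example_mean, example_std)
print(dict(zip(AXES, pose)))
print(angles_in_degrees(pose))
```

Converting to degrees is only needed for reporting, for example when comparing against MAE figures quoted in degrees; the values returned by the loader stay in radians.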
| ## Training Details | |
| - **Teacher**: `img2pose` (Albiero et al., 2021). The MLP is trained to | |
| match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs. | |
| - **Training corpus**: CelebV-HQ — `n_clips = 35,445`, | |
| `n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with | |
| `FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher | |
| signal on degenerate poses). | |
| - **Loss**: MSE on z-scored 6D output. | |
| - **Optimizer**: Adam, `lr=1e-3`, `batch_size=1024`. | |
| - **Epochs**: 40 (best val loss at last epoch — see `pose_mlp_v2.json` | |
| for per-epoch history). | |
| - **Hardware**: single GPU (training takes ~2 hr). | |
| - **Seed**: 42. | |
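As a rough illustration of the frame filter listed under the training corpus, the sketch below keeps a frame only if its face score is at least 0.8 and no rotation angle exceeds 75°. The `Frame` record and `keep_frame` helper are hypothetical, and reading `|pose| > 75°` as a per-axis test on teacher angles given in radians is an assumption; the card does not spell out the exact rule.

```python
import math
from dataclasses import dataclass

MIN_FACE_SCORE = 0.8
MAX_ANGLE_DEG = 75.0


@dataclass
class Frame:
    """Hypothetical per-frame record: detector score plus teacher pose."""
    face_score: float
    pitch: float  # radians
    roll: float   # radians
    yaw: float    # radians


def keep_frame(frame):
    """Return True if the frame passes both training filters."""
    if frame.face_score < MIN_FACE_SCORE:
        return False
    largest = max(abs(math.degrees(a))
                  for a in (frame.pitch, frame.roll, frame.yaw))
    return largest <= MAX_ANGLE_DEG


frames = [
    Frame(0.95, 0.10, 0.00, -0.20),             # kept
    Frame(0.50, 0.10, 0.00, -0.20),             # dropped: low face score
    Frame(0.95, 0.00, 0.00, math.radians(80)),  # dropped: yaw beyond 75°
]
kept = [f for f in frames if keep_frame(f)]
print(f"kept {len(kept)} of {len(frames)} frames")
```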
| ### Held-out validation MAE on CelebV-HQ (clip-disjoint split) | |
| | Axis | MAE (°) | | |
| |---|---| | |
| | Pitch | 2.66 | | |
| | Roll | 2.34 | | |
| | Yaw | 1.58 | | |
| For reference, img2pose's reported MAE on the AFLW2000-3D / BIWI test | |
| sets is ~4° on average. The MLP is trained only to imitate its teacher, | |
| so it should not be expected to outperform it: the values above measure | |
| the gap between the MLP's predictions and the teacher's, not error | |
| against ground truth such as a motion-capture rig. | |
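A per-axis MAE of this kind can be computed as below. This is a minimal sketch, assuming student and teacher angles are both in radians and the error is reported in degrees; `mae_degrees` is a hypothetical helper, the sample values are made up, and angle wrap-around is ignored because the training filter keeps poses well inside ±180°.

```python
import math
from statistics import fmean


def mae_degrees(student_rad, teacher_rad):
    """Mean absolute student-teacher difference for one axis, in degrees."""
    if len(student_rad) != len(teacher_rad) or not student_rad:
        raise ValueError("need two non-empty sequences of equal length")
    return fmean(abs(math.degrees(s - t))
                 for s, t in zip(student_rad, teacher_rad))


# Made-up yaw predictions (radians) for three frames.
student_yaw = [0.10, -0.05, 0.02]
teacher_yaw = [0.12, -0.05, 0.00]
yaw_mae = mae_degrees(student_yaw, teacher_yaw)
print(f"yaw MAE vs. teacher: {yaw_mae:.3f} degrees")
```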
| ### v1 → v2 changelog | |
| | Aspect | v1 | v2 | | |
| |---|---|---| | |
| | Hidden widths | 256→128→64 | 512→256→128 | | |
| | Layer block | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout | | |
| | Dropout | 0.10 | 0.15 | | |
| | Training frames | 569,678 | 2,783,134 | | |
| | Epochs | 30 | 40 | | |
| | Best val loss | 0.0809 | 0.0777 | | |
| | Roll MAE (°) | 2.530 | 2.335 | | |
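The size difference between the two versions follows directly from the layer widths. The count below reproduces the 236,934 parameters stated for v2; the v1 figure is only an estimate that assumes the same 136-dimensional input, the same 6-dimensional output, and no normalization layers. `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(widths, layer_norm):
    """Count trainable parameters of an MLP with the given layer widths.

    Each Linear layer contributes n_in * n_out weights plus n_out biases.
    If layer_norm is True, every hidden layer also carries a LayerNorm
    with n_out scale and n_out shift parameters.
    """
    total = 0
    n_linear = len(widths) - 1
    for i, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
        total += n_in * n_out + n_out
        is_hidden = i < n_linear - 1
        if layer_norm and is_hidden:
            total += 2 * n_out
    return total


v2_params = mlp_param_count([136, 512, 256, 128, 6], layer_norm=True)
v1_params = mlp_param_count([136, 256, 128, 64, 6], layer_norm=False)

print(f"v2: {v2_params:,} parameters")
print(f"v1 (estimated): {v1_params:,} parameters")
print(f"v2 float32 size: {v2_params * 4 / 1e6:.2f} MB")
```

At four bytes per float32 weight, the v2 count is consistent with the ~0.9 MB safetensors file.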
| ## Intended Use | |
| - **Primary**: Drop-in replacement for img2pose's pose head when using | |
| `py-feat` with a face detector that doesn't predict pose | |
| (`face_model='retinaface'` in `feat.Detector`, MediaPipe in | |
| `feat.MPDetector`). | |
| - **Secondary**: Any pipeline that produces 68 dlib-style face landmarks | |
| and wants img2pose-compatible head pose without re-running img2pose. | |
| ### Out of scope | |
| - Eye / gaze direction — use `L2CS-Net` for gaze. | |
| - Mediapipe-478 landmarks — translate to 68 dlib landmarks first. | |
| - Head-pose inference from an incomplete landmark set (fewer than 68 points). | |
| ## Usage | |
| The MLP is loaded automatically by `feat.Detector` when | |
| `face_model != 'img2pose'`. To call it directly: | |
| ```python | |
| import torch | |
| from feat.utils.face_pose_mlp import pose_from_landmarks_mlp | |
| # 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet. | |
| # Random placeholder values keep this snippet runnable; substitute real landmarks. | |
| landmarks = (torch.rand(68, 2) * 256.0).unsqueeze(0) # [1, 68, 2], float32 | |
| pose = pose_from_landmarks_mlp(landmarks) # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z) | |
| print(pose) | |
| ``` | |
| Weights resolve from (in order): | |
| 1. `FEAT_POSE_MLP_PATH` environment variable | |
| 2. `models/pose_mlp_v2.safetensors` in the repo | |
| 3. This HuggingFace repo (`py-feat/pose_mlp_v2`) | |
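The lookup order above can be pictured as a simple fallback chain. The sketch below is not `py-feat`'s loader: `resolve_weights_source` is a hypothetical helper, and whether the real loader verifies that the environment-variable path exists is not stated here, so this version returns it unchecked. The Hub case returns only the repo id rather than downloading anything.

```python
import os
from pathlib import Path

HUB_REPO_ID = "py-feat/pose_mlp_v2"
LOCAL_WEIGHTS = Path("models") / "pose_mlp_v2.safetensors"


def resolve_weights_source(repo_root=".", environ=None):
    """Return (kind, location) following the documented priority:
    environment override, then a local repo file, then the Hub repo."""
    environ = os.environ if environ is None else environ

    override = environ.get("FEAT_POSE_MLP_PATH")
    if override:
        return "env", override

    local = Path(repo_root) / LOCAL_WEIGHTS
    if local.is_file():
        return "local", str(local)

    return "hub", HUB_REPO_ID


# An explicit override wins over everything else.
with_env = resolve_weights_source(
    environ={"FEAT_POSE_MLP_PATH": "/tmp/custom_pose_mlp.safetensors"}
)
# With no override and no local file, fall back to the Hub repo id.
fallback = resolve_weights_source(repo_root="/nonexistent-root", environ={})
print(with_env)
print(fallback)
```

Setting `FEAT_POSE_MLP_PATH` is therefore the way to pin a specific weight file, for example on a machine without network access.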
| ## Limitations | |
| - The MLP cannot improve on img2pose's accuracy; it only approximates it | |
| more cheaply and from bbox-free input. Use img2pose directly if you | |
| need img2pose's exact behavior (a distillation gap of roughly 1.6° to | |
| 2.7° MAE per rotation axis remains on the validation split). | |
| - Trained on CelebV-HQ — performance on non-frontal, occluded, or | |
| heavily-rotated faces (>75°) is degraded by both the teacher and the | |
| data filter. | |
| - Output coordinates follow img2pose's frame, not that of a standard | |
| head-pose benchmark such as BIWI. Pose values are consistent across | |
| the `py-feat` pipeline but may need recalibration before comparison | |
| with other tools. | |
| ## Citation | |
| If you use `py-feat` and this pose-MLP, please cite `py-feat`, img2pose, | |
| and CelebV-HQ (the training corpus): | |
| ```bibtex | |
| @article{cheong2023pyfeat, | |
| title={Py-Feat: Python Facial Expression Analysis Toolbox}, | |
| author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.}, | |
| journal={Affective Science}, | |
| volume={4}, | |
| pages={781--796}, | |
| year={2023} | |
| } | |
| @inproceedings{albiero2021img2pose, | |
| title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation}, | |
| author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal}, | |
| booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, | |
| pages={7617--7627}, | |
| year={2021} | |
| } | |
| @inproceedings{zhu2022celebvhq, | |
| title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset}, | |
| author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change}, | |
| booktitle={Proceedings of the European Conference on Computer Vision (ECCV)}, | |
| year={2022} | |
| } | |
| ``` | |
| ## License | |
| MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the | |
| training corpus (CelebV-HQ) is released for non-commercial research | |
| use — please honor each upstream license if you re-train or | |
| re-distribute. | |
| ## Files | |
| - `pose_mlp_v2.safetensors` — model weights (1 MB) | |
| - `pose_mlp_v2.json` — architecture, output-normalization stats, training | |
| history, validation MAE per epoch | |
| - `README.md` — this card | |
| ## Acknowledgments | |
| Distilled from img2pose by Vítor Albiero et al. (Meta AI / NVIDIA), | |
| trained on CelebV-HQ by Hao Zhu et al. (CUHK / S-Lab NTU). Built and | |
| maintained by [Cosanlab](https://cosanlab.com) at Dartmouth. | |