---
tags:
- pytorch
- safetensors
- pose-estimation
- head-pose
- landmark-to-pose
- distillation
- py-feat
library_name: py-feat
pipeline_tag: image-feature-extraction
license: mit
---
# Py-Feat Pose-MLP v2 — Landmark-to-6DoF Head Pose
A small distilled MLP that takes 68 face landmarks (the dlib-68 / OpenFace
layout produced by `mobilefacenet`, OpenFace, etc.) and emits 6DoF head
pose calibrated to img2pose's coordinate frame. Designed for `py-feat`
pipelines that use a face detector without a built-in pose head (e.g.
RetinaFace in `py-feat ≥ 0.7`).
## Model Description
`py-feat`'s v0.6 production pipeline used `img2pose` as its face detector.
Because `img2pose` jointly performs face localization and 6DoF head pose
regression, pose came "for free" from the detector. In v0.7 the default
face detector became `RetinaFace`, which has a much higher WIDER FACE Hard
AP but only detects faces. To preserve the `Fex` schema (`pitch`, `roll`,
`yaw`, `x`, `y`, `z` columns), `py-feat` distills img2pose's pose
regression into a small MLP that operates entirely on already-computed
landmarks.

The MLP is bbox-free: it normalizes incoming landmarks by their centroid
and inter-eye distance, so the same model works regardless of whether
the upstream detector produced loose (img2pose) or tight (RetinaFace)
face crops.
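
The normalization described above can be sketched as follows. This is an illustration only, not the code in `feat.utils.face_pose_mlp`: the function name `normalize_landmarks_sketch`, the choice of eye-contour centers (indices 36-41 and 42-47 in the dlib-68 layout) to define inter-eye distance, and the flattening to 136 values are all assumptions.

```python
import numpy as np

# dlib-68 layout: indices 36-41 and 42-47 outline the two eyes.
EYE_A = slice(36, 42)
EYE_B = slice(42, 48)


def normalize_landmarks_sketch(landmarks):
    """Center on the landmark centroid and scale by inter-eye distance.

    landmarks: array of shape [N, 68, 2] in pixel coordinates.
    Returns an array of shape [N, 136].
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    centroid = landmarks.mean(axis=1, keepdims=True)      # [N, 1, 2]
    eye_a = landmarks[:, EYE_A].mean(axis=1)              # [N, 2]
    eye_b = landmarks[:, EYE_B].mean(axis=1)              # [N, 2]
    inter_eye = np.linalg.norm(eye_a - eye_b, axis=1)     # [N]
    scale = np.maximum(inter_eye, 1e-8)[:, None, None]    # guard against division by zero
    normalized = (landmarks - centroid) / scale
    return normalized.reshape(len(landmarks), -1)


# Translating and uniformly rescaling the same face leaves the features unchanged,
# so they do not depend on where the face sits in the image or how large it is.
rng = np.random.default_rng(0)
face = rng.uniform(0.0, 200.0, size=(1, 68, 2))
moved = face * 2.5 + np.array([40.0, -15.0])
features = normalize_landmarks_sketch(face)
features_moved = normalize_landmarks_sketch(moved)
```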
## Model Details
- **Model type**: Multi-layer perceptron (MLP)
- **Architecture**: `Linear(136→512) → LayerNorm → GELU → Dropout(0.15)
→ Linear(512→256) → LayerNorm → GELU → Dropout → Linear(256→128) →
LayerNorm → GELU → Dropout → Linear(128→6)`
- **Parameter count**: 236,934 (~0.9 MB safetensors)
- **Input**: 68 2D landmarks, normalized by landmark centroid and
inter-eye distance (`feat.utils.face_pose_mlp.normalize_landmarks`).
- **Output**: 6 values — `[Pitch, Roll, Yaw, X, Y, Z]`. The MLP emits
z-scored values; the loader de-normalizes using `mean`/`std` stored in
the sidecar `pose_mlp_v2.json`. Angles are radians, calibrated to
img2pose's coordinate frame.
- **Framework**: PyTorch (safetensors weight file, no pickle).
- **Inference cost**: ~10 µs / face on CPU (batched), negligible vs.
the upstream face/landmark stages.
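
The stated parameter count can be checked by hand from the layer widths. The short calculation below assumes every `Linear` layer has a bias, every `LayerNorm` carries a learnable scale and shift, and weights are stored as float32; under those assumptions it reproduces the figures in the list above.

```python
# Layer widths from the architecture line: 136 -> 512 -> 256 -> 128 -> 6.
widths = [136, 512, 256, 128, 6]

total = 0
for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
    total += fan_in * fan_out + fan_out      # Linear weight + bias
    is_hidden = i < len(widths) - 2          # the final Linear has no LayerNorm
    if is_hidden:
        total += 2 * fan_out                 # LayerNorm scale + shift

size_mb = total * 4 / 1e6                    # 4 bytes per float32 parameter

print(total)              # 236934
print(round(size_mb, 2))  # 0.95
```

GELU and Dropout contribute no parameters, so they do not appear in the sum.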
## Training Details
- **Teacher**: `img2pose` (Albiero et al., 2021). The MLP is trained to
match img2pose's regressed `[Pitch, Roll, Yaw, X, Y, Z]` outputs.
- **Training corpus**: CelebV-HQ — `n_clips = 35,445`,
`n_train_frames = 2,783,134`, `n_val_frames = 154,619`. Frames with
`FaceScore < 0.8` or `|pose| > 75°` are dropped (filters bad teacher
signal on degenerate poses).
- **Loss**: MSE on z-scored 6D output.
- **Optimizer**: Adam, `lr=1e-3`, `batch_size=1024`.
- **Epochs**: 40 (best val loss at last epoch — see `pose_mlp_v2.json`
for per-epoch history).
- **Hardware**: single GPU (training takes ~2 hr).
- **Seed**: 42.
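
The z-scored loss and the matching de-normalization at inference time can be sketched as below. The teacher targets here are synthetic stand-ins, the helper names are hypothetical, and computing `mean`/`std` per axis over the training targets is an assumption about how the statistics in `pose_mlp_v2.json` were derived.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in teacher targets [Pitch, Roll, Yaw, X, Y, Z]; real values come from img2pose.
teacher = rng.normal(loc=[0.0, 0.0, 0.0, 0.1, -0.2, 5.0],
                     scale=[0.2, 0.1, 0.4, 1.0, 1.0, 2.0],
                     size=(1000, 6))

# Per-axis statistics, computed once on the training targets.
mean = teacher.mean(axis=0)
std = teacher.std(axis=0)


def zscore(pose):
    """Map raw pose values to zero mean and unit variance per axis."""
    return (pose - mean) / std


def denormalize(z):
    """Invert zscore(); this is the step applied to the MLP output at inference."""
    return z * std + mean


def mse(pred_z, target_z):
    """Mean squared error over all samples and all six axes."""
    return float(np.mean((pred_z - target_z) ** 2))


target_z = zscore(teacher)
pred_z = target_z + 0.1           # pretend the student is off by 0.1 std on every axis
loss = mse(pred_z, target_z)      # 0.01 up to floating-point error
recovered = denormalize(target_z)
```

Z-scoring puts the angles and the translation components on a comparable scale, so a single MSE term weights all six outputs roughly equally.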
### Held-out validation MAE on CelebV-HQ (clip-disjoint split)
| Axis | MAE (°) |
|---|---|
| Pitch | 2.66 |
| Roll | 2.34 |
| Yaw | 1.58 |

For reference, img2pose's reported MAE on the AFLW2000-3D / BIWI test
sets is ~4° on average. The values above measure the gap between the MLP
and the teacher's predictions, not error against a ground-truth
motion-capture rig, and the MLP cannot exceed its teacher.
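
A per-axis MAE of this kind can be computed as in the sketch below. The function name is hypothetical, the inputs are toy values, and wrapping angle differences into [-180°, 180°) is an assumption; this card does not say whether the reported numbers use wrapping.

```python
import numpy as np


def angular_mae_deg(student_rad, teacher_rad):
    """Per-axis mean absolute error in degrees for [N, 3] angle arrays in radians."""
    diff = np.asarray(student_rad) - np.asarray(teacher_rad)
    # Wrap differences into [-pi, pi) so errors near the +/-180 degree seam stay small.
    diff = (diff + np.pi) % (2 * np.pi) - np.pi
    return np.degrees(np.abs(diff)).mean(axis=0)


# Toy teacher angles and a student that is off by a few degrees per axis.
teacher_rad = np.radians([[10.0, -5.0, 30.0],
                          [-20.0, 2.0, -45.0]])
student_rad = teacher_rad + np.radians([[2.0, -1.0, 3.0],
                                        [-4.0, 3.0, 1.0]])

mae = angular_mae_deg(student_rad, teacher_rad)   # [Pitch, Roll, Yaw], about [3.0, 2.0, 2.0]
```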
### v1 → v2 changelog
| Aspect | v1 | v2 |
|---|---|---|
| Hidden | 256→128→64 | 512→256→128 |
| Block | Linear → ReLU → Dropout | Linear → LayerNorm → GELU → Dropout |
| Dropout | 0.10 | 0.15 |
| Training frames | 569,678 | 2,783,134 |
| Epochs | 30 | 40 |
| Best val loss | 0.0809 | 0.0777 |
| Roll MAE (°) | 2.530 | 2.335 |
## Intended Use
- **Primary**: Drop-in replacement for img2pose's pose head when using
`py-feat` with a face detector that doesn't predict pose
(`face_model='retinaface'` in `feat.Detector`, MediaPipe in
`feat.MPDetector`).
- **Secondary**: Any pipeline that produces 68 dlib-style face landmarks
and wants img2pose-compatible head pose without re-running img2pose.
### Out of scope
- Eye / gaze direction — use `L2CS-Net` for gaze.
- MediaPipe-478 landmarks (translate to 68 dlib landmarks first).
- Head-pose inference from a partial landmark set (fewer than 68 points).
## Usage
The MLP is loaded automatically by `feat.Detector` when
`face_model != 'img2pose'`. To call it directly:
```python
import torch
from feat.utils.face_pose_mlp import pose_from_landmarks_mlp

# 68 (x, y) landmarks in image-pixel coordinates, e.g. from mobilefacenet.
# Random placeholder values keep the snippet runnable; substitute real landmarks.
landmarks = torch.rand(68, 2) * 256.0  # [68, 2], float32
landmarks = landmarks.unsqueeze(0)     # [1, 68, 2]

pose = pose_from_landmarks_mlp(landmarks)  # [1, 6]: (Pitch, Roll, Yaw, X, Y, Z)
print(pose)
```
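
Because the angles are returned in radians, a downstream step may want them in degrees. The sketch below uses a hand-written stand-in row rather than real model output, and the `_deg` key names are hypothetical.

```python
import math

# Stand-in for one row of the [N, 6] output, ordered (Pitch, Roll, Yaw, X, Y, Z).
pose_row = [0.10, -0.05, 0.52, 0.3, -0.1, 6.0]

names = ["pitch", "roll", "yaw", "x", "y", "z"]
record = dict(zip(names, pose_row))

# Only the three angles are in radians; the translation values are left untouched.
for axis in ("pitch", "roll", "yaw"):
    record[axis + "_deg"] = math.degrees(record[axis])
```
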
Weights resolve from (in order):
1. `FEAT_POSE_MLP_PATH` environment variable
2. `models/pose_mlp_v2.safetensors` in the repo
3. This HuggingFace repo (`py-feat/pose_mlp_v2`)
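
The lookup order above can be sketched as a small resolver. This is not `py-feat`'s loader: the function name, the `(source, location)` return shape, and the choice to trust the environment variable without checking that the file exists are assumptions. The sketch stops short of downloading anything and only reports which source would be used.

```python
import os
from pathlib import Path

HUB_REPO_ID = "py-feat/pose_mlp_v2"        # repo id from the list above
WEIGHTS_NAME = "pose_mlp_v2.safetensors"   # file name from the Files section


def resolve_weights_sketch(repo_root, env=None):
    """Return (source, location) following the three-step order above."""
    env = os.environ if env is None else env

    # 1. Explicit override via environment variable.
    override = env.get("FEAT_POSE_MLP_PATH")
    if override:
        return "env", Path(override)

    # 2. Local copy inside the repository checkout.
    local = Path(repo_root) / "models" / WEIGHTS_NAME
    if local.is_file():
        return "local", local

    # 3. Fall back to the Hub repo; an actual loader would download from here.
    return "hub", HUB_REPO_ID


source, location = resolve_weights_sketch(
    ".", env={"FEAT_POSE_MLP_PATH": "/tmp/custom.safetensors"}
)
```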
## Limitations
- The MLP cannot improve on img2pose's accuracy; it only reproduces it
more efficiently from bbox-free input. Use img2pose directly if you
need img2pose's exact behavior, since a distillation gap remains
(roughly 1.6° to 2.7° MAE per axis on the held-out validation split).
- Trained on CelebV-HQ, so performance on non-frontal, occluded, or
heavily rotated faces (>75°) is degraded by both the teacher and the
data filter.
- Output coordinates are img2pose's frame, not a standard FACS / BIWI
frame. Pose values are interpretable across the `py-feat` pipeline
but may need recalibration to compare with other tools.
## Citation
If you use `py-feat` and this pose-MLP, please cite both `py-feat` and
img2pose:
```bibtex
@article{cheong2023pyfeat,
title={Py-Feat: Python Facial Expression Analysis Toolbox},
author={Cheong, Jin Hyun and Jolly, Eshin and Xie, Tiankang and Byrne, Sophie and Kenney, Matthew and Chang, Luke J.},
journal={Affective Science},
volume={4},
pages={781--796},
year={2023}
}
@inproceedings{albiero2021img2pose,
title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={7617--7627},
year={2021}
}
@inproceedings{zhu2022celebvhq,
title={CelebV-HQ: A Large-Scale Video Facial Attributes Dataset},
author={Zhu, Hao and Wu, Wayne and Zhu, Wentao and Jiang, Liming and Tang, Siwei and Zhang, Li and Liu, Ziwei and Loy, Chen Change},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022}
}
```
## License
MIT (this distillation). The teacher (`img2pose`) is BSD-3, and the
training corpus (CelebV-HQ) is released for non-commercial research
use — please honor each upstream license if you re-train or
re-distribute.
## Files
- `pose_mlp_v2.safetensors` — model weights (1 MB)
- `pose_mlp_v2.json` — architecture, output-normalization stats, training
history, validation MAE per epoch
- `README.md` — this card
## Acknowledgments
Distilled from img2pose by Vítor Albiero et al. (University of Notre Dame /
Facebook AI), trained on CelebV-HQ by Hao Zhu et al. (SenseTime Research /
S-Lab NTU). Built and maintained by [Cosanlab](https://cosanlab.com) at
Dartmouth.