	---
	license: other
	license_name: research-only
	license_link: LICENSE
	library_name: py-feat
	tags:
	- facial-expression-analysis
	- action-units
	- emotion-recognition
	- gaze-estimation
	- face-landmarks
	- head-pose
	- blendshapes
	- multitask
	pipeline_tag: image-classification
	---

	# face_multitask_v2

	A single multi-task convolutional model for facial behavior analysis, used by
	[py-feat](https://github.com/cosanlab/py-feat)'s `Detectorv2`. From one face crop
	it jointly predicts **action units, categorical emotion, valence/arousal,
	eye gaze, a 478-point face mesh, 6-DoF head pose, and 52 MediaPipe/ARKit
	blendshapes**.

	- Backbone: ConvNeXt-V2 Tiny (FCMAE + IN-22k/IN-1k pretrained)
	- Heads: AU graph (AFG/FGG/SC) + unified-feature emotion/V-A and
	gaze heads + landmark, pose, and blendshape regression heads
	- Params: ~30M · Input: 224×224 RGB (from a 256×256 face crop)
	- File: `face_multitask_v2.safetensors` (the `ModelV2Config` is stored as JSON in the file metadata)
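
To inspect the metadata mentioned in the last bullet without loading any weights, the safetensors header can be read directly. This is a minimal sketch that relies only on the safetensors file layout (an 8-byte little-endian header length, then a JSON header with an optional `__metadata__` dict). The helper names are hypothetical, and the metadata key that holds the `ModelV2Config` is not stated in this card, so the sketch decodes every value that parses as JSON rather than guessing a key.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the free-form string metadata stored in a .safetensors header."""
    with open(path, "rb") as f:
        # The first 8 bytes are a little-endian uint64 giving the JSON header size.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Tensor entries sit beside an optional "__metadata__" dict of str -> str.
    return header.get("__metadata__", {})


def decode_json_values(metadata):
    """Parse every metadata value that is itself JSON; keep the rest as strings."""
    decoded = {}
    for key, value in metadata.items():
        try:
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            decoded[key] = value
    return decoded
```

Calling `decode_json_values(read_safetensors_metadata("face_multitask_v2.safetensors"))` and listing the keys of the result shows which entry carries the config.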

	## Outputs

	| Task | Output | Notes |
	|---|---|---|
	| Action Units | 20 probabilities [0,1] | AU01,02,04,05,06,07,09,10,11,12,14,15,17,20,23,24,25,26,28,43 |
	| Emotion | 7-class softmax | Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger |
	| Valence / Arousal | 2 × [−1,1] | tanh |
	| Gaze | (yaw, pitch) radians | head-centric; yaw+ = right, pitch+ = up |
	| Face mesh | 478 × (x,y,z) | MediaPipe topology, chip-pixel coords (z = relative depth) |
	| Head pose | (yaw, pitch, roll, tx, ty, tz) | radians / pixels |
	| 68 landmarks | derived | dlib-68 subset sampled from the 478 mesh |
	| Blendshapes | 52 coefficients [0,1] | MediaPipe/ARKit standard names (browInnerUp, jawOpen, mouthSmileLeft, …) |
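
The AU and emotion rows above can be turned into readable labels with a few lines of Python. This is a sketch under two assumptions that the card does not confirm: the output vectors are indexed in the same order as the table lists them, and 0.5 is a reasonable AU activation threshold (it is only an illustrative default). The function names are hypothetical and are not part of py-feat.

```python
# Assumed index order: taken from the table, not confirmed by the card.
AU_NAMES = [
    "AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09", "AU10", "AU11", "AU12",
    "AU14", "AU15", "AU17", "AU20", "AU23", "AU24", "AU25", "AU26", "AU28", "AU43",
]
EMOTIONS = ["Neutral", "Happy", "Sad", "Surprise", "Fear", "Disgust", "Anger"]


def active_aus(au_probs, threshold=0.5):
    """Return the names of AUs whose probability reaches the threshold."""
    if len(au_probs) != len(AU_NAMES):
        raise ValueError(f"expected {len(AU_NAMES)} AU probabilities, got {len(au_probs)}")
    return [name for name, p in zip(AU_NAMES, au_probs) if p >= threshold]


def top_emotion(emotion_probs):
    """Return (label, probability) for the highest-scoring emotion class."""
    if len(emotion_probs) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} emotion probabilities, got {len(emotion_probs)}")
    idx = max(range(len(emotion_probs)), key=emotion_probs.__getitem__)
    return EMOTIONS[idx], emotion_probs[idx]
```

In practice the threshold is worth tuning per AU, since the per-AU base rates differ widely across datasets.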

	## Benchmarks (held-out, file-verified — v2.5 deployed checkpoint)

	| Task | Dataset | Metric | Score |
	|---|---|---|---|
	| AU | DISFA+ (12-AU, Cheong protocol) | macro-F1 | 0.693 |
	| AU | DISFA+ (8-AU subset) | macro-F1 | 0.740 |
	| Emotion | RAF-DB official test (7-cls) | acc / macro-F1 | 0.910 / 0.885 |
	| Emotion | AffectNet val (7-cls, drop Contempt) | acc / macro-F1 | 0.616 / 0.612 |
	| Valence/Arousal | Aff-Wild2 official validation | CCC (V / A) | 0.852 / 0.799 |
	| Gaze | MPIIGaze (leave-subject-out) | mean angular err | 7.05° |
	| Gaze | Gaze360 (held-out split) | mean angular err | 12.89° |
	Notes: **Gaze numbers are now leave-subject-out, held-out results**, so they
	measure generalization to unseen subjects. All numbers are from the deployed
	checkpoint (`v25c_release_ep14`), weight-verified against the published `.safetensors`.
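
For readers reproducing the valence/arousal and gaze rows, the two less common metrics can be sketched as follows. This uses the standard definitions of the concordance correlation coefficient and of the angle between gaze directions. The evaluation scripts behind the table are not shown in this card, so details such as per-video versus pooled CCC are assumptions here. The axis convention in `gaze_to_vector` is also an assumption, although the resulting angle does not depend on it as long as predictions and targets use the same one.

```python
import math


def ccc(pred, target):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    # Penalizes both low correlation and a shift in mean or scale.
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def gaze_to_vector(yaw, pitch):
    """Unit direction for (yaw, pitch) in radians; axis layout is an assumption."""
    return (
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
        math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between two (yaw, pitch) gaze directions."""
    a = gaze_to_vector(*pred)
    b = gaze_to_vector(*target)
    dot = sum(i * j for i, j in zip(a, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def mean_angular_error_deg(preds, targets):
    """Average angular error over paired (yaw, pitch) predictions and targets."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Note that `ccc` is undefined when both inputs are constant and equal, because the denominator is then zero.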

	## Usage

	```python
	from feat import Detectorv2
	detector = Detectorv2(device="cuda")
	fex = detector.detect("image.jpg") # returns a py-feat Fex
	```

	The model expects a face crop produced by RetinaFace detection followed by py-feat's
	`extract_face_from_bbox_torch(frame, bbox, face_size=256, expand_bbox=1.2)`,
	then center-cropped to 224×224 and ImageNet-normalized. `Detectorv2` handles all of this internally.
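
The last two preprocessing steps (center crop and ImageNet normalization) are simple enough to spell out. The sketch below uses plain nested lists so the arithmetic is explicit. It is not py-feat's implementation, which operates on torch tensors, and the function names are hypothetical. It assumes the 256×256 crop holds RGB values in the 0-255 range and uses the standard ImageNet channel means and standard deviations.

```python
# Standard ImageNet RGB statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop(image, size=224):
    """Crop the central size x size window from an H x W x 3 nested list."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def imagenet_normalize(image):
    """Scale 0-255 RGB values to [0, 1], then standardize each channel."""
    return [
        [
            tuple(
                (value / 255.0 - mean) / std
                for value, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
            )
            for pixel in row
        ]
        for row in image
    ]
```

For a 256×256 input, `center_crop` drops a 16-pixel border on every side, which is why the model sees slightly less context than the `expand_bbox=1.2` crop contains.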

	## License

	Research / non-commercial use only. Trained on datasets (AffectNet, DISFA+,
	RAF-DB, Aff-Wild2, BP4D, etc.) whose licenses restrict use to academic research.
	The ConvNeXt-V2 backbone is MIT-licensed. Confirm each constituent dataset's
	terms before any non-research use.