	---
	license: mit
	tags:
	- facial-expression-recognition
	- emotion-recognition
	- computer-vision
	- pytorch
	- mediapipe
	- efficientnet
	- real-time
	- image-classification
	pipeline_tag: image-classification
	---

	<div align="center">

	![header](https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=VisageCNN&fontSize=70&fontColor=fff&animation=fadeIn&fontAlignY=38&desc=Real-Time%20Facial%20Expression%20Recognition&descAlignY=60&descAlign=50)

	<a href="https://git.io/typing-svg"><img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=06B6D4&center=true&vCenter=true&width=750&lines=Hybrid+CNN+%2B+MediaPipe+Landmark+Architecture;7+Emotion+Classes+%E2%80%94+Real-Time+at+30+FPS;Bidirectional+Cross-Attention+%7C+EfficientNet-B2+%2B+478+Landmarks;87.9%25+Validation+Accuracy+%7C+Disgust+92%25+Recall" alt="Typing SVG" /></a>

	<br/>

	![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&logo=python&logoColor=white)
	![PyTorch](https://img.shields.io/badge/PyTorch-2.x-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)
	![MediaPipe](https://img.shields.io/badge/MediaPipe-0.10-00BCD4?style=for-the-badge&logo=google&logoColor=white)
	![OpenCV](https://img.shields.io/badge/OpenCV-4.x-5C3EE8?style=for-the-badge&logo=opencv&logoColor=white)
	![CUDA](https://img.shields.io/badge/CUDA-11.8+-76B900?style=for-the-badge&logo=nvidia&logoColor=white)
	![License](https://img.shields.io/badge/License-MIT-yellow?style=for-the-badge)
	[![GitHub](https://img.shields.io/badge/GitHub-VisageCNN-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/Huuffy/VisageCNN)

	</div>

	---

	## What Is This?

	HybridEmotionNet is a dual-branch neural network for real-time facial emotion recognition. It fuses EfficientNet-B2 appearance features with MediaPipe 3D landmark geometry via bidirectional cross-attention.

	The pipeline processes webcam frames at 30+ FPS: it extracts 478 3D landmarks, crops the face to 224×224, and classifies the crop into one of 7 emotions, with EMA and sliding-window temporal smoothing applied across frames.
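
The temporal smoothing step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the class name `TemporalSmoother`, the EMA weight `alpha=0.3`, and the window length of 5 frames are all assumptions chosen for the example.

```python
from collections import Counter, deque


class TemporalSmoother:
    """EMA over class probabilities, then a majority vote over recent labels."""

    def __init__(self, num_classes=7, alpha=0.3, window=5):
        self.alpha = alpha                             # weight of the newest frame (assumed value)
        self.ema = [1.0 / num_classes] * num_classes   # start from a uniform distribution
        self.recent = deque(maxlen=window)             # labels of the last `window` frames

    def update(self, probs):
        # Exponential moving average of the per-class probabilities.
        self.ema = [self.alpha * p + (1 - self.alpha) * e
                    for p, e in zip(probs, self.ema)]
        # Label of the smoothed distribution for this frame.
        self.recent.append(max(range(len(self.ema)), key=self.ema.__getitem__))
        # Majority vote over the sliding window.
        return Counter(self.recent).most_common(1)[0][0]


smoother = TemporalSmoother()
happy = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
angry = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
labels = [smoother.update(f) for f in [happy] * 4 + [angry] + [happy]]
print(labels)  # [3, 3, 3, 3, 3, 3]
```

With these settings a single spurious "Angry" frame is absorbed by the EMA, so the displayed label stays on "Happy" (index 3).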

	Highlights: 87.9% validation accuracy · Disgust recall 51% → 90% · Fear recall 65% → 75% · 75k balanced training images · ViT-scored quality filtering

	---

	## Architecture

	![Architecture](https://huggingface.co/Huuffy/VisageCNN/resolve/main/Architecture%20digram.png)

	```
	Face crop (224×224)  ──► EfficientNet-B2 ──► [B, 256] appearance
	                         blocks 0-1 frozen
	                         blocks 2-8 fine-tuned

	478 landmarks (xyz)  ──► MLP encoder     ──► [B, 256] geometry
	                         1434 → 512 → 384 → 256

	    Bidirectional Cross-Attention (4 heads each)
	  ┌──────────────────────────────────────────┐
	  │ coord → CNN  (geometry queries appear.)  │
	  │ CNN → coord  (appear. queries geometry)  │
	  └──────────────────────────────────────────┘
	                       │
	    Fusion MLP: 512 → 384 → 256 → 128
	                       │
	    Classifier: 128 → 7 emotions
	```

	| Component | Detail |
	|-----------|--------|
	| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
	| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
	| Fusion | Bidirectional cross-attention + MLP |
	| Parameters | ~8M total |
	| Model size | ~90 MB |
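
The layer widths above can be checked with a little arithmetic. The sketch below assumes each arrow is a plain fully connected layer with a bias, and that the 512-wide fusion input is the concatenation of the two 256-dimensional branch outputs. Neither is stated explicitly here, and BatchNorm and attention parameters are left out.

```python
def mlp_param_count(dims):
    """Weights plus biases of a fully connected stack (BatchNorm not counted)."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


NUM_LANDMARKS, COORDS = 478, 3
coord_dims = [NUM_LANDMARKS * COORDS, 512, 384, 256]  # 1434 -> 512 -> 384 -> 256
fusion_dims = [512, 384, 256, 128]                    # 512 = 256 + 256 (assumed concatenation)
classifier_dims = [128, 7]

print(coord_dims[0])                      # 1434
print(mlp_param_count(coord_dims))        # 1030272
print(mlp_param_count(fusion_dims))       # 328448
print(mlp_param_count(classifier_dims))   # 903
```

Under these assumptions the landmark encoder accounts for roughly 1M of the ~8M parameters, with most of the remainder in the EfficientNet-B2 backbone.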

	---

	## Performance

	| Metric | Value |
	|--------|-------|
	| Validation accuracy | 87.9% |
	| Macro F1 | 0.88 |
	| Inference speed | ~12 ms/frame on RTX 3050 |

	| Emotion | Precision | Recall | F1 |
	|---------|-----------|--------|----|
	| Angry | 0.85 | 0.83 | 0.84 |
	| Disgust | 0.97 | 0.90 | 0.94 |
	| Fear | 0.89 | 0.75 | 0.82 |
	| Happy | 0.97 | 0.99 | 0.98 |
	| Neutral | 0.85 | 0.91 | 0.88 |
	| Sad | 0.78 | 0.88 | 0.83 |
	| Surprised | 0.83 | 0.90 | 0.86 |
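
For reference, the summary metrics follow from the per-class rows. The sketch below recomputes macro F1 as the unweighted mean of the per-class F1 column and shows how F1 is derived from precision and recall. The helper name `f1` is only for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {
    "Angry": 0.84, "Disgust": 0.94, "Fear": 0.82, "Happy": 0.98,
    "Neutral": 0.88, "Sad": 0.83, "Surprised": 0.86,
}

# Macro F1: every class counts equally, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))        # 0.88

# F1 for "Angry" recomputed from its precision and recall.
print(round(f1(0.85, 0.83), 2))  # 0.84
```

Because the table's precision and recall are already rounded to two decimals, recomputing F1 from them can differ from the listed F1 by 0.01 for some classes (Disgust and Fear, for example).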

	---

	## Files in This Repo

	| File | Size | Required |
	|------|------|---------|
	| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes (best macro F1 checkpoint) |
	| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional (SWA ensemble model) |
	| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes (landmark scaler) |

	---

	## Quick Start

	### 1 — Clone the code

	```bash
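	git clone https://github.com/Huuffy/VisageCNN.git
	cd VisageCNN
	# Create and activate a virtual environment (Windows; on Linux/macOS use: source venv/bin/activate)
	python -m venv venv && venv\Scripts\activate
	# Install the CUDA 12.6 build of PyTorch first, then the remaining dependencies
	pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
	pip install -r requirements.txt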
	```

	### 2 — Download weights

	```python
	from huggingface_hub import hf_hub_download
	import pathlib
	import shutil

	# (path in the Hub repo, destination path in the working tree)
	for remote, local in [
	    ("models/weights/hybrid_best_model.pth", "models/weights/hybrid_best_model.pth"),
	    ("models/weights/hybrid_swa_final.pth", "models/weights/hybrid_swa_final.pth"),
	    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
	]:
	    # Download into the local Hugging Face cache and get the cached file path
	    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
	    # Create the destination folder if needed, then copy the file into place
	    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
	    shutil.copy(src, local)
	```

	Or with the HF CLI:
	```bash
	huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
	huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
	huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
	```
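
After downloading, it can help to confirm that the required files landed where the inference script expects them. The following is a small sketch using only the paths listed under "Files in This Repo". The helper `missing_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

REQUIRED = [
    "models/weights/hybrid_best_model.pth",
    "models/scalers/hybrid_coordinate_scaler.pkl",
]
OPTIONAL = ["models/weights/hybrid_swa_final.pth"]  # only needed for --ensemble


def missing_files(root=".", paths=REQUIRED):
    """Return the entries of `paths` that do not exist as files under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]


# Run from the repository root; an empty list means everything is in place.
print("missing required:", missing_files())
print("missing optional:", missing_files(paths=OPTIONAL))
```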

	### 3 — Run inference

	```bash
	# Standard
	python inference/run_hybrid.py

	# With SWA ensemble
	python inference/run_hybrid.py --ensemble
	```

	Press Q to quit.

	---

	## Emotion Classes

	| Label | Emotion | Key Signals |
	|-------|---------|-------------|
	| 0 | Angry | Furrowed brows, tightened jaw |
	| 1 | Disgust | Raised upper lip, wrinkled nose |
	| 2 | Fear | Wide eyes, raised brows, open mouth |
	| 3 | Happy | Raised cheeks, open smile |
	| 4 | Neutral | Relaxed, no strong deformation |
	| 5 | Sad | Lowered brow corners, downturned lips |
	| 6 | Surprised | Raised brows, wide eyes, dropped jaw |
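
The label indices map to names as in the table above. A minimal decoding sketch, assuming the classifier's 7 outputs are ordered by these label indices and have already been converted to probabilities (the function name `decode` is illustrative):

```python
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]


def decode(probs):
    """Return (label index, emotion name, probability) of the most likely class."""
    if len(probs) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} probabilities, got {len(probs)}")
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, EMOTIONS[idx], probs[idx]


print(decode([0.05, 0.02, 0.03, 0.70, 0.10, 0.05, 0.05]))  # (3, 'Happy', 0.7)
```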

	---

	## Training Dataset

	75,376 total images — 10,768 per class × 7 emotions, perfectly balanced.

	Sources: AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle

	All images passed a two-stage quality filter:
	1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
	2. ViT confidence scoring (`dima806/facial_emotions_image_detection`) with per-class asymmetric mislabel thresholds

	The final class balance was achieved via ViT-scored capping: the lowest-confidence images were removed first, preserving the highest-quality examples per class.
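
The capping step can be illustrated with a short sketch. It assumes each sample is a `(path, label, confidence)` tuple in which the confidence is the ViT score for the sample's own label. The function name `cap_per_class` and this data layout are hypothetical, not taken from the repository.

```python
from collections import defaultdict


def cap_per_class(samples, cap):
    """Keep at most `cap` samples per label, dropping the lowest-confidence ones first."""
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)

    kept = []
    for label in sorted(by_class):
        # Highest confidence first, so the slice discards the weakest examples.
        ranked = sorted(by_class[label], key=lambda s: s[2], reverse=True)
        kept.extend(ranked[:cap])
    return kept


samples = [
    ("a.jpg", 0, 0.9), ("b.jpg", 0, 0.4), ("c.jpg", 0, 0.7),
    ("d.jpg", 1, 0.6),
]
print([path for path, _, _ in cap_per_class(samples, cap=2)])  # ['a.jpg', 'c.jpg', 'd.jpg']
```

With a cap of 10,768 applied to each of the 7 classes, the result is the 75,376-image balanced set described above.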

	---

	## Training Config

	| Setting | Value |
	|---------|-------|
	| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
	| Optimizer | AdamW, weight decay 0.05 |
	| LR | OneCycleLR: CNN 5e-5, fusion 5e-4 |
	| Batch | 96 + grad accumulation ×2 (eff. 192) |
	| Augmentation | CutMix + noise + rotation + zoom |
	| Mixed precision | torch.amp (AMP) |
	| Best model saved by | Macro F1 (not val accuracy) |
	| SWA | Epochs 30–70, BN update after training |
	| Early stopping | patience=15 on macro F1 |
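
To make the loss setting concrete, here is a dependency-free sketch of focal loss with label smoothing for a single sample. How the training script combines the two is not stated here, so this is only one plausible formulation: the targets are smoothed as `q_i = ε/K + (1 − ε)·[i = target]` (the convention PyTorch uses for label smoothing), and each class term is scaled by the focal factor `(1 − p_i)^γ`.

```python
import math


def focal_loss(probs, target, gamma=2.0, smoothing=0.12, eps=1e-12):
    """Focal loss with smoothed targets for one sample (assumed formulation)."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        # Smoothed target: most of the mass on the true class, the rest spread uniformly.
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        # The focal factor (1 - p)^gamma down-weights classes predicted with high probability.
        loss += -q * (1.0 - p) ** gamma * math.log(max(p, eps))
    return loss


# With gamma=0 and no smoothing this reduces to plain cross-entropy.
print(math.isclose(focal_loss([0.25, 0.75], 1, gamma=0.0, smoothing=0.0), -math.log(0.75)))  # True

# A confident correct prediction contributes far less than an uncertain one.
print(focal_loss([0.05, 0.95], 1, smoothing=0.0) < focal_loss([0.4, 0.6], 1, smoothing=0.0))  # True
```

The effective batch size in the table is the per-step batch times the accumulation factor: 96 × 2 = 192.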

	---

	## Retrain From Scratch

	```bash
	# Delete the old cache, then train (Windows cmd syntax; on Linux/macOS use: rm -rf models/cache)
	rmdir /s /q models\cache
	python scripts/train_hybrid.py
	```

	Full guide: [GitHub README](https://github.com/Huuffy/VisageCNN)

	---

	<div align="center">

	Built with curiosity and a lot of training runs

	![footer](https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=120&section=footer)

	</div>