---
license: mit
tags:
- facial-expression-recognition
- emotion-recognition
- computer-vision
- pytorch
- mediapipe
- efficientnet
- real-time
- image-classification
pipeline_tag: image-classification
---

<div align="center">

<a href="https://git.io/typing-svg"><img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=06B6D4&center=true&vCenter=true&width=750&lines=Hybrid+CNN+%2B+MediaPipe+Landmark+Architecture;7+Emotion+Classes+%E2%80%94+Real-Time+at+30+FPS;Bidirectional+Cross-Attention+%7C+EfficientNet-B2+%2B+478+Landmarks;87.9%25+Validation+Accuracy+%7C+Disgust+92%25+Recall" alt="Typing SVG" /></a>

<br/>
|
|
|
|
</div>

---

## What Is This?
|
|
**HybridEmotionNet** is a dual-branch neural network for real-time facial emotion recognition that fuses **EfficientNet-B2 appearance features** with **MediaPipe 3D landmark geometry** via bidirectional cross-attention.
|
|
It processes webcam frames at **30+ FPS**, extracts **478 3D landmarks**, crops the face to 224×224, and classifies it into 7 emotions with EMA + sliding-window temporal smoothing.
|
|
**Highlights:** 87.9% validation accuracy · Disgust recall 51% → 90% · Fear recall 65% → 75% · 75k balanced training images · ViT-scored quality filtering
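
The EMA + sliding-window smoothing mentioned above can be sketched roughly as follows. This is an illustrative, hypothetical implementation: the `alpha` and `window` values, class count, and class names are assumptions, not the repo's actual code.

```python
from collections import Counter, deque

class TemporalSmoother:
    """EMA over per-frame class probabilities plus a sliding-window vote.

    Illustrative sketch only: alpha and window are hypothetical values,
    not the ones used in inference/run_hybrid.py.
    """

    def __init__(self, num_classes=7, alpha=0.3, window=10):
        self.alpha = alpha
        self.ema = [0.0] * num_classes
        self.history = deque(maxlen=window)

    def update(self, probs):
        # Exponential moving average of the probability vector.
        self.ema = [self.alpha * p + (1 - self.alpha) * e
                    for p, e in zip(probs, self.ema)]
        # Record the argmax of the smoothed distribution...
        self.history.append(max(range(len(self.ema)), key=self.ema.__getitem__))
        # ...and report the majority label over the sliding window.
        return Counter(self.history).most_common(1)[0][0]
```

Per-frame jitter in the raw predictions is absorbed twice: first by the EMA, then by the majority vote over recent frames.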
|
|
---

## Architecture
|
|
|
|
```
Face crop (224×224) ──► EfficientNet-B2 ──► [B, 256] appearance
                        blocks 0-1 frozen
                        blocks 2-8 fine-tuned

478 landmarks (xyz) ──► MLP encoder ──► [B, 256] geometry
                        1434 → 512 → 384 → 256

      Bidirectional Cross-Attention (4 heads each)
      ┌──────────────────────────────────────────┐
      │ coord → CNN  (geometry queries appear.)  │
      │ CNN → coord  (appear. queries geometry)  │
      └──────────────────────────────────────────┘
                          │
           Fusion MLP: 512 → 384 → 256 → 128
                          │
            Classifier: 128 → 7 emotions
```
|
|
| | Component | Detail | |
| |-----------|--------| |
| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
| | Fusion | Bidirectional cross-attention + MLP | |
| | Parameters | ~8M total | |
| | Model size | ~90 MB | |
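
A minimal PyTorch sketch of the fusion stage, using the shapes from the diagram and table above. This is an assumption-laden illustration: the class and layer names are invented here, and the real model's attention wiring and MLP details may differ.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Fuse a [B, 256] appearance vector with a [B, 256] geometry vector
    via two cross-attention blocks, then the fusion MLP and classifier.

    Illustrative sketch only; not the repo's actual module.
    """

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.geo_to_app = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.app_to_geo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fusion = nn.Sequential(          # 512 -> 384 -> 256 -> 128
            nn.Linear(2 * dim, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, 7)   # 7 emotion classes

    def forward(self, appearance, geometry):
        # Treat each vector as a length-1 token sequence: [B, 1, 256].
        app, geo = appearance.unsqueeze(1), geometry.unsqueeze(1)
        # Geometry queries appearance, and vice versa.
        g2a, _ = self.geo_to_app(geo, app, app)
        a2g, _ = self.app_to_geo(app, geo, geo)
        fused = torch.cat([g2a, a2g], dim=-1).squeeze(1)  # [B, 512]
        return self.classifier(self.fusion(fused))        # [B, 7] logits
```

Each branch contributes a single token, so the cross-attention here reduces to learned gating between the two modalities; with per-landmark tokens it would become full token-level attention.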
|
|
---

## Performance
|
|
| | Metric | Value | |
| |--------|-------| |
| | Validation accuracy | **87.9%** | |
| | Macro F1 | **0.88** | |
| | Inference speed | ~12 ms/frame on RTX 3050 | |
|
|
| | Emotion | Precision | Recall | F1 | |
| |---------|-----------|--------|----| |
| | Angry | 0.85 | 0.83 | 0.84 | |
| | Disgust | 0.97 | 0.90 | 0.94 | |
| | Fear | 0.89 | 0.75 | 0.82 | |
| | Happy | 0.97 | 0.99 | 0.98 | |
| | Neutral | 0.85 | 0.91 | 0.88 | |
| | Sad | 0.78 | 0.88 | 0.83 | |
| | Surprised | 0.83 | 0.90 | 0.86 | |
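
The reported macro F1 is the unweighted mean of the seven per-class F1 scores, which can be checked directly against the table:

```python
# Per-class F1 scores copied from the table above.
f1 = {"Angry": 0.84, "Disgust": 0.94, "Fear": 0.82, "Happy": 0.98,
      "Neutral": 0.88, "Sad": 0.83, "Surprised": 0.86}

macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.88, matching the reported macro F1
```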
|
|
---

## Files in This Repo
|
|
| | File | Size | Required | |
| |------|------|---------| |
| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes – best macro F1 checkpoint |
| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional – SWA ensemble model |
| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes – landmark scaler |
|
|
---

## Quick Start

### 1 – Clone the code
|
|
```bash
git clone https://github.com/Huuffy/VisageCNN.git
cd VisageCNN
python -m venv venv && venv\Scripts\activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```
|
|
### 2 – Download weights
|
|
```python
from huggingface_hub import hf_hub_download
import shutil, pathlib

for remote, local in [
    ("models/weights/hybrid_best_model.pth", "models/weights/hybrid_best_model.pth"),
    ("models/weights/hybrid_swa_final.pth", "models/weights/hybrid_swa_final.pth"),
    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
]:
    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, local)
```
|
|
Or with the HF CLI:
```bash
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
```
|
|
### 3 – Run inference

```bash
# Standard
python inference/run_hybrid.py

# With SWA ensemble
python inference/run_hybrid.py --ensemble
```
|
|
Press **Q** to quit.
|
|
---

## Emotion Classes
|
|
| | Label | Emotion | Key Signals | |
| |-------|---------|-------------| |
| | 0 | Angry | Furrowed brows, tightened jaw | |
| | 1 | Disgust | Raised upper lip, wrinkled nose | |
| | 2 | Fear | Wide eyes, raised brows, open mouth | |
| | 3 | Happy | Raised cheeks, open smile | |
| | 4 | Neutral | Relaxed, no strong deformation | |
| | 5 | Sad | Lowered brow corners, downturned lips | |
| | 6 | Surprised | Raised brows, wide eyes, dropped jaw | |
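
When decoding model outputs, the label order above is what matters. A small sketch, assuming the classifier's logits follow the same 0–6 ordering as the table (the names and helper below are illustrative, not from the repo):

```python
# Index-to-name mapping in the table's label order (an assumption).
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]

def decode(logits):
    """Map a 7-way logit or probability vector to (index, emotion name)."""
    idx = max(range(len(logits)), key=logits.__getitem__)
    return idx, EMOTIONS[idx]

print(decode([0.1, 0.0, 0.05, 0.7, 0.05, 0.05, 0.05]))  # (3, 'Happy')
```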
|
|
---

## Training Dataset
|
|
75,376 total images: 10,768 per class × 7 emotions, perfectly balanced.
|
|
**Sources:** AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle
|
|
All images passed a two-stage quality filter:
1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
2. ViT confidence scoring (`dima806/facial_emotions_image_detection`) with per-class asymmetric mislabel thresholds
|
|
Final class balance was achieved via ViT-scored capping: the lowest-confidence images were removed first, preserving the highest-quality examples in each class.
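
The capping step can be sketched as follows (hypothetical data layout; the actual dataset-preparation scripts may differ):

```python
def cap_class(images, cap):
    """Keep at most `cap` images, dropping the lowest ViT-confidence ones first.

    `images` is a list of (path, vit_confidence) pairs; illustrative only.
    """
    ranked = sorted(images, key=lambda item: item[1], reverse=True)
    return ranked[:cap]

pool = [("a.jpg", 0.99), ("b.jpg", 0.42), ("c.jpg", 0.87)]
print(cap_class(pool, 2))  # [('a.jpg', 0.99), ('c.jpg', 0.87)]
```

Applied per class with `cap=10768`, this yields the balanced 75,376-image set while discarding the least trustworthy labels first.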
|
|
---

## Training Config
|
|
| | Setting | Value | |
| |---------|-------| |
| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
| Optimizer | AdamW, weight decay 0.05 |
| LR | OneCycleLR: CNN 5e-5, fusion 5e-4 |
| Batch | 96 + grad accumulation ×2 (eff. 192) |
| Augmentation | CutMix + noise + rotation + zoom |
| Mixed precision | torch.amp (AMP) |
| Best model saved by | Macro F1 (not val accuracy) |
| SWA | Epochs 30–70, BN update after training |
| | Early stopping | patience=15 on macro F1 | |
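
For reference, a minimal PyTorch sketch of focal loss combined with label smoothing, using the table's values (γ=2.0, smoothing 0.12). This is an assumed formulation, not the repo's exact loss code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.12):
    """Focal loss with label smoothing (illustrative, not the repo's exact code)."""
    num_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    # Smoothed targets: 1 - smoothing on the true class,
    # smoothing / (C - 1) spread over the remaining classes.
    with torch.no_grad():
        true_dist = torch.full_like(log_p, smoothing / (num_classes - 1))
        true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # Focal term (1 - p_t)^gamma down-weights easy, confident examples.
    p_t = log_p.gather(1, targets.unsqueeze(1)).exp().squeeze(1)
    ce = -(true_dist * log_p).sum(dim=-1)
    return ((1 - p_t) ** gamma * ce).mean()
```

With γ=2.0, a sample already predicted at p_t = 0.9 contributes only 1% of the weight of a fully uncertain one, which is what keeps the hard Disgust/Fear examples driving the gradient.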
|
|
---

## Retrain From Scratch
|
|
```bash
# Delete old cache and train
rmdir /s /q models\cache
python scripts/train_hybrid.py
```
|
|
Full guide: [GitHub README](https://github.com/Huuffy/VisageCNN)
|
|
---

<div align="center">

**Built with curiosity and a lot of training runs**
|
|
|
|
</div>
|
|