---
license: mit
tags:
- facial-expression-recognition
- emotion-recognition
- computer-vision
- pytorch
- mediapipe
- efficientnet
- real-time
- image-classification
pipeline_tag: image-classification
---
<div align="center">

<a href="https://git.io/typing-svg"><img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=06B6D4&center=true&vCenter=true&width=750&lines=Hybrid+CNN+%2B+MediaPipe+Landmark+Architecture;7+Emotion+Classes+%E2%80%94+Real-Time+at+30+FPS;Bidirectional+Cross-Attention+%7C+EfficientNet-B2+%2B+478+Landmarks;87.9%25+Validation+Accuracy+%7C+Disgust+90%25+Recall" alt="Typing SVG" /></a>
<br/>






[VisageCNN on GitHub](https://github.com/Huuffy/VisageCNN)
</div>
---
## What Is This?
**HybridEmotionNet** is a dual-branch neural network for real-time facial emotion recognition that fuses **EfficientNet-B2 appearance features** with **MediaPipe 3D landmark geometry** via bidirectional cross-attention.
Processes webcam frames at **30+ FPS**, extracts **478 3D landmarks**, crops the face at 224×224, and classifies into 7 emotions with EMA + sliding-window temporal smoothing.
**Highlights:** 87.9% validation accuracy · Disgust recall 51% → 90% · Fear recall 65% → 75% · 75k balanced training images · ViT-scored quality filtering
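
The EMA + sliding-window smoothing mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the repo's actual code; the class name, `alpha`, and window size are hypothetical:

```python
from collections import deque

class TemporalSmoother:
    """EMA over per-class probabilities plus a sliding-window vote.

    Illustrative sketch: alpha controls how fast the EMA tracks new
    frames; the window vote suppresses single-frame flicker.
    """
    def __init__(self, n_classes=7, alpha=0.3, window=10):
        self.alpha = alpha
        self.ema = [1.0 / n_classes] * n_classes  # start from uniform
        self.votes = deque(maxlen=window)

    def update(self, probs):
        # Exponential moving average of the per-frame softmax output
        self.ema = [self.alpha * p + (1 - self.alpha) * e
                    for p, e in zip(probs, self.ema)]
        # Vote on the EMA-smoothed argmax over the last `window` frames
        self.votes.append(max(range(len(self.ema)), key=self.ema.__getitem__))
        return max(set(self.votes), key=self.votes.count)
```

A few consistent frames are needed before the smoothed label flips, which is what keeps the webcam overlay stable at 30 FPS.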
---
## Architecture

```
Face crop (224×224) ──► EfficientNet-B2 ──► [B, 256] appearance
                        blocks 0-1 frozen
                        blocks 2-8 fine-tuned

478 landmarks (xyz) ──► MLP encoder ──► [B, 256] geometry
                        1434 → 512 → 384 → 256

        Bidirectional Cross-Attention (4 heads each)
        ┌──────────────────────────────────────────┐
        │ coord → CNN  (geometry queries appear.)  │
        │ CNN → coord  (appear. queries geometry)  │
        └──────────────────────────────────────────┘
                            │
            Fusion MLP: 512 → 384 → 256 → 128
                            │
            Classifier: 128 → 7 emotions
```
| Component | Detail |
|-----------|--------|
| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
| Fusion | Bidirectional cross-attention + MLP |
| Parameters | ~8M total |
| Model size | ~90 MB |
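
The fusion stage can be sketched in PyTorch roughly as below. This is a hedged sketch assuming each branch emits a single 256-d feature vector treated as a length-1 token sequence; the module name and internals are illustrative, not the repo's implementation:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Geometry queries appearance and vice versa; the two attended
    vectors are concatenated for the downstream fusion MLP."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.g2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2g = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, appearance, geometry):
        a = appearance.unsqueeze(1)          # [B, 1, 256]
        g = geometry.unsqueeze(1)            # [B, 1, 256]
        geo_att, _ = self.g2a(g, a, a)       # coord → CNN
        app_att, _ = self.a2g(a, g, g)       # CNN → coord
        # Concatenate both attended views → [B, 512] fusion input
        return torch.cat([geo_att.squeeze(1), app_att.squeeze(1)], dim=-1)

fusion = BidirectionalCrossAttention()
fused = fusion(torch.randn(2, 256), torch.randn(2, 256))
```

The concatenated 512-d output matches the 512 → 384 → 256 → 128 fusion MLP in the diagram above.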
---
## Performance
| Metric | Value |
|--------|-------|
| Validation accuracy | **87.9%** |
| Macro F1 | **0.88** |
| Inference speed | ~12 ms/frame on RTX 3050 |
Per-class results:

| Emotion | Precision | Recall | F1 |
|---------|-----------|--------|----|
| Angry | 0.85 | 0.83 | 0.84 |
| Disgust | 0.97 | 0.90 | 0.94 |
| Fear | 0.89 | 0.75 | 0.82 |
| Happy | 0.97 | 0.99 | 0.98 |
| Neutral | 0.85 | 0.91 | 0.88 |
| Sad | 0.78 | 0.88 | 0.83 |
| Surprised | 0.83 | 0.90 | 0.86 |
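
The macro F1 reported above is the unweighted mean of the per-class F1 scores, which the table's values reproduce:

```python
# Per-class F1 scores from the table above
f1 = {"Angry": 0.84, "Disgust": 0.94, "Fear": 0.82, "Happy": 0.98,
      "Neutral": 0.88, "Sad": 0.83, "Surprised": 0.86}

# Macro F1 = unweighted average across the 7 classes
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.88
```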
---
## Files in This Repo
| File | Size | Required |
|------|------|---------|
| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes (best macro F1 checkpoint) |
| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional (SWA ensemble model) |
| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes (landmark scaler) |
---
## Quick Start
### 1. Clone the code
```bash
git clone https://github.com/Huuffy/VisageCNN.git
cd VisageCNN
python -m venv venv && venv\Scripts\activate  # Windows; on Linux/macOS: source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```
### 2. Download weights
```python
from huggingface_hub import hf_hub_download
import shutil, pathlib
for remote, local in [
    ("models/weights/hybrid_best_model.pth", "models/weights/hybrid_best_model.pth"),
    ("models/weights/hybrid_swa_final.pth", "models/weights/hybrid_swa_final.pth"),
    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
]:
    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, local)
```
Or with the HF CLI:
```bash
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
```
### 3. Run inference
```bash
# Standard
python inference/run_hybrid.py
# With SWA ensemble
python inference/run_hybrid.py --ensemble
```
Press **Q** to quit.
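
For programmatic use, the coord-branch input is the 478 landmarks flattened to a 1434-d vector and passed through the shipped scaler. A minimal sketch, with random landmarks and a plain standardization standing in for `hybrid_coordinate_scaler.pkl`'s `transform`:

```python
import numpy as np

# 478 MediaPipe landmarks, each (x, y, z) → flattened 1434-d vector,
# matching the coord branch's 1434 → 512 → 384 → 256 input size
landmarks = np.random.rand(478, 3).astype(np.float32)
vec = landmarks.reshape(1, -1)  # shape [1, 1434]

# Stand-in for the fitted scaler's transform() on the same vector
mean, std = vec.mean(), vec.std() + 1e-8
scaled = (vec - mean) / std
```

In the real pipeline the pickled scaler is loaded once at startup and applied per frame before the forward pass.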
---
## Emotion Classes
| Label | Emotion | Key Signals |
|-------|---------|-------------|
| 0 | Angry | Furrowed brows, tightened jaw |
| 1 | Disgust | Raised upper lip, wrinkled nose |
| 2 | Fear | Wide eyes, raised brows, open mouth |
| 3 | Happy | Raised cheeks, open smile |
| 4 | Neutral | Relaxed, no strong deformation |
| 5 | Sad | Lowered brow corners, downturned lips |
| 6 | Surprised | Raised brows, wide eyes, dropped jaw |
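
The label indices above map to model outputs by argmax; a small helper (names illustrative):

```python
# Class order matches the label column in the table above
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]

def label_to_emotion(scores):
    """Map a 7-way score vector to its emotion name via argmax."""
    return EMOTIONS[max(range(len(scores)), key=scores.__getitem__)]
```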
---
## Training Dataset
75,376 total images: 10,768 per class × 7 emotions, perfectly balanced.
**Sources:** AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle
All images passed a two-stage quality filter:
1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
2. ViT confidence scoring (`dima806/facial_emotions_image_detection`) with per-class asymmetric mislabel thresholds
Final class balance achieved via ViT-scored capping: the lowest-confidence images are removed first, preserving the highest-quality examples per class.
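
The capping step can be sketched like this (a toy illustration of the idea; file names and scores are made up):

```python
def cap_by_confidence(scored, cap=10768):
    """Keep at most `cap` images per class, dropping the lowest
    ViT-confidence examples first."""
    return {cls: sorted(items, key=lambda x: x[1], reverse=True)[:cap]
            for cls, items in scored.items()}

# Toy example: one class with 5 scored images, capped at 3
toy = {"happy": [("a.jpg", 0.91), ("b.jpg", 0.55), ("c.jpg", 0.99),
                 ("d.jpg", 0.73), ("e.jpg", 0.88)]}
capped = cap_by_confidence(toy, cap=3)
```

With the real cap of 10,768 per class this yields the 75,376-image balanced set described above.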
---
## Training Config
| Setting | Value |
|---------|-------|
| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
| Optimizer | AdamW, weight decay 0.05 |
| LR | OneCycleLR (CNN 5e-5, fusion 5e-4) |
| Batch | 96 + grad accumulation ×2 (eff. 192) |
| Augmentation | CutMix + noise + rotation + zoom |
| Mixed precision | torch.amp (AMP) |
| Best model saved by | Macro F1 (not val accuracy) |
| SWA | Epochs 30–70, BN update after training |
| Early stopping | patience=15 on macro F1 |
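
The loss in the table, focal loss with label smoothing, can be sketched as follows. This is a common formulation, not necessarily the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.12, n_classes=7):
    """Focal loss (γ=2.0) combined with label smoothing (0.12)."""
    log_p = F.log_softmax(logits, dim=-1)
    # Smoothed one-hot targets: 1 - smoothing on the true class,
    # smoothing spread uniformly over the remaining classes
    with torch.no_grad():
        true = torch.full_like(log_p, smoothing / (n_classes - 1))
        true.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # (1 - p)^gamma down-weights easy, high-confidence examples
    focal = (1 - log_p.exp()) ** gamma
    return -(true * focal * log_p).sum(dim=-1).mean()

loss = focal_loss(torch.randn(4, 7), torch.tensor([0, 3, 6, 2]))
```

With γ=0 and smoothing=0 this reduces to plain cross-entropy, which is a quick sanity check when tuning.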
---
## Retrain From Scratch
```bash
# Delete old cache and retrain (Windows; on Linux/macOS: rm -rf models/cache)
rmdir /s /q models\cache
python scripts/train_hybrid.py
```
Full guide: [GitHub README](https://github.com/Huuffy/VisageCNN)
---
<div align="center">
**Built with curiosity and a lot of training runs**

</div>