---
license: mit
tags:
  - facial-expression-recognition
  - emotion-recognition
  - computer-vision
  - pytorch
  - mediapipe
  - efficientnet
  - real-time
  - image-classification
pipeline_tag: image-classification
---

<div align="center">

![header](https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=VisageCNN&fontSize=70&fontColor=fff&animation=fadeIn&fontAlignY=38&desc=Real-Time%20Facial%20Expression%20Recognition&descAlignY=60&descAlign=50)

<a href="https://git.io/typing-svg"><img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=06B6D4&center=true&vCenter=true&width=750&lines=Hybrid+CNN+%2B+MediaPipe+Landmark+Architecture;7+Emotion+Classes+%E2%80%94+Real-Time+at+30+FPS;Bidirectional+Cross-Attention+%7C+EfficientNet-B2+%2B+478+Landmarks;87.9%25+Validation+Accuracy+%7C+Disgust+92%25+Recall" alt="Typing SVG" /></a>

<br/>

![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&logo=python&logoColor=white)
![PyTorch](https://img.shields.io/badge/PyTorch-2.x-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)
![MediaPipe](https://img.shields.io/badge/MediaPipe-0.10-00BCD4?style=for-the-badge&logo=google&logoColor=white)
![OpenCV](https://img.shields.io/badge/OpenCV-4.x-5C3EE8?style=for-the-badge&logo=opencv&logoColor=white)
![CUDA](https://img.shields.io/badge/CUDA-11.8+-76B900?style=for-the-badge&logo=nvidia&logoColor=white)
![License](https://img.shields.io/badge/License-MIT-yellow?style=for-the-badge)
[![GitHub](https://img.shields.io/badge/GitHub-VisageCNN-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/Huuffy/VisageCNN)

</div>

---

## What Is This?

**HybridEmotionNet** is a dual-branch neural network for real-time facial emotion recognition that fuses **EfficientNet-B2 appearance features** with **MediaPipe 3D landmark geometry** via bidirectional cross-attention.

Processes webcam frames at **30+ FPS**, extracts **478 3D landmarks**, crops the face at 224×224, and classifies into 7 emotions with EMA + sliding-window temporal smoothing.
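The temporal-smoothing step can be sketched as follows. This is a minimal illustration, assuming an EMA over the per-frame class probabilities plus a majority vote over a sliding window of smoothed predictions; `alpha` and `window` are placeholder values, not the repo's actual settings.

```python
from collections import Counter, deque
import numpy as np

class TemporalSmoother:
    """Minimal sketch: EMA over class probabilities plus a sliding-window
    majority vote over per-frame EMA argmaxes."""

    def __init__(self, num_classes=7, alpha=0.3, window=10):
        self.ema = np.full(num_classes, 1.0 / num_classes)  # uniform prior
        self.alpha = alpha
        self.votes = deque(maxlen=window)

    def update(self, probs):
        # Exponential moving average of the softmax output
        self.ema = self.alpha * np.asarray(probs) + (1 - self.alpha) * self.ema
        # Majority vote over the last `window` smoothed predictions
        self.votes.append(int(np.argmax(self.ema)))
        return Counter(self.votes).most_common(1)[0][0]

smoother = TemporalSmoother()
label = smoother.update([0.1, 0.0, 0.0, 0.8, 0.05, 0.05, 0.0])  # Happy
```

Smoothing like this trades a few frames of latency for far fewer label flickers between adjacent frames.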

**Highlights:** 87.9% validation accuracy · Disgust recall 51%→90% · Fear recall 65%→75% · 75k balanced training images · ViT-scored quality filtering

---

## Architecture

![Architecture](https://huggingface.co/Huuffy/VisageCNN/resolve/main/Architecture%20digram.png)

```
Face crop (224×224) ──► EfficientNet-B2 ──► [B, 256] appearance
                         blocks 0-1 frozen
                         blocks 2-8 fine-tuned

478 landmarks (xyz)  ──► MLP encoder    ──► [B, 256] geometry
                         1434 → 512 → 384 → 256

               Bidirectional Cross-Attention (4 heads each)
               ┌──────────────────────────────────────────┐
               │  coord → CNN  (geometry queries appear.) │
               │  CNN  → coord (appear. queries geometry) │
               └──────────────────────────────────────────┘
                               │
               Fusion MLP: 512 → 384 → 256 → 128
                               │
               Classifier:   128 → 7 emotions
```

| Component | Detail |
|-----------|--------|
| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
| Fusion | Bidirectional cross-attention + MLP |
| Parameters | ~8M total |
| Model size | ~90 MB |
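The component table can be sketched as a runnable PyTorch module. This is an illustrative reconstruction, not the repo's code: a tiny conv stem stands in for EfficientNet-B2 so the example runs without pretrained weights, while the coord-MLP, cross-attention, and fusion sizes follow the diagram above.

```python
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    """Sketch of the dual-branch layout: appearance + geometry fused
    via bidirectional cross-attention (shapes per the diagram)."""

    def __init__(self, num_classes=7, dim=256, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(  # stand-in for EfficientNet-B2 -> 256-d
            nn.Conv2d(3, 16, 3, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.coord = nn.Sequential(  # 478 landmarks * xyz = 1434 inputs
            nn.Linear(1434, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 384), nn.ReLU(), nn.Linear(384, dim))
        self.attn_c2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fusion = nn.Sequential(  # 512 -> 384 -> 256 -> 128
            nn.Linear(2 * dim, 384), nn.ReLU(), nn.Linear(384, 256),
            nn.ReLU(), nn.Linear(256, 128))
        self.head = nn.Linear(128, num_classes)

    def forward(self, img, lm):
        a = self.cnn(img).unsqueeze(1)    # [B, 1, 256] appearance tokens
        g = self.coord(lm).unsqueeze(1)   # [B, 1, 256] geometry tokens
        g2a, _ = self.attn_c2a(g, a, a)   # geometry queries appearance
        a2g, _ = self.attn_a2c(a, g, g)   # appearance queries geometry
        fused = torch.cat([g2a, a2g], dim=-1).squeeze(1)  # [B, 512]
        return self.head(self.fusion(fused))              # [B, 7] logits

logits = HybridSketch()(torch.randn(2, 3, 224, 224), torch.randn(2, 1434))
```

Swapping the stem for `torchvision.models.efficientnet_b2` (1408-d features projected to 256) would recover the appearance branch described above.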

---

## Performance 

| Metric | Value |
|--------|-------|
| Validation accuracy | **87.9%** |
| Macro F1 | **0.88** |
| Inference speed | ~12 ms/frame on RTX 3050 |

| Emotion | Precision | Recall | F1 |
|---------|-----------|--------|----|
| Angry | 0.85 | 0.83 | 0.84 |
| Disgust | 0.97 | 0.90 | 0.94 |
| Fear | 0.89 | 0.75 | 0.82 |
| Happy | 0.97 | 0.99 | 0.98 |
| Neutral | 0.85 | 0.91 | 0.88 |
| Sad | 0.78 | 0.88 | 0.83 |
| Surprised | 0.83 | 0.90 | 0.86 |
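As a quick consistency check, the headline macro F1 is the unweighted mean of the per-class F1 scores in the table:

```python
# Per-class F1 scores from the table, in label order
f1 = [0.84, 0.94, 0.82, 0.98, 0.88, 0.83, 0.86]
macro_f1 = sum(f1) / len(f1)
print(round(macro_f1, 2))  # 0.88, matching the headline metric
```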

---

## Files in This Repo

| File | Size | Required |
|------|------|---------|
| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes (best macro-F1 checkpoint) |
| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional (SWA ensemble model) |
| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes (landmark scaler) |

---

## Quick Start

### 1. Clone the code

```bash
git clone https://github.com/Huuffy/VisageCNN.git
cd VisageCNN
python -m venv venv && venv\Scripts\activate  # Windows; on Linux/macOS: source venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```

### 2. Download weights

```python
from huggingface_hub import hf_hub_download
import shutil, pathlib

for remote, local in [
    ("models/weights/hybrid_best_model.pth",        "models/weights/hybrid_best_model.pth"),
    ("models/weights/hybrid_swa_final.pth",         "models/weights/hybrid_swa_final.pth"),
    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
]:
    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, local)
```

Or with the HF CLI:
```bash
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
```

### 3. Run inference

```bash
# Standard
python inference/run_hybrid.py

# With SWA ensemble
python inference/run_hybrid.py --ensemble
```

Press **Q** to quit.

---

## Emotion Classes

| Label | Emotion | Key Signals |
|-------|---------|-------------|
| 0 | Angry | Furrowed brows, tightened jaw |
| 1 | Disgust | Raised upper lip, wrinkled nose |
| 2 | Fear | Wide eyes, raised brows, open mouth |
| 3 | Happy | Raised cheeks, open smile |
| 4 | Neutral | Relaxed, no strong deformation |
| 5 | Sad | Lowered brow corners, downturned lips |
| 6 | Surprised | Raised brows, wide eyes, dropped jaw |
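For inference code, the table reduces to an index-to-name list (order assumed to match the model's output logits):

```python
# Label indices 0-6 as listed in the table above
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprised"]

def decode(index: int) -> str:
    """Map a predicted class index to its emotion name."""
    return EMOTIONS[index]

label = decode(3)  # "Happy"
```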

---

## Training Dataset 

75,376 total images: 10,768 per class × 7 emotions, perfectly balanced.

**Sources:** AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle

All images passed a two-stage quality filter:
1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
2. ViT confidence scoring (`dima806/facial_emotions_image_detection`) with per-class asymmetric mislabel thresholds

Final class balance was achieved via ViT-scored capping: the lowest-confidence images were removed first, preserving the highest-quality examples in each class.
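The capping step can be sketched as follows; `cap_per_class` and its input format are hypothetical names chosen to illustrate the keep-highest-confidence-first idea, not the repo's actual scripts.

```python
from collections import defaultdict

def cap_per_class(scores, cap=10768):
    """Keep at most `cap` images per class, dropping lowest ViT
    confidence first. `scores`: path -> (class_label, confidence)."""
    by_class = defaultdict(list)
    for path, (label, conf) in scores.items():
        by_class[label].append((conf, path))
    kept = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        kept.extend(path for _, path in items[:cap])
    return kept

# Tiny demo: 5 images, keep the 3 most confident
demo = {f"img{i}.jpg": ("happy", i / 10) for i in range(5)}
kept = cap_per_class(demo, cap=3)  # img4, img3, img2
```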

---

## Training Config

| Setting | Value |
|---------|-------|
| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
| Optimizer | AdamW, weight decay 0.05 |
| LR | OneCycleLR (CNN 5e-5, fusion 5e-4) |
| Batch | 96 + grad accumulation ×2 (effective 192) |
| Augmentation | CutMix + noise + rotation + zoom |
| Mixed precision | torch.amp (AMP) |
| Best model saved by | Macro F1 (not val accuracy) |
| SWA | Epochs 30–70, BN update after training |
| Early stopping | patience=15 on macro F1 |
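The loss row can be made concrete. Below is one common formulation of focal loss combined with label smoothing, using the table's γ=2.0 and smoothing 0.12; the repo's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.12):
    """Focal loss with smoothed targets: easy (high-probability)
    classes are down-weighted by (1 - p)^gamma."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Smooth one-hot targets: 1 - eps on the true class, eps/(C-1) elsewhere
    with torch.no_grad():
        true = torch.full_like(log_probs, smoothing / (num_classes - 1))
        true.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    focal = (1.0 - probs) ** gamma
    return -(true * focal * log_probs).sum(dim=-1).mean()

loss = focal_loss(torch.randn(4, 7), torch.tensor([0, 3, 5, 6]))
```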

---

## Retrain From Scratch

```bash
# Delete the feature cache, then retrain (Windows; on Linux/macOS: rm -rf models/cache)
rmdir /s /q models\cache
python scripts/train_hybrid.py
```

Full guide: [GitHub README](https://github.com/Huuffy/VisageCNN)

---

<div align="center">

**Built with curiosity and a lot of training runs**

![footer](https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=120&section=footer)

</div>