



## What Is This?

**HybridEmotionNet** is a dual-branch neural network for real-time facial emotion recognition. It fuses EfficientNet-B2 appearance features with MediaPipe 3D landmark geometry via bidirectional cross-attention.

It processes webcam frames at 30+ FPS: it extracts 478 3D landmarks, crops the face to 224×224, and classifies the result into 7 emotions with EMA + sliding-window temporal smoothing.

**Highlights:** 87.9% validation accuracy · Disgust recall 51% → 90% · Fear recall 65% → 75% · 75k balanced training images · ViT-scored quality filtering
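The EMA + sliding-window temporal smoothing can be sketched in plain Python. This is an illustrative stand-in, not the repo's actual implementation; the class name, window size, and EMA weight are assumptions.

```python
from collections import deque

class TemporalSmoother:
    """Hypothetical smoother: EMA over class probabilities plus a
    majority vote over a sliding window of recent per-frame labels."""

    def __init__(self, n_classes=7, alpha=0.3, window=15):
        self.alpha = alpha                      # EMA weight for the newest frame
        self.ema = [1.0 / n_classes] * n_classes
        self.recent = deque(maxlen=window)      # sliding window of argmax labels

    def update(self, probs):
        # EMA over the probability vector damps single-frame flicker
        self.ema = [self.alpha * p + (1 - self.alpha) * e
                    for p, e in zip(probs, self.ema)]
        self.recent.append(max(range(len(probs)), key=probs.__getitem__))
        # Final label: majority vote over the window, ties broken by EMA
        return max(set(self.recent),
                   key=lambda c: (list(self.recent).count(c), self.ema[c]))

s = TemporalSmoother()
label = s.update([0.05, 0.05, 0.05, 0.6, 0.1, 0.1, 0.05])  # → 3 (Happy)
```

A single mispredicted frame cannot flip the output, since it is outvoted by the rest of the window.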


## Architecture

```
Face crop (224×224) ──► EfficientNet-B2 ──► [B, 256] appearance
                         blocks 0-1 frozen
                         blocks 2-8 fine-tuned

478 landmarks (xyz)  ──► MLP encoder    ──► [B, 256] geometry
                         1434 → 512 → 384 → 256

               Bidirectional Cross-Attention (4 heads each)
               ┌──────────────────────────────────────────┐
               │  coord → CNN  (geometry queries appear.) │
               │  CNN  → coord (appear. queries geometry) │
               └──────────────────────────────────────────┘
                               │
               Fusion MLP: 512 → 384 → 256 → 128
                               │
               Classifier:   128 → 7 emotions
```
| Component | Detail |
|---|---|
| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
| Fusion | Bidirectional cross-attention + MLP |
| Parameters | ~8M total |
| Model size | ~90 MB |
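A minimal PyTorch sketch of the dual-branch fusion described above. The module names, the use of `nn.MultiheadAttention`, and the toy appearance encoder are this sketch's assumptions; in the real model the appearance branch is EfficientNet-B2.

```python
import torch
import torch.nn as nn

class HybridFusionSketch(nn.Module):
    """Toy version of the dual-branch fusion (not the repo's exact code)."""

    def __init__(self, n_classes=7):
        super().__init__()
        # Stand-in for EfficientNet-B2: any module yielding a [B, 256] vector
        self.appearance = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        # Coordinate branch: 478 landmarks * 3 coords = 1434 inputs
        self.geometry = nn.Sequential(
            nn.Linear(1434, 512), nn.ReLU(),
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
        )
        # Bidirectional cross-attention, 4 heads in each direction
        self.geo_q_app = nn.MultiheadAttention(256, 4, batch_first=True)
        self.app_q_geo = nn.MultiheadAttention(256, 4, batch_first=True)
        self.fusion = nn.Sequential(
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, face, landmarks):
        a = self.appearance(face).unsqueeze(1)      # [B, 1, 256]
        g = self.geometry(landmarks).unsqueeze(1)   # [B, 1, 256]
        g2a, _ = self.geo_q_app(g, a, a)            # geometry queries appearance
        a2g, _ = self.app_q_geo(a, g, g)            # appearance queries geometry
        fused = torch.cat([g2a, a2g], dim=-1).squeeze(1)  # [B, 512]
        return self.head(self.fusion(fused))        # [B, 7] logits
```

Each branch attends to the other before fusion, so geometric cues (e.g. a wrinkled nose) can re-weight appearance features and vice versa.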

## Performance

| Metric | Value |
|---|---|
| Validation accuracy | 87.9% |
| Macro F1 | 0.88 |
| Inference speed | ~12 ms/frame on RTX 3050 |

| Emotion | Precision | Recall | F1 |
|---|---|---|---|
| Angry | 0.85 | 0.83 | 0.84 |
| Disgust | 0.97 | 0.90 | 0.94 |
| Fear | 0.89 | 0.75 | 0.82 |
| Happy | 0.97 | 0.99 | 0.98 |
| Neutral | 0.85 | 0.91 | 0.88 |
| Sad | 0.78 | 0.88 | 0.83 |
| Surprised | 0.83 | 0.90 | 0.86 |
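The macro F1 above is simply the unweighted mean of the per-class F1 scores:

```python
# Per-class F1 scores from the table above
f1 = {"Angry": 0.84, "Disgust": 0.94, "Fear": 0.82, "Happy": 0.98,
      "Neutral": 0.88, "Sad": 0.83, "Surprised": 0.86}

macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.88
```

Because every class contributes equally, macro F1 rewards the Disgust/Fear improvements even though those classes are no more frequent than the others at inference time.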

## Files in This Repo

| File | Size | Required |
|---|---|---|
| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes (best macro-F1 checkpoint) |
| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional (SWA ensemble model) |
| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes (landmark scaler) |

## Quick Start

### 1. Clone the code

```bash
git clone https://github.com/Huuffy/VisageCNN.git
cd VisageCNN
python -m venv venv && venv\Scripts\activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```

### 2. Download weights

```python
from huggingface_hub import hf_hub_download
import shutil, pathlib

for remote, local in [
    ("models/weights/hybrid_best_model.pth",        "models/weights/hybrid_best_model.pth"),
    ("models/weights/hybrid_swa_final.pth",         "models/weights/hybrid_swa_final.pth"),
    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
]:
    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, local)
```

Or with the HF CLI:

```bash
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
```

### 3. Run inference

```bash
# Standard
python inference/run_hybrid.py

# With SWA ensemble
python inference/run_hybrid.py --ensemble
```

Press **Q** to quit.


## Emotion Classes

| Label | Emotion | Key Signals |
|---|---|---|
| 0 | Angry | Furrowed brows, tightened jaw |
| 1 | Disgust | Raised upper lip, wrinkled nose |
| 2 | Fear | Wide eyes, raised brows, open mouth |
| 3 | Happy | Raised cheeks, open smile |
| 4 | Neutral | Relaxed, no strong deformation |
| 5 | Sad | Lowered brow corners, downturned lips |
| 6 | Surprised | Raised brows, wide eyes, dropped jaw |

## Training Dataset

75,376 total images: 10,768 per class × 7 emotions, perfectly balanced.

Sources: AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle

All images passed a two-stage quality filter:

  1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
  2. ViT confidence scoring (dima806/facial_emotions_image_detection) with per-class asymmetric mislabel thresholds

Final class balance was achieved via ViT-scored capping: the lowest-confidence images were removed first, preserving the highest-quality examples per class.
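The capping step can be sketched as follows. The function name, the `score` field, and the toy data are illustrative assumptions; in the real pipeline the cap is 10,768 images per class.

```python
def cap_per_class(images, cap):
    """Keep the `cap` highest-confidence images of each class.

    images: list of dicts with 'label' and 'score' (ViT confidence).
    """
    by_class = {}
    for img in images:
        by_class.setdefault(img["label"], []).append(img)
    kept = []
    for label, group in by_class.items():
        # Sort descending so the lowest-confidence images are dropped first
        group.sort(key=lambda im: im["score"], reverse=True)
        kept.extend(group[:cap])
    return kept

pool = [{"label": "happy", "score": s / 10} for s in range(10)] \
     + [{"label": "fear", "score": 0.9}]
print(len(cap_per_class(pool, cap=5)))  # 6  (top-5 happy + 1 fear)
```

Classes already below the cap are kept whole, so capping only trims the over-represented classes down to the target count.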


## Training Config

| Setting | Value |
|---|---|
| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
| Optimizer | AdamW, weight decay 0.05 |
| LR | OneCycleLR: CNN 5e-5, fusion 5e-4 |
| Batch | 96 + grad accumulation ×2 (eff. 192) |
| Augmentation | CutMix + noise + rotation + zoom |
| Mixed precision | torch.amp (AMP) |
| Best model | Saved by macro F1 (not val accuracy) |
| SWA | Epochs 30–70, BN update after training |
| Early stopping | Patience=15 on macro F1 |
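The loss row combines focal weighting with label smoothing. A dependency-free sketch of that combination for a single sample follows; the formula and names are this sketch's, not necessarily the repo's exact code.

```python
import math

def focal_smooth_loss(probs, target, gamma=2.0, smoothing=0.12):
    """Focal loss with smoothed targets over one softmax output.

    probs:  predicted class probabilities (must sum to 1)
    target: index of the true class
    """
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        # Smoothed target: (1 - s) on the true class, s/K spread over all classes
        q = (1 - smoothing) * (1.0 if c == target else 0.0) + smoothing / k
        # Focal term (1 - p)^gamma down-weights easy, confident predictions
        loss += -q * (1 - p) ** gamma * math.log(max(p, 1e-12))
    return loss
```

With `gamma=0` and `smoothing=0` this reduces to plain cross-entropy, `-log(p_target)`; raising `gamma` shifts gradient toward hard examples such as Fear and Disgust.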

## Retrain From Scratch

```bash
# Delete old cache and train
rmdir /s /q models\cache
python scripts/train_hybrid.py
```

Full guide: GitHub README


Built with curiosity and a lot of training runs
