



## What Is This?

**HybridEmotionNet** is a dual-branch neural network for real-time facial emotion recognition. It fuses EfficientNet-B2 appearance features with MediaPipe 3D landmark geometry via bidirectional cross-attention.

It processes webcam frames at 30+ FPS: it extracts 478 3D landmarks, crops the face to 224×224, and classifies the result into 7 emotions with EMA + sliding-window temporal smoothing.

**Highlights:** 87.9% validation accuracy · Disgust recall 51% → 90% · Fear recall 65% → 75% · 75k balanced training images · ViT-scored quality filtering
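The EMA + sliding-window temporal smoothing can be sketched in plain Python. This is an illustrative stand-in, not the repo's actual implementation; the class name, window size, and EMA weight are assumptions.

```python
from collections import deque

class TemporalSmoother:
    """Hypothetical smoother: EMA over class probabilities plus a
    majority vote over a sliding window of recent per-frame labels."""

    def __init__(self, n_classes=7, alpha=0.3, window=15):
        self.alpha = alpha                      # EMA weight for the newest frame
        self.ema = [1.0 / n_classes] * n_classes
        self.recent = deque(maxlen=window)      # sliding window of argmax labels

    def update(self, probs):
        # EMA over the probability vector damps single-frame flicker
        self.ema = [self.alpha * p + (1 - self.alpha) * e
                    for p, e in zip(probs, self.ema)]
        self.recent.append(max(range(len(probs)), key=probs.__getitem__))
        # Final label: majority vote over the window, ties broken by EMA
        return max(set(self.recent),
                   key=lambda c: (list(self.recent).count(c), self.ema[c]))

s = TemporalSmoother()
label = s.update([0.05, 0.05, 0.05, 0.6, 0.1, 0.1, 0.05])  # → 3 (Happy)
```

A single mispredicted frame cannot flip the output, since it is outvoted by the rest of the window.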


## Architecture

```
Face crop (224×224) ──► EfficientNet-B2 ──► [B, 256] appearance
                         blocks 0-1 frozen
                         blocks 2-8 fine-tuned

478 landmarks (xyz)  ──► MLP encoder    ──► [B, 256] geometry
                         1434 → 512 → 384 → 256

               Bidirectional Cross-Attention (4 heads each)
               ┌──────────────────────────────────────────┐
               │  coord → CNN  (geometry queries appear.) │
               │  CNN  → coord (appear. queries geometry) │
               └──────────────────────────────────────────┘
                               │
               Fusion MLP: 512 → 384 → 256 → 128
                               │
               Classifier:   128 → 7 emotions
```
| Component | Detail |
|---|---|
| CNN branch | EfficientNet-B2, ImageNet init, blocks 0–1 frozen, gradient checkpointing |
| Coord branch | MLP 1434 → 512 → 384 → 256, BN + Dropout |
| Fusion | Bidirectional cross-attention + MLP |
| Parameters | ~8M total |
| Model size | ~90 MB |
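A minimal PyTorch sketch of the dual-branch fusion described above. The module names, the use of `nn.MultiheadAttention`, and the toy appearance encoder are this sketch's assumptions; in the real model the appearance branch is EfficientNet-B2.

```python
import torch
import torch.nn as nn

class HybridFusionSketch(nn.Module):
    """Toy version of the dual-branch fusion (not the repo's exact code)."""

    def __init__(self, n_classes=7):
        super().__init__()
        # Stand-in for EfficientNet-B2: any module yielding a [B, 256] vector
        self.appearance = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        # Coordinate branch: 478 landmarks * 3 coords = 1434 inputs
        self.geometry = nn.Sequential(
            nn.Linear(1434, 512), nn.ReLU(),
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
        )
        # Bidirectional cross-attention, 4 heads in each direction
        self.geo_q_app = nn.MultiheadAttention(256, 4, batch_first=True)
        self.app_q_geo = nn.MultiheadAttention(256, 4, batch_first=True)
        self.fusion = nn.Sequential(
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, face, landmarks):
        a = self.appearance(face).unsqueeze(1)      # [B, 1, 256]
        g = self.geometry(landmarks).unsqueeze(1)   # [B, 1, 256]
        g2a, _ = self.geo_q_app(g, a, a)            # geometry queries appearance
        a2g, _ = self.app_q_geo(a, g, g)            # appearance queries geometry
        fused = torch.cat([g2a, a2g], dim=-1).squeeze(1)  # [B, 512]
        return self.head(self.fusion(fused))        # [B, 7] logits
```

Each branch attends to the other before fusion, so geometric cues (e.g. a wrinkled nose) can re-weight appearance features and vice versa.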

## Performance

| Metric | Value |
|---|---|
| Validation accuracy | 87.9% |
| Macro F1 | 0.88 |
| Inference speed | ~12 ms/frame on RTX 3050 |

| Emotion | Precision | Recall | F1 |
|---|---|---|---|
| Angry | 0.85 | 0.83 | 0.84 |
| Disgust | 0.97 | 0.90 | 0.94 |
| Fear | 0.89 | 0.75 | 0.82 |
| Happy | 0.97 | 0.99 | 0.98 |
| Neutral | 0.85 | 0.91 | 0.88 |
| Sad | 0.78 | 0.88 | 0.83 |
| Surprised | 0.83 | 0.90 | 0.86 |
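The macro F1 above is simply the unweighted mean of the per-class F1 scores:

```python
# Per-class F1 scores from the table above
f1 = {"Angry": 0.84, "Disgust": 0.94, "Fear": 0.82, "Happy": 0.98,
      "Neutral": 0.88, "Sad": 0.83, "Surprised": 0.86}

macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 2))  # 0.88
```

Because every class contributes equally, macro F1 rewards the Disgust/Fear improvements even though those classes are no more frequent than the others at inference time.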

## Files in This Repo

| File | Size | Required |
|---|---|---|
| `models/weights/hybrid_best_model.pth` | ~90 MB | Yes (best macro-F1 checkpoint) |
| `models/weights/hybrid_swa_final.pth` | ~90 MB | Optional (SWA ensemble model) |
| `models/scalers/hybrid_coordinate_scaler.pkl` | 18 KB | Yes (landmark scaler) |

## Quick Start

### 1. Clone the code

```bash
git clone https://github.com/Huuffy/VisageCNN.git
cd VisageCNN
python -m venv venv && venv\Scripts\activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```

### 2. Download weights

```python
from huggingface_hub import hf_hub_download
import shutil, pathlib

for remote, local in [
    ("models/weights/hybrid_best_model.pth",        "models/weights/hybrid_best_model.pth"),
    ("models/weights/hybrid_swa_final.pth",         "models/weights/hybrid_swa_final.pth"),
    ("models/scalers/hybrid_coordinate_scaler.pkl", "models/scalers/hybrid_coordinate_scaler.pkl"),
]:
    src = hf_hub_download(repo_id="Huuffy/VisageCNN", filename=remote)
    pathlib.Path(local).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, local)
```

Or with the HF CLI:

```bash
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_best_model.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/weights/hybrid_swa_final.pth --local-dir .
huggingface-cli download Huuffy/VisageCNN models/scalers/hybrid_coordinate_scaler.pkl --local-dir .
```

### 3. Run inference

```bash
# Standard
python inference/run_hybrid.py

# With SWA ensemble
python inference/run_hybrid.py --ensemble
```

Press **Q** to quit.


## Emotion Classes

| Label | Emotion | Key Signals |
|---|---|---|
| 0 | Angry | Furrowed brows, tightened jaw |
| 1 | Disgust | Raised upper lip, wrinkled nose |
| 2 | Fear | Wide eyes, raised brows, open mouth |
| 3 | Happy | Raised cheeks, open smile |
| 4 | Neutral | Relaxed, no strong deformation |
| 5 | Sad | Lowered brow corners, downturned lips |
| 6 | Surprised | Raised brows, wide eyes, dropped jaw |

## Training Dataset

75,376 total images: 10,768 per class × 7 emotions, perfectly balanced.

Sources: AffectNet · RAF-DB · FER2013 · AffectNet-Short · ScullyowesHenry · RAF-DB Kaggle

All images passed a two-stage quality filter:

  1. MediaPipe FaceMesh (dual confidence: 0.5 normal + 0.2 lenient for extreme expressions)
  2. ViT confidence scoring (dima806/facial_emotions_image_detection) with per-class asymmetric mislabel thresholds

Final class balance was achieved via ViT-scored capping: the lowest-confidence images were removed first, preserving the highest-quality examples per class.
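The capping step can be sketched as follows. The function name, the `score` field, and the toy data are illustrative assumptions; in the real pipeline the cap is 10,768 images per class.

```python
def cap_per_class(images, cap):
    """Keep the `cap` highest-confidence images of each class.

    images: list of dicts with 'label' and 'score' (ViT confidence).
    """
    by_class = {}
    for img in images:
        by_class.setdefault(img["label"], []).append(img)
    kept = []
    for label, group in by_class.items():
        # Sort descending so the lowest-confidence images are dropped first
        group.sort(key=lambda im: im["score"], reverse=True)
        kept.extend(group[:cap])
    return kept

pool = [{"label": "happy", "score": s / 10} for s in range(10)] \
     + [{"label": "fear", "score": 0.9}]
print(len(cap_per_class(pool, cap=5)))  # 6  (top-5 happy + 1 fear)
```

Classes already below the cap are kept whole, so capping only trims the over-represented classes down to the target count.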


## Training Config

| Setting | Value |
|---|---|
| Loss | Focal Loss γ=2.0 + label smoothing 0.12 |
| Optimizer | AdamW, weight decay 0.05 |
| LR | OneCycleLR: CNN 5e-5, fusion 5e-4 |
| Batch | 96 + grad accumulation ×2 (eff. 192) |
| Augmentation | CutMix + noise + rotation + zoom |
| Mixed precision | torch.amp (AMP) |
| Best model | Saved by macro F1 (not val accuracy) |
| SWA | Epochs 30–70, BN update after training |
| Early stopping | Patience=15 on macro F1 |
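The loss row combines focal weighting with label smoothing. A dependency-free sketch of that combination for a single sample follows; the formula and names are this sketch's, not necessarily the repo's exact code.

```python
import math

def focal_smooth_loss(probs, target, gamma=2.0, smoothing=0.12):
    """Focal loss with smoothed targets over one softmax output.

    probs:  predicted class probabilities (must sum to 1)
    target: index of the true class
    """
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        # Smoothed target: (1 - s) on the true class, s/K spread over all classes
        q = (1 - smoothing) * (1.0 if c == target else 0.0) + smoothing / k
        # Focal term (1 - p)^gamma down-weights easy, confident predictions
        loss += -q * (1 - p) ** gamma * math.log(max(p, 1e-12))
    return loss
```

With `gamma=0` and `smoothing=0` this reduces to plain cross-entropy, `-log(p_target)`; raising `gamma` shifts gradient toward hard examples such as Fear and Disgust.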

## Retrain From Scratch

```bash
# Delete old cache and train
rmdir /s /q models\cache
python scripts/train_hybrid.py
```

Full guide: GitHub README


Built with curiosity and a lot of training runs
