FaceEmo-Set: A Balanced and Diverse Dataset for Facial Emotion Recognition
Model Description
This repository contains the FaceEmo-Set models, which use the Vision Transformer (ViT) architecture for facial emotion recognition across seven basic emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise.
We provide two model variants:
- FaceEmo-Set ViT Model: Trained exclusively on FaceEmo-Set (25,200 balanced images)
- Combined ViT Model: Trained on FaceEmo-Set + FER2013 + RAF-DB + RAVDESS for enhanced cross-dataset generalization
Key Features
- ✅ Balanced representation across all seven emotion categories
- ✅ Improved recall on classes that are under-represented in common FER datasets (disgust, fear, sadness)
- ✅ Cross-dataset generalization validated on AffectNet and FER2013
- ✅ Vision Transformer architecture (ViT-Base/16) with full fine-tuning
Model Performance
Per-class columns in the tables below report recall on the named test set.
FaceEmo-Set Model Performance
| Test Dataset | Overall Accuracy | Angry | Disgust | Fear | Happy | Neutral | Sad | Surprise |
|---|---|---|---|---|---|---|---|---|
| AffectNet | 60.84% | 0.50 | 0.75 | 0.53 | 0.87 | 0.60 | 0.44 | 0.58 |
| FER2013 | 56.19% | 0.42 | 0.70 | 0.24 | 0.83 | 0.62 | 0.62 | 0.71 |
Combined Model Performance
| Test Dataset | Overall Accuracy | Angry | Disgust | Fear | Happy | Neutral | Sad | Surprise |
|---|---|---|---|---|---|---|---|---|
| AffectNet | 65.27% | 0.62 | 0.79 | 0.63 | 0.93 | 0.39 | 0.53 | 0.69 |
| FER2013 | 69.32% | 0.56 | 0.64 | 0.55 | 0.93 | 0.54 | 0.65 | 0.82 |
Notable Achievement: The FaceEmo-Set model reaches 0.75 recall for disgust on AffectNet, compared with 0.04 recall for a model trained on the imbalanced FER2013 dataset.
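The per-class figures can be reproduced from a list of true and predicted labels. The sketch below is a minimal illustration, assuming the per-class columns in the tables are recall (consistent with the 0.75 disgust figure quoted above). The function names `per_class_recall` and `macro_recall` are hypothetical, and the column layout of the released prediction CSVs is not assumed here.

```python
from collections import Counter


def per_class_recall(y_true, y_pred, labels):
    """Recall per label: correct predictions for a class / samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Labels with no true samples are skipped to avoid dividing by zero.
    return {label: hits[label] / support[label] for label in labels if support[label]}


def macro_recall(recalls):
    """Unweighted mean of the per-class recall values."""
    return sum(recalls.values()) / len(recalls)


# Toy labels, for illustration only (not taken from the released predictions).
toy_true = ["anger", "anger", "disgust", "disgust", "fear"]
toy_pred = ["anger", "fear", "disgust", "disgust", "fear"]
toy_recall = per_class_recall(toy_true, toy_pred, ["anger", "disgust", "fear"])
print(toy_recall)

# Per-class values copied from the FaceEmo-Set model's AffectNet row.
AFFECTNET_ROW = {
    "anger": 0.50, "disgust": 0.75, "fear": 0.53, "happiness": 0.87,
    "neutral": 0.60, "sadness": 0.44, "surprise": 0.58,
}
print(f"macro recall: {macro_recall(AFFECTNET_ROW):.2f}")
```

For the AffectNet row, the unweighted mean of the seven per-class values is 0.61, close to the reported 60.84% overall accuracy. That is what one would expect if the test classes are roughly equal in size.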
Dataset Description
FaceEmo-Set is a curated dataset of 25,200 validated images (selected from 28,979 collected), designed to address limitations of existing facial emotion recognition (FER) datasets.
Design Principles
- Class Balance: Near-uniform distribution (3,301 to 3,968 samples per emotion, max ratio 1.2:1)
- Source Diversity: Multi-source integration (movies, TV shows, GIFs, internet images, AI-generated content, established datasets)
- Quality Variance: Deliberate inclusion of varying resolutions, lighting, and acquisition conditions
Data Sources
- Dynamic media (movies, TV shows, GIFs)
- Static internet images
- AI-generated content
- Samples from FER2013, CREMA-D, and RAVDESS (200, 100, 100 per emotion respectively)
Validation Protocol
- Multi-annotator validation (3 annotators per image)
- Majority agreement required (≥2/3 consensus)
- 13% rejection rate (3,779 ambiguous/low-quality images excluded)
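The consensus rule can be sketched as a simple majority vote over the three annotator labels. This is an illustration only: the function name `consensus_label` is hypothetical, the actual annotation tooling is not described here, and the sketch does not model rejections for low image quality.

```python
from collections import Counter


def consensus_label(votes, min_agree=2):
    """Return the majority label, or None when fewer than min_agree annotators agree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


print(consensus_label(["fear", "fear", "surprise"]))     # two of three agree
print(consensus_label(["fear", "sadness", "surprise"]))  # no majority, image rejected

# Rejection rate from the counts reported in this card.
collected = 28_979
validated = 25_200
rejected = collected - validated
rejection_rate = rejected / collected
print(f"rejected {rejected} images ({rejection_rate:.1%})")
```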
Emotion Distribution
| Emotion | Training | Validation | Total |
|---|---|---|---|
| Anger | 3,372 | 596 | 3,968 |
| Disgust | 2,985 | 527 | 3,512 |
| Fear | 2,807 | 496 | 3,303 |
| Happiness | 3,231 | 571 | 3,802 |
| Neutral | 3,130 | 553 | 3,683 |
| Sadness | 3,086 | 545 | 3,631 |
| Surprise | 2,805 | 496 | 3,301 |
| Total | 21,416 | 3,784 | 25,200 |
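The balance claim can be checked directly from the table. The short sketch below recomputes the largest-to-smallest class ratio and the training share; the variable names are illustrative.

```python
# Per-emotion counts copied from the distribution table above.
TOTALS = {
    "anger": 3968, "disgust": 3512, "fear": 3303, "happiness": 3802,
    "neutral": 3683, "sadness": 3631, "surprise": 3301,
}
TRAIN = {
    "anger": 3372, "disgust": 2985, "fear": 2807, "happiness": 3231,
    "neutral": 3130, "sadness": 3086, "surprise": 2805,
}

# Ratio between the largest and smallest class.
imbalance_ratio = max(TOTALS.values()) / min(TOTALS.values())

# Fraction of all images that fall in the training split.
train_share = sum(TRAIN.values()) / sum(TOTALS.values())

print(f"max/min class ratio: {imbalance_ratio:.2f}")
print(f"training share: {train_share:.1%}")
```

With the table's figures this gives a ratio of about 1.20 (anger versus surprise) and a training share of about 85%.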
Visual Overview
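Dataset Creation Pipeline
[Figure: dataset construction pipeline, FaceEmoset_Creation_pipeline.png]
Confusion Matrices
[Figure: confusion matrix of the FaceEmo-Set model on AffectNet, FaceEmo-Set_train_AffectNet_test_confusion_m.png]
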
Available Files
Model Weights
- FaceEmo-Set_ViT_model_weights.pth: FaceEmo-Set standalone model
- comb_data_ViT_model_weights.pth: Combined multi-dataset model
Predictions and Results
- FaceEmo-Set_train_FER2013_test_predictions.csv
- FaceEmo-Set_train_AffectNet_test_predictions.csv
- comb_data_train_FER2013_test_predictions.csv
- comb_data_train_AffectNet_test_predictions.csv
Visualizations
- FaceEmoset_Creation_pipeline.png: Dataset construction pipeline
- FaceEmo-Set_train_FER2013_test_confusion_m.png: Confusion matrix (FER2013)
- FaceEmo-Set_train_AffectNet_test_confusion_m.png: Confusion matrix (AffectNet)
Dataset Features
Features are available at: https://huggingface.co/datasets/jihedjabnoun/faceemo-set-features
- FaceEmo-Set_Data_features.npz: Extracted feature representations
- combined_vit_metadata_all.csv: Metadata
Note: Raw image data cannot be released due to copyright restrictions on source materials.
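Since the array names inside the feature archive are not documented here, a reasonable first step is to list them. The sketch below is a minimal helper under that assumption: `describe_npz` is a hypothetical name, the demo archive is synthetic, and the commented download call assumes the file sits at the root of the features dataset repository.

```python
import io

import numpy as np


def describe_npz(source):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    with np.load(source) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Synthetic stand-in archive; the real array names and shapes may differ.
buffer = io.BytesIO()
np.savez(buffer, features=np.zeros((4, 768), dtype=np.float32), labels=np.arange(4))
buffer.seek(0)
demo_summary = describe_npz(buffer)
print(demo_summary)

# To inspect the released file instead (needs network access and huggingface_hub):
# from huggingface_hub import hf_hub_download
# path = hf_hub_download(
#     repo_id="jihedjabnoun/faceemo-set-features",
#     filename="FaceEmo-Set_Data_features.npz",  # assumed location in the repo
#     repo_type="dataset",
# )
# print(describe_npz(path))
```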
Quick Start
Installation
pip install torch torchvision transformers pillow huggingface_hub pandas ipython
Single Image Inference
import warnings

import torch
from huggingface_hub import hf_hub_download
from IPython.display import Markdown, display
from PIL import Image
from transformers import AutoImageProcessor, ViTConfig, ViTForImageClassification
from transformers.utils import logging as hf_logging

# Silence library warnings so the notebook output stays readable.
warnings.filterwarnings("ignore")
hf_logging.set_verbosity_error()

REPO_ID = "jihedjabnoun/faceemo-set"
MODEL_TYPE = "combined"  # "faceemo" or "combined"
FILES = {
    "faceemo": ("FaceEmo-Set_ViT_model_weights.pth", "FaceEmo-Set_ViT_model"),
    "combined": ("comb_data_ViT_model_weights.pth", "Combined_ViT_model"),
}
MODEL_FILE, MODEL_NAME = FILES[MODEL_TYPE]
# Class index i of the model output is reported as EMOTIONS[i].
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
IMAGE_PATH = "9.png"  # path to a cropped face image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the ViT-Base/16 architecture with a 7-class head, then load the fine-tuned weights.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k", use_fast=True)
config = ViTConfig.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=len(EMOTIONS))
model = ViTForImageClassification(config)

weights_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)
state = torch.load(weights_path, map_location="cpu")
# Strip the "module." key prefix that nn.DataParallel adds when saving.
if any(k.startswith("module.") for k in state):
    state = {k.replace("module.", "", 1): v for k, v in state.items()}
model.load_state_dict(state, strict=True)
model.to(device).eval()
print(f"✅ {MODEL_NAME} loaded successfully")

image = Image.open(IMAGE_PATH).convert("RGB")
display(image)

inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=1)[0]

# Report the three most probable emotions.
top = torch.topk(probs, 3)
lines = []
for i, (idx, p) in enumerate(zip(top.indices.tolist(), top.values.tolist()), 1):
    lines.append(f"{i}. **{EMOTIONS[idx]}** - `{p:.2%}`")
display(Markdown("### Prediction (Top-3)\n" + "\n".join(lines)))
Batch Inference
import warnings

import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from IPython.display import display
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import AutoImageProcessor, ViTConfig, ViTForImageClassification
from transformers.utils import logging as hf_logging

# Silence library warnings so the notebook output stays readable.
warnings.filterwarnings("ignore")
hf_logging.set_verbosity_error()

REPO_ID = "jihedjabnoun/faceemo-set"
MODEL_TYPE = "faceemo"  # "faceemo" or "combined"
FILES = {
    "faceemo": ("FaceEmo-Set_ViT_model_weights.pth", "FaceEmo-Set_ViT_model"),
    "combined": ("comb_data_ViT_model_weights.pth", "Combined_ViT_model"),
}
MODEL_FILE, MODEL_NAME = FILES[MODEL_TYPE]
# Class index i of the model output is reported as EMOTIONS[i].
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
image_paths = ["9.png", "C.png"]  # paths to cropped face images

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the ViT-Base/16 architecture with a 7-class head, then load the fine-tuned weights.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k", use_fast=True)
config = ViTConfig.from_pretrained("google/vit-base-patch16-224-in21k", num_labels=len(EMOTIONS))
model = ViTForImageClassification(config)

weights_path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE)
state = torch.load(weights_path, map_location="cpu")
# Strip the "module." key prefix that nn.DataParallel adds when saving.
if any(k.startswith("module.") for k in state):
    state = {k.replace("module.", "", 1): v for k, v in state.items()}
model.load_state_dict(state, strict=True)
model.to(device).eval()
print(f"✅ {MODEL_NAME} loaded successfully")


class ImgListDataset(Dataset):
    """Loads images from a list of paths and yields (pixel_values, path) pairs."""

    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB").resize((224, 224))
        pixel = processor(images=image, return_tensors="pt")["pixel_values"].squeeze(0)
        return pixel, self.paths[idx]


loader = DataLoader(ImgListDataset(image_paths), batch_size=32, shuffle=False)

# Collect the top-3 predictions for every image.
rows = []
with torch.no_grad():
    for pixels, paths in loader:
        pixels = pixels.to(device)
        probs = torch.softmax(model(pixels).logits, dim=1)
        top = torch.topk(probs, 3, dim=1)
        for path, idxs, vals in zip(paths, top.indices.cpu().tolist(), top.values.cpu().tolist()):
            rows.append({
                "image": path,
                "top1": f"{EMOTIONS[idxs[0]]} ({vals[0]:.2%})",
                "top2": f"{EMOTIONS[idxs[1]]} ({vals[1]:.2%})",
                "top3": f"{EMOTIONS[idxs[2]]} ({vals[2]:.2%})",
            })
display(pd.DataFrame(rows))
Model Architecture
- Base Model: Vision Transformer (ViT-Base/16)
- Pre-training: ImageNet-21k
- Input Size: 224×224 pixels
- Patch Size: 16×16 pixels
- Fine-tuning: Full parameter fine-tuning
- Optimizer: Adam (learning rate: 5×10⁻⁵)
- Loss Function: Cross-Entropy Loss
- Batch Size: 32
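The input and patch sizes above determine the sequence length the transformer sees. The arithmetic below is a small worked example; the steps-per-epoch figure assumes the FaceEmo-Set training split of 21,416 images and that the last partial batch is kept, neither of which is stated as the authors' exact setup.

```python
import math

image_size = 224   # input resolution in pixels (square)
patch_size = 16    # patch edge in pixels
channels = 3       # RGB

patches_per_side = image_size // patch_size        # patches along one edge
num_patches = patches_per_side ** 2                # patches per image
sequence_length = num_patches + 1                  # plus the [CLS] token
values_per_patch = patch_size * patch_size * channels  # raw pixel values per patch

# Assumed: FaceEmo-Set training split, final partial batch not dropped.
train_images = 21_416
batch_size = 32
steps_per_epoch = math.ceil(train_images / batch_size)

print(f"{patches_per_side}x{patches_per_side} = {num_patches} patches, "
      f"sequence length {sequence_length}")
print(f"{values_per_patch} pixel values per patch")
print(f"{steps_per_epoch} optimizer steps per epoch")
```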
Use Cases
- Human-Computer Interaction: Emotion-aware interfaces
- Mental Health Monitoring: Depression and anxiety screening
- Customer Service: Sentiment analysis in video calls
- Education: Student engagement monitoring
- Entertainment: Audience reaction analysis
- Security: Suspicious behavior detection
Limitations
- Static image analysis only (no temporal modeling)
- Seven basic emotions (no compound emotions or intensity levels)
- May inherit biases from source datasets
- Performance varies on extreme poses or occlusions
- Annotation inconsistencies possible across integrated datasets
- Requires face detection preprocessing: Images should contain cropped face regions, not full scenes
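Because the models expect cropped faces, a detector's bounding box usually needs to be turned into a crop region first. The sketch below is one possible way to do that, not the authors' preprocessing: `expand_face_box` is a hypothetical helper, the 20% margin is an arbitrary choice, and the box is assumed to come from any face detector as `(left, top, right, bottom)` pixel coordinates.

```python
def expand_face_box(box, image_size, margin=0.2):
    """Grow a face box into a square with extra margin, clamped to the image.

    box:        (left, top, right, bottom) in pixels, from any face detector.
    image_size: (width, height) of the full image.
    The result can be non-square when the face is close to an image edge.
    """
    left, top, right, bottom = box
    width, height = image_size
    side = max(right - left, bottom - top) * (1.0 + margin)
    center_x = (left + right) / 2.0
    center_y = (top + bottom) / 2.0
    half = side / 2.0
    return (
        max(0, int(round(center_x - half))),
        max(0, int(round(center_y - half))),
        min(width, int(round(center_x + half))),
        min(height, int(round(center_y + half))),
    )


# Example with a made-up detection in a 640x480 image.
crop_box = expand_face_box((100, 100, 200, 200), (640, 480))
print(crop_box)
```

The returned tuple uses the same `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so the cropped face can be produced with `image.crop(crop_box)` before it is passed to the processor.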
Citation
@inproceedings{jabnoun2026improving,
title={Improving Cross-Dataset Generalization in Facial Emotion Recognition Through FaceEmo-Set: A Balanced and Diverse Dataset},
author={Jabnoun, Jihed and Maraoui, Mohsen and Zrigui, Mounir},
booktitle={Asian Conference on Intelligent Information and Database Systems},
pages={355--369},
year={2026},
organization={Springer}
}
License
MIT License
Acknowledgments
This work was conducted at the Research Laboratory in Algebra, Numbers Theory and Intelligent Systems, University of Monastir, Tunisia.
Contact
- Jihed Jabnoun: jihed.jabnoun@gmail.com
- Mohsen Maraoui
- Mounir Zrigui


