# Face Mask Detection

Trained three object detection models to detect face masks on the Kaggle face mask detection dataset, comparing YOLOv11, Faster R-CNN, and RF-DETR to see which one works best.
## Quick start
```bash
pip install -r requirements.txt

# run inference on an image
python inference.py --image your_image.jpg

# run on a folder of images
python inference.py --folder ./test_images/ --save
```
The default model is Faster R-CNN, since it had the best accuracy. To use YOLO instead:

```bash
python inference.py --image test.jpg --model yolo
```
## What I tried

The main goal was to find the best model for face mask detection, so I trained multiple architectures and ran experiments to see what actually helps.
I compared three detection architectures: YOLOv11 (all five sizes, from nano to xlarge), Faster R-CNN with a ResNet50-FPN backbone, and RF-DETR, a newer transformer-based detector. Each architecture was trained on the same augmented dataset to keep the comparison fair. The augmentation pipeline uses Albumentations with horizontal flip, color jitter, random scaling, and Gaussian blur. I augmented the training set 3x (the original plus two augmented versions) while keeping the validation and test sets clean.
I also ran two additional experiments. First was image preprocessing: I applied CLAHE contrast enhancement and gray-world white balance before augmentation, thinking it would help with lighting variations in the images. Second was test-time augmentation (TTA): I ran inference multiple times with different transforms (flips, scales, rotations) and merged the predictions, testing TTA levels from 1x up to 8x transforms. Both experiments gave surprising results, which I'll explain below.
## Results
| Model | mAP@0.5 | F1 | Inference time |
|---|---|---|---|
| Faster R-CNN | 87.2% | 0.894 | 26ms |
| YOLOv11m | 80.7% | 0.779 | 39ms |
| YOLOv11x | 76.6% | 0.771 | 45ms |
| RF-DETR | 71.1% | 0.770 | 673ms |
Faster R-CNN won with 87.2% mAP, and in this benchmark it was also the fastest per image. The YOLO variants are somewhat less accurate, and RF-DETR is way too slow for real-time use.
Per-class breakdown for Faster R-CNN:

- with_mask: 94% mAP (easiest, most common class)
- without_mask: 94% mAP (also good)
- mask_weared_incorrect: 74% mAP (hardest, only 3% of the data)
## Things that didn't work

Being honest here about what failed, because I think it shows the experimentation process.
### Preprocessing hurt performance
I tried applying CLAHE and white balance before augmentation, thinking it would normalize lighting conditions across images, and reran the full training pipeline with the preprocessed data (called v3). It turned out to make things worse: Faster R-CNN dropped from 87.2% to 84.6% mAP, and YOLO dropped about 1.2%. The only model barely affected was RF-DETR. My guess is that the preprocessing removed texture information that was actually useful for detection, or that the augmentation was already handling lighting variation well enough.
### TTA was counterproductive
Test-time augmentation is supposed to improve accuracy by running inference multiple times with different transforms and merging the predictions. I tested eight levels of TTA (adding horizontal flip, vertical flip, scaling, and rotation). Instead of helping, it degraded all models significantly: Faster R-CNN went from 87% down to 30% mAP at 8x TTA, and even at 3x TTA it dropped to 56%. This was unexpected; maybe face detection is sensitive to orientation, or the NMS merging wasn't tuned properly for this task.
## Limitations

- Class imbalance is a problem: mask_weared_incorrect is only 3% of the data, so the model struggles with it
- RF-DETR is too slow for real-time use (673ms per image)
- Didn't try focal loss or class weighting to handle the imbalance
- Only tested on this specific dataset
## Next steps

If I had more time, I'd try:

- Focal loss to handle the class imbalance
- Collecting more data for the rare class
- Knowledge distillation to get a smaller, faster model
- Testing on different datasets to check generalization
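As a pointer for the focal-loss idea above, a minimal classification-style sketch (hypothetical, not wired into any of these detectors) looks like:

```python
# Minimal focal-loss sketch (hypothetical, not part of this repo).
# gamma down-weights easy examples, so rare classes like
# mask_weared_incorrect contribute relatively more to the loss.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```

Since `(1 - pt) ** gamma` is at most 1, this never exceeds plain cross-entropy; confident correct predictions are the ones it suppresses.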
## Project structure

```
├── inference.py          # run inference on images
├── requirements.txt      # dependencies
├── data.ipynb            # dataset prep
├── augment_data.ipynb    # augmentation pipeline
├── training-*.ipynb      # training notebooks for each model
├── tta_experiment.ipynb  # TTA experiments
├── benchmark_v3.ipynb    # final comparison
├── dataset/              # train/val/test splits
└── runs/                 # trained model weights
```
## Model weights

Weights are available on Hugging Face:

https://huggingface.co/ZhafranR/face-mask-detection-verihub

Download them into the `runs/` folder, or point to them with the `--weights` flag.
## Environment

- Python 3.8+
- PyTorch 2.0+
- GPU optional but recommended (tested on CUDA)

If you're running on CPU, just know inference will be slower, especially for RF-DETR.