Improve model card: add usage example, fix preprocessing details, expand limitations
Add an extended description including:
- Working code example using the transformers pipeline
- Updated preprocessing details with specific values (224x224, ImageNet normalization)
- Expanded limitations section with concrete details on dataset bias, class imbalance, skin tone bias, and input requirements
README.md
CHANGED
@@ -1,24 +1,17 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+---
 # Vision Transformer (ViT) for Facial Expression Recognition Model Card
 
 ## Model Overview
-
 - **Model Name:** [trpakov/vit-face-expression](https://huggingface.co/trpakov/vit-face-expression)
-
 - **Task:** Facial Expression/Emotion Recognition
-
 - **Dataset:** [FER2013](https://www.kaggle.com/datasets/msambare/fer2013)
-
 - **Model Architecture:** [Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)
-
 - **Finetuned from model:** [vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
 
 ## Model Description
-
 The vit-face-expression model is a Vision Transformer fine-tuned for the task of facial emotion recognition.
-
 It is trained on the FER2013 dataset, which consists of facial images categorized into seven different emotions:
 - Angry
 - Disgust
@@ -29,18 +22,44 @@
 - Neutral
 
 ## Data Preprocessing
-
 The input images are preprocessed before being fed into the model. The preprocessing steps include:
-- **Resizing:** Images are resized to
-- **Normalization:** Pixel values are normalized
+- **Resizing:** Images are resized to 224x224 pixels before being fed into the model.
+- **Normalization:** Pixel values are normalized using ImageNet mean and standard deviation.
 - **Data Augmentation:** Random transformations such as rotations, flips, and zooms are applied to augment the training dataset.
 
 ## Evaluation Metrics
-
 - **Validation set accuracy:** 0.7113
 - **Test set accuracy:** 0.7116
 
-##
-
+## Usage
+```python
+from transformers import pipeline
+from PIL import Image
+
+# Load the model
+pipe = pipeline("image-classification", model="trpakov/vit-face-expression")
+
+# Load an image (must contain a face)
+image = Image.open("your_image.jpg").convert("RGB")
+
+# Run inference
+results = pipe(image)
+
+# Output: list of dicts with 'label' and 'score'
+# Example: [{'label': 'happy', 'score': 0.98}, {'label': 'neutral', 'score': 0.01}, ...]
+print(results)
+```
+
+## Limitations
+- **Dataset bias:** FER2013 is collected from Google Image Search and is known to contain noisy and mislabelled samples, which affects model reliability.
+- **Class imbalance:** The dataset is heavily skewed toward "happy" and "neutral", making the model less reliable for underrepresented classes like "disgust" and "fear".
+- **Skin tone bias:** The model may perform worse on darker skin tones due to underrepresentation in the training data.
+- **Input requirements:** The model expects a cropped, frontal face image. Performance degrades significantly on profile faces, occluded faces, or images where the face is not the primary subject.
+- **Image size:** Input images are resized to 224x224 pixels internally.
+- **Real-world generalization:** Posed or exaggerated expressions in the training data can differ from natural, spontaneous expressions in the wild.