# EmotionCLIP Model

## Project Overview

EmotionCLIP is an open-domain multimodal emotion perception model built on CLIP. It aims to perform broad emotion recognition from inputs such as faces, scenes, and photographs, supporting the analysis of emotional attributes in images, scene layouts, and even artworks.
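
The model follows CLIP's zero-shot image-text matching scheme: candidate emotion prompts are scored against an image, and the highest-scoring prompt gives the predicted emotion. The sketch below illustrates that mechanism with the upstream openai/CLIP package and base ViT-B/32 weights; it is only an illustration of the idea, not this repository's fine-tuned model (see Usage Instructions for that), and the image path is hypothetical.

```python
# Illustration of CLIP-style zero-shot emotion scoring with the upstream
# openai/CLIP package and base ViT-B/32 weights (NOT this repo's fine-tuned
# weights). The prompt template and label set mirror the usage example below.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

emotions = ["amusement", "anger", "awe", "contentment",
            "disgust", "excitement", "fear", "sadness"]
text = clip.tokenize([f"This picture conveys a sense of {e}" for e in emotions]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path

with torch.no_grad():
    logits_per_image, _ = model(image, text)            # image-to-text similarity logits
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

print("Predicted emotion:", emotions[probs.argmax().item()])
```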

## Datasets

The model is trained on the following datasets:

1. **EmoSet**
   - Citation:

     ```
     @inproceedings{yang2023emoset,
       title={EmoSet: A Large-Scale Visual Emotion Dataset with Rich Attributes},
       author={Yang, Jingyuan and Huang, Qirui and Ding, Tingting and Lischinski, Dani and Cohen-Or, Danny and Huang, Hui},
       booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
       pages={20383--20394},
       year={2023}
     }
     ```

   - The dataset contains rich emotion labels and visual attributes, providing the foundation for broad emotion perception.

2. **Open Human Facial Emotion Recognition Dataset**
   - Contains nearly 10,000 emotion-labeled images gathered from in-the-wild scenes, used to enhance the model's facial emotion recognition capability.

## Fine-tuning Weights

This repository provides two sets of fine-tuned weights (an illustrative fine-tuning sketch follows the list):

1. **EmotionCLIP Weights**
   - Fine-tuned on the EmoSet-118K dataset, without additional training targeted at facial emotion recognition.
   - Final evaluation results:
     - Loss: 1.5687
     - Accuracy: 0.8037
     - Recall: 0.8037
     - F1: 0.8033

2. **MixCLIP Weights**
   - Integrates the roughly 20,000 face images and augments the data for the "calm" category, which is not included in EmoSet.
   - Because this category has only a small number of samples, the model's ability to recognize it is still limited.
   - Final evaluation results:
     - Loss: 1.5680
     - Accuracy: 0.8042
     - Recall: 0.8042
     - F1: 0.8057
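
The README does not spell out the training recipe itself. As a purely illustrative sketch: CLIP-style models are commonly fine-tuned for classification by pairing each image with a fixed prompt per label and applying cross-entropy to the image-to-text logits. Everything below (the upstream `clip` package, the prompt template, the optimizer settings, and the cross-entropy objective) is an assumption made for illustration, not the authors' actual training code.

```python
# Illustrative sketch of one prompt-based fine-tuning step for a CLIP-style
# emotion classifier. All choices here are assumptions, not this repository's
# actual training code.
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.float()  # train in fp32 for numerical stability

emotions = ["amusement", "anger", "awe", "contentment",
            "disgust", "excitement", "fear", "sadness"]
text_tokens = clip.tokenize(
    [f"This picture conveys a sense of {e}" for e in emotions]
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """images: preprocessed batch (B, 3, 224, 224); labels: class ids (B,)."""
    model.train()
    logits_per_image, _ = model(images.to(device), text_tokens)
    loss = F.cross_entropy(logits_per_image, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```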

## Usage Instructions

```bash
git clone https://huggingface.co/jiangchengchengNLP/EmotionCLIP
cd EmotionCLIP

# MixCLIP weights are used by default. Run the following Python code from this folder.
```

```python
from EmotionCLIP import model, preprocess, tokenizer
from PIL import Image
import torch
import matplotlib.pyplot as plt
import os
from torch.nn import functional as F

# Image folder path
image_folder = r'./test'
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]

# Emotion label mapping
consist_json = {
    'amusement': 0,
    'anger': 1,
    'awe': 2,
    'contentment': 3,
    'disgust': 4,
    'excitement': 5,
    'fear': 6,
    'sadness': 7,
    # 'neutral': 8
}
reversal_json = {v: k for k, v in consist_json.items()}
text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]
text_input = tokenizer(text_list)

# Create subplots (3x3 grid; adjust rows/cols if the folder holds more than 9 images)
num_images = len(image_files)
rows = 3  # 3 rows
cols = 3  # 3 columns
fig, axes = plt.subplots(rows, cols, figsize=(15, 10))  # adjust the canvas size
axes = axes.flatten()  # flatten the subplot grid to a 1D array
title_fontsize = 20

# Iterate through each image
for idx, img_path in enumerate(image_files):
    # Load and preprocess the image
    img = Image.open(img_path)
    img_input = preprocess(img)

    # Predict the emotion
    with torch.no_grad():
        logits_per_image, _ = model(
            img_input.unsqueeze(0).to(device=model.device, dtype=model.dtype),
            text_input.to(device=model.device)
        )
        softmax_logits_per_image = F.softmax(logits_per_image, dim=-1)
        top_k_values, top_k_indexes = torch.topk(softmax_logits_per_image, k=1, dim=-1)
        predicted_emotion = reversal_json[top_k_indexes.item()]

    # Display the image and its prediction
    ax = axes[idx]
    ax.imshow(img)
    ax.set_title(f"Predicted: {predicted_emotion}", fontsize=title_fontsize)
    ax.axis('off')

# Hide any unused subplots
for idx in range(num_images, rows * cols):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()
```
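
Continuing from the script above, the same objects (`model`, `preprocess`, `text_input`, `reversal_json`) can be wrapped in a small helper for single-image prediction. The function name and the example path below are hypothetical additions, not part of the repository:

```python
# Convenience wrapper around the snippet above; it reuses the already-defined
# model, preprocess, text_input and reversal_json. The function name and the
# example path are hypothetical.
def predict_emotion(img_path: str) -> str:
    img = Image.open(img_path)
    img_input = preprocess(img).unsqueeze(0).to(device=model.device, dtype=model.dtype)
    with torch.no_grad():
        logits_per_image, _ = model(img_input, text_input.to(device=model.device))
        probs = F.softmax(logits_per_image, dim=-1)
    return reversal_json[probs.argmax(dim=-1).item()]

# Example usage (hypothetical path):
# print(predict_emotion('./test/example.jpg'))
```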

## Result Display

The best evaluation results of the two models are shown below; a sketch of how these metrics can be computed follows the table.

| Metric   | EmotionCLIP | MixCLIP |
|----------|-------------|---------|
| Loss     | 1.5687      | 1.5680  |
| Accuracy | 0.8037      | 0.8042  |
| Recall   | 0.8037      | 0.8042  |
| F1       | 0.8033      | 0.8057  |
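
The identical Accuracy and Recall values are consistent with recall being averaged with class-support weighting, since support-weighted recall equals accuracy for single-label multi-class predictions. Below is a minimal sketch of computing the metrics this way with scikit-learn; the library choice and the weighted averaging are assumptions, not necessarily the authors' evaluation script.

```python
# Minimal sketch of the reported metrics, assuming scikit-learn and
# support-weighted averaging over the emotion classes; this is not
# necessarily the authors' evaluation script.
from sklearn.metrics import accuracy_score, f1_score, recall_score

def summarize(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Example with dummy labels over the 8 emotion classes:
print(summarize([0, 1, 2, 3, 7], [0, 1, 2, 4, 7]))
```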

## Existing Issues

The model faces significant challenges when it must recognize fine-grained human emotions and broad emotional attributes at the same time. It has to capture body language and subtle facial changes while also maintaining an overall perception of scenes and photo subjects, and these two objectives can compete with each other.

Specifically, the model often misclassifies the "disgust" category as sadness or anger, partly because human expressions of disgust tend to be ambiguous.

Moreover, the dataset's "disgust" category consists mainly of non-human images, which pushes the model toward global recognition and hinders its ability to capture the subtle cues of disgust.

In this experiment, we extended the emotion recognition task to an emotion perception task, requiring the model not only to perceive human emotional changes but also to sense emotions in the physical world. Although this goal is ambitious, we found that the model's emotion perception is still prone to hallucination, making it difficult to achieve stable, common-sense-based understanding.

### Summary

We explored broad-field emotion perception with CLIP on EmoSet and a supplementary facial dataset, providing two sets of fine-tuned weights (EmotionCLIP and MixCLIP). Many challenges remain in extending facial emotion recognition to broad-field emotion perception, including the conflict between fine-grained emotion capture and global emotion perception, as well as data imbalance.

---