# EmotionCLIP Model

![image/png](https://cdn-uploads.huggingface.co/production/uploads/662f655a02d973f5970ccbd3/NHoTlafm0uIv1gn1LbIvj.png)

## Project Overview

EmotionCLIP is an open-domain multimodal emotion perception model built on CLIP. It aims to perform broad emotion recognition on visual inputs such as faces, scenes, and photographs, and supports analyzing the emotional attributes of images, scene layouts, and even artworks.

## Datasets

The model was trained on the following datasets:

1. **EmoSet**
   - Citation:
     ```
     @inproceedings{yang2023emoset,
       title={EmoSet: A Large-Scale Visual Emotion Dataset with Rich Attributes},
       author={Yang, Jingyuan and Huang, Qirui and Ding, Tingting and Lischinski, Dani and Cohen-Or, Danny and Huang, Hui},
       booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
       pages={20383--20394},
       year={2023}
     }
     ```
   - This dataset contains rich emotion labels and visual features, providing a foundation for emotion perception.

2. **Open Human Facial Emotion Recognition Dataset**
   - Contains nearly 10,000 images with emotion labels gathered in the wild, used to strengthen the model's facial emotion recognition capability.

## Fine-tuning Weights

This repository provides two sets of fine-tuned weights (a loading sketch follows the list below):

1. **EmotionCLIP Weights**
   - Fine-tuned on the EmoSet-118K dataset, with no additional training specifically for facial emotion recognition.
   - Final evaluation results:
     - Loss: 1.5687
     - Accuracy: 0.8037
     - Recall: 0.8037
     - F1: 0.8033

2. **MixCLIP Weights**
   - Additionally trained on the 20,000 face images, with augmented data for the "calm" category, which is not covered by EmoSet.
   - Because this category still has few samples, the model's ability to recognize it remains limited.
   - Final evaluation results:
     - Loss: 1.5680
     - Accuracy: 0.8042
     - Recall: 0.8042
     - F1: 0.8057
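
The snippet below is a minimal sketch of how one might switch between the two checkpoints with plain PyTorch. The checkpoint file names are hypothetical placeholders (the files actually shipped in this repository may be named differently), and it assumes each checkpoint is a plain `state_dict` compatible with the `model` object exported by the `EmotionCLIP` package:

```python
import torch
from EmotionCLIP import model  # the fine-tuned CLIP model exported by this repository

# Hypothetical checkpoint file names; replace with the files actually shipped in the repo
CHECKPOINTS = {
    "EmotionCLIP": "emotionclip_weights.pt",
    "MixCLIP": "mixclip_weights.pt",
}

# Load the chosen weights (assumes the checkpoint is a plain state_dict)
state_dict = torch.load(CHECKPOINTS["MixCLIP"], map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```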

## Usage Instructions

```bash
git clone https://huggingface.co/jiangchengchengNLP/EmotionCLIP
cd EmotionCLIP

# MixCLIP weights are used by default. Run the following Python script from this folder.
```

```python
from EmotionCLIP import model, preprocess, tokenizer
from PIL import Image
import torch
import matplotlib.pyplot as plt
import os
from torch.nn import functional as F

# Image folder path
image_folder = r'./test'
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]

# Emotion label mapping (the eight EmoSet categories; "neutral" is reserved but unused)
consist_json = {
    'amusement': 0,
    'anger': 1,
    'awe': 2,
    'contentment': 3,
    'disgust': 4,
    'excitement': 5,
    'fear': 6,
    'sadness': 7,
    # 'neutral': 8
}
reversal_json = {v: k for k, v in consist_json.items()}

# Build one prompt per emotion and tokenize them all at once
text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]
text_input = tokenizer(text_list)

# Create a 3x3 grid of subplots; only the first rows * cols images are displayed
rows = 3
cols = 3
image_files = image_files[:rows * cols]
num_images = len(image_files)
fig, axes = plt.subplots(rows, cols, figsize=(15, 10))
axes = axes.flatten()  # Flatten the subplot grid to a 1D array
title_fontsize = 20

# Iterate through each image
for idx, img_path in enumerate(image_files):
    # Load and preprocess the image
    img = Image.open(img_path)
    img_input = preprocess(img)

    # Predict the emotion: softmax over the image-text similarity logits
    with torch.no_grad():
        logits_per_image, _ = model(
            img_input.unsqueeze(0).to(device=model.device, dtype=model.dtype),
            text_input.to(device=model.device),
        )
        softmax_logits_per_image = F.softmax(logits_per_image, dim=-1)
        top_k_values, top_k_indexes = torch.topk(softmax_logits_per_image, k=1, dim=-1)
        predicted_emotion = reversal_json[top_k_indexes.item()]

    # Display the image with its predicted label
    ax = axes[idx]
    ax.imshow(img)
    ax.set_title(f"Predicted: {predicted_emotion}", fontsize=title_fontsize)
    ax.axis('off')

# Hide any unused subplots
for idx in range(num_images, rows * cols):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()
```
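
For programmatic use, the same pipeline condenses into a small helper. The sketch below reuses the objects defined in the script above (`model`, `preprocess`, `text_input`, `reversal_json`); the helper name and the example image path are illustrative, not part of the repository's API:

```python
def predict_emotion(img_path: str) -> dict:
    """Return an {emotion: probability} mapping for a single image."""
    img = Image.open(img_path)
    img_input = preprocess(img).unsqueeze(0).to(device=model.device, dtype=model.dtype)
    with torch.no_grad():
        logits_per_image, _ = model(img_input, text_input.to(device=model.device))
        probs = F.softmax(logits_per_image, dim=-1).squeeze(0)
    return {reversal_json[i]: probs[i].item() for i in range(len(probs))}

# Example usage: print all emotion scores for one image, sorted by confidence
scores = predict_emotion('./test/example.jpg')  # illustrative path
for emotion, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{emotion:12s} {p:.3f}")
```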

## Result Display

The best evaluation results of the two models are shown below:

| Metric   | EmotionCLIP | MixCLIP |
|----------|-------------|---------|
| Loss     | 1.5687      | 1.5680  |
| Accuracy | 0.8037      | 0.8042  |
| Recall   | 0.8037      | 0.8042  |
| F1       | 0.8033      | 0.8057  |
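
That the recall column equals accuracy while F1 differs is consistent with weighted averaging over the eight classes (weighted-average recall reduces to accuracy). A minimal sketch of reproducing such metrics with scikit-learn, under that averaging assumption:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the metrics reported above; 'weighted' averaging is an assumption."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
```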

## Existing Issues

Recognizing fine-grained human emotions and broad emotional attributes at the same time poses significant challenges. The model must capture body language and subtle facial changes while maintaining a holistic perception of scenes and photo subjects, and these two objectives can compete with each other.

Specifically, the model often misclassifies the "disgust" category as sadness or anger, partly because human expressions of disgust tend to be ambiguous.

Moreover, the dataset's "disgust" category consists mainly of non-human images, which biases the model toward global recognition and hinders its ability to capture the subtle cues of disgust in faces.

In this experiment, we extended the emotion recognition task to an emotion perception task, requiring the model not only to perceive human emotional changes but also to sense emotions in the physical world. Although this goal is challenging, we found that the model's emotion perception is still driven by spurious associations, making stable, common-sense-based understanding difficult to achieve.

### Summary

We explored broad-field emotion perception using CLIP on EmoSet and a partial facial dataset, providing two sets of fine-tuned weights (EmotionCLIP and MixCLIP). Many challenges remain in expanding from facial emotion recognition to broad-field emotion perception, including the conflict between fine-grained emotion capture and global emotion perception, as well as data imbalance.

---