jiangchengchengNLP committed a857c64 (verified) · Parent(s): d64c8ff

Update README.md

Files changed (1): README.md (+12 -37)
README.md CHANGED
@@ -38,28 +38,20 @@ The model is trained using the following datasets:

  ## training method

- Prefix-Tuning

  ## Fine-tuning Weights

- This repository provides two fine-tuned weights:

- 1. **EmotionCLIP Weights**
  - Fine-tuned on the EmoSet 118K dataset, without additional training specifically for facial emotion recognition.
  - Final evaluation results:
- - Loss: 1.5687
- - Accuracy: 0.8037
- - Weighted Recall: 0.8037
- - F1: 0.8033
-
- 2. **MixCLIP Weights**
- - Integrates the 10,000 face images and enhances the data for the neutral category, which is not included in EmoSet.
- - Due to the small number of samples in this category, the model's recognition ability remains inadequate.
- - Final evaluation results:
- - Loss: 1.5680
- - Accuracy: 0.8042
- - Recall: 0.8042
- - F1: 0.8057

  ## Usage Instructions

@@ -80,7 +72,7 @@ import os
  from torch.nn import functional as F

  # Image folder path
- image_folder = r'./test'
  image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]

  # Emotion label mapping
@@ -93,7 +85,7 @@ consist_json = {
  'excitement': 5,
  'fear': 6,
  'sadness': 7,
- #'neutral': 8
  }
  reversal_json = {v: k for k, v in consist_json.items()}
  text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]
@@ -134,29 +126,12 @@ plt.tight_layout()
  plt.show()
  ```

- ## Result Display
-
- The best evaluation results of the model are shown below:
-
- | Metric | EmotionCLIP | MixCLIP |
- |----------|------------------|------------------|
- | Loss | 1.5687 | 1.5680 |
- | Accuracy | 0.8037 | 0.8042 |
- | Recall | 0.8037 | 0.8042 |
- | F1 | 0.8033 | 0.8057 |
-
  ## Existing Issues

- When recognizing fine-grained human emotions and broad emotional attributes, the model faces significant challenges. It must simultaneously capture human body language and subtle facial changes while maintaining an overall perception of scenes and photo subjects, which can lead to competitive cognition.
-
- Specifically, for the “disgust” category, the model often misclassifies it as sadness or anger, partly because human expressions of disgust tend to be unclear.
-
- Moreover, the dataset’s "disgust" category contains mainly non-human images, causing the model to favor global recognition, which hinders its ability to capture the subtle differences in disgust.
-
- In this experiment, we extended the emotion recognition task to an emotion perception task, requiring the model to not only perceive human emotional changes but also possess the ability to generate emotions from the physical world. Although this goal is exciting, we found that the model's emotion going remains driven by illusions, making it difficult to achieve stable, common-sense-based understanding.

  ### Summary

- We explored the broad field of emotion perception using CLIP on the EmoSet and partial facial datasets, providing two fine-tuned weights (EmosetCLIP and MixCLIP). However, there are still many challenges in expanding from facial emotion recognition to broad-field emotion perception, including the conflict between fine-grained emotion capture and global emotion perception, as well as issues related to data imbalance.

  ---
 
  ## training method

+ This model combines three parameter-efficient fine-tuning methods: layer_norm tuning, prefix tuning, and prompt tuning. In practice, the mixture of the three introduces only a small number of trainable parameters yet matches or even exceeds full fine-tuning on generalized visual emotion recognition. In addition, because the layer norm parameters are tuned as well, training converges faster than with prefix tuning or prompt tuning alone, and the resulting model outperforms EmotionCLIP-V1.
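
Below is a minimal sketch of what such a hybrid setup could look like in PyTorch. It uses the base `openai/clip-vit-base-patch32` checkpoint from `transformers` as a stand-in backbone; the prefix/prompt lengths and the learning rate are illustrative assumptions, and the repository's actual training code may differ.

```python
# Illustrative sketch only (stand-in backbone, assumed prefix/prompt lengths);
# not the repository's actual training code.
import torch
import torch.nn as nn
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# layer_norm tuning: freeze everything, then re-enable only LayerNorm parameters.
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, nn.LayerNorm):
        for p in m.parameters():
            p.requires_grad = True

# prompt tuning: learnable tokens that would be prepended to the text embeddings.
# prefix tuning: learnable per-layer vectors that would be injected into attention.
# (The injection logic itself is omitted; shapes and lengths are illustrative.)
text_dim = model.config.text_config.hidden_size
num_layers = model.config.text_config.num_hidden_layers
prompt_tokens = nn.Parameter(0.02 * torch.randn(8, text_dim))
prefix_tokens = nn.Parameter(0.02 * torch.randn(num_layers, 8, text_dim))

trainable = [p for p in model.parameters() if p.requires_grad]
trainable += [prompt_tokens, prefix_tokens]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")
```

Only the LayerNorm weights plus the small prompt and prefix tensors are passed to the optimizer, which is what keeps the trainable parameter count low.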

  ## Fine-tuning Weights

+ This repository provides one set of fine-tuned weights:
+ 1. **EmotionCLIP-V2 Weights**
  - Fine-tuned on the EmoSet 118K dataset, without additional training specifically for facial emotion recognition.
  - Final evaluation results:
+ - Loss: 1.5465
+ - Accuracy: 0.8256
+ - Macro Recall: 0.7803
+ - F1: 0.8235
+
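
For reference, these numbers correspond to standard classification metrics. The short sketch below uses toy `y_true` / `y_pred` labels (and assumes weighted F1 averaging, which the report does not state) to show how they could be computed with scikit-learn.

```python
# Toy labels only; the reported numbers come from the EmoSet evaluation split.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]  # ground-truth emotion ids (toy data)
y_pred = [0, 1, 2, 3, 4, 5, 6, 0, 8, 0]  # model predictions (toy data)

print("accuracy     :", accuracy_score(y_true, y_pred))
print("macro recall :", recall_score(y_true, y_pred, average="macro"))
print("weighted f1  :", f1_score(y_true, y_pred, average="weighted"))
```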

  ## Usage Instructions

  from torch.nn import functional as F

  # Image folder path
+ image_folder = r'./test'  # test images are available in the EmotionCLIP repo: jiangchengchengNLP/EmotionCLIP
  image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]

  # Emotion label mapping
  'excitement': 5,
  'fear': 6,
  'sadness': 7,
+ 'neutral': 8
  }
  reversal_json = {v: k for k, v in consist_json.items()}
  text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]

  plt.show()
  ```
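
As a usage illustration, here is a minimal sketch of the zero-shot scoring step that the snippet above builds toward. It uses the base `openai/clip-vit-base-patch32` checkpoint from `transformers` as a stand-in (loading the fine-tuned EmotionCLIP-V2 weights may require the model code from the jiangchengchengNLP/EmotionCLIP repo instead), fills in the first five label ids from EmoSet's standard categories as an assumption, and uses a hypothetical image path.

```python
# Sketch only: stand-in checkpoint, assumed ids for the first five labels,
# and a hypothetical image path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

consist_json = {'amusement': 0, 'anger': 1, 'awe': 2, 'contentment': 3, 'disgust': 4,
                'excitement': 5, 'fear': 6, 'sadness': 7, 'neutral': 8}
reversal_json = {v: k for k, v in consist_json.items()}
text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]

image = Image.open('./test/example.jpg')  # hypothetical file name
inputs = processor(text=text_list, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits.softmax(dim=-1)
pred_id = probs.argmax(dim=-1).item()
print(reversal_json[pred_id], float(probs[0, pred_id]))
```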
  ## Existing Issues
+ After the neutral category was introduced, the hybrid fine-tuning method improved the model by 2% on the prediction task, but this category still brings noise that interferes with emotion recognition in other scenes. Prompt tuning is the key to surpassing full fine-tuning, while layer norm tuning makes training converge faster. There are also drawbacks: after mixing this many fine-tuning methods, the generalization ability of the model declines seriously, and recognition of the difficult categories, disgust and anger, has not improved. Although I deliberately added some images of humans expressing disgust, the results are still below expectations. A high-quality, large-scale visual emotion dataset is therefore still needed: the model's performance appears limited by training data that is far smaller than the pre-training corpus, and breakthroughs in model architecture would also help with this problem.

  ### Summary

+ I proposed a hybrid layer_norm / prefix-tuning / prompt-tuning method for efficiently fine-tuning CLIP. It converges faster and performs comparably to full fine-tuning, although the loss of generalization ability remains a serious problem. I released EmosetCLIP-V2, trained with this method; compared with EmosetCLIP-V1 it adds a neutral category and performs slightly better. Future work aims to expand the training data for difficult categories and to optimize the model architecture.
  ---