CesarWKR1 committed on
Commit 6e1eb54 · verified · 1 Parent(s): 1de9d90

Add an explanation of the dataset used to train the model

Files changed (1)
  1. README.md +33 -37
README.md CHANGED
@@ -1,49 +1,42 @@
- ---
  language:
- - en
-
  license: mit
-
  tags:
- - sentiment-analysis
- - roberta
- - transformers
- - pytorch
- - fastapi
- - multilingual
- - docker
- - kafka
- - nlp
-
  pipeline_tag: text-classification
-
  library_name: transformers
-
  datasets:
- - Reddit
-
  metrics:
- - accuracy
- - f1
-
  model-index:
- - name: sentiment-roberta
-   results:
-   - task:
-       type: text-classification
-       name: Sentiment Analysis
-     dataset:
-       name: Reddit Sentiment Dataset
-       type: custom
-     metrics:
-     - type: accuracy
-       value: 0.8709
-       name: Accuracy
-     - type: f1
-       value: 0.8715
-       name: Weighted F1 Score
  ---

  ## 📌 Project Philosophy

  This project intentionally preserves a controlled amount of real-world noise inside the final training dataset instead of aggressively sanitizing every sample.
@@ -305,9 +298,12 @@ Database (cleaned_data)
  (Others)
  → relabeled_data:
  → balanced_data: Balance dataset distribution through oversampling, downsampling or both
- → combined: Combine relabeled and cleaned_data
  → synthetic_data: Data generated by GPT-2 for oversampling


  ## 🧹 4. Text Cleaning, Data Augmentation & Dataset Balancing

  language:
+ - en
  license: mit
  tags:
+ - sentiment-analysis
+ - roberta
+ - transformers
+ - pytorch
+ - fastapi
+ - multilingual
+ - docker
+ - kafka
+ - nlp
  pipeline_tag: text-classification
  library_name: transformers
  datasets:
+ - Reddit
  metrics:
+ - accuracy
+ - f1
  model-index:
+ - name: sentiment-roberta
+   results:
+   - task:
+       type: text-classification
+       name: Sentiment Analysis
+     dataset:
+       name: Reddit Sentiment Dataset
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 0.8709
+       name: Accuracy
+     - type: f1
+       value: 0.8715
+       name: Weighted F1 Score
  ---

+
  ## 📌 Project Philosophy

  This project intentionally preserves a controlled amount of real-world noise inside the final training dataset instead of aggressively sanitizing every sample.
 
  (Others)
  → relabeled_data:
  → balanced_data: Balance dataset distribution through oversampling, downsampling or both
+ → combined: Combine relabeled (raw data) and cleaned data
  → synthetic_data: Data generated by GPT-2 for oversampling

+ The dataset used to train the model is a mix of the balanced and combined data,
+ which is called balanced combined.
+

  ## 🧹 4. Text Cleaning, Data Augmentation & Dataset Balancing
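
The balancing and combining steps named in the pipeline listing can be illustrated with a small sketch. It is not the project's code: the `combine` and `balance` helpers, the toy rows, and the reading of balanced combined as "combine first, then resample each label to a common size" are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def combine(relabeled, cleaned):
    """Concatenate the two sources into one list of (text, label) rows."""
    return list(relabeled) + list(cleaned)


def balance(rows, target, seed=0):
    """Resample every label group to exactly `target` rows.

    Groups larger than `target` are downsampled without replacement;
    smaller groups keep all their rows and are oversampled by drawing
    extra rows with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) >= target:
            balanced.extend(rng.sample(group, target))
        else:
            extra = rng.choices(group, k=target - len(group))
            balanced.extend(group + extra)
    return balanced


# Hypothetical toy rows standing in for the relabeled (raw) and cleaned data.
relabeled = [("gr8 movie!!", "positive"), ("meh", "neutral")]
cleaned = [
    ("great movie", "positive"),
    ("terrible", "negative"),
    ("awful plot", "negative"),
    ("bad", "negative"),
]

combined = combine(relabeled, cleaned)
balanced_combined = balance(combined, target=2)
label_counts = Counter(label for _, label in balanced_combined)
print(label_counts)
```

In this sketch the oversampling step only duplicates existing rows. The README states that the project generates synthetic samples with GPT-2 for oversampling, which this example does not attempt to reproduce.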