SkyNet-DL/sentiment-roberta

@@ -1,49 +1,42 @@
----
 language:
-- en
 license: mit
 tags:
-- sentiment-analysis
-- roberta
-- transformers
-- pytorch
-- fastapi
-- multilingual
-- docker
-- kafka
-- nlp
 pipeline_tag: text-classification
 library_name: transformers
 datasets:
-- Reddit
 metrics:
-- accuracy
-- f1
 model-index:
-- name: sentiment-roberta
-  results:
-  - task:
-      type: text-classification
-      name: Sentiment Analysis
-    dataset:
-      name: Reddit Sentiment Dataset
-      type: custom
-    metrics:
-    - type: accuracy
-      value: 0.8709
-      name: Accuracy
-    - type: f1
-      value: 0.8715
-      name: Weighted F1 Score
 ---
 ## 📌 Project Philosophy
 This project intentionally preserves a controlled amount of real-world noise inside the final training dataset instead of aggressively sanitizing every sample.
@@ -305,9 +298,12 @@ Database (cleaned_data)
 (Others)
    → relabeled_data:
    → balanced_data:   Balance dataset distribution through oversampling, downsampling or both
-   → combined:    Combine relabeled and cleaned_data
    → synthetic_data:  Data generated by GPT-2 for oversampling
 ## 🧹 4. Text Cleaning, Data Augmentation & Dataset Balancing

 language:
+  - en
 license: mit
 tags:
+  - sentiment-analysis
+  - roberta
+  - transformers
+  - pytorch
+  - fastapi
+  - multilingual
+  - docker
+  - kafka
+  - nlp
 pipeline_tag: text-classification
 library_name: transformers
 datasets:
+  - Reddit
 metrics:
+  - accuracy
+  - f1
 model-index:
+  - name: sentiment-roberta
+    results:
+      - task:
+          type: text-classification
+          name: Sentiment Analysis
+        dataset:
+          name: Reddit Sentiment Dataset
+          type: custom
+        metrics:
+          - type: accuracy
+            value: 0.8709
+            name: Accuracy
+          - type: f1
+            value: 0.8715
+            name: Weighted F1 Score
 ---
 ## 📌 Project Philosophy
 This project intentionally preserves a controlled amount of real-world noise inside the final training dataset instead of aggressively sanitizing every sample.
 (Others)
    → relabeled_data:
    → balanced_data:   Balance dataset distribution through oversampling, downsampling or both
+   → combined:   Combine relabeled (raw data) and cleaned data
    → synthetic_data:  Data generated by GPT-2 for oversampling
+The dataset used to train the model is a mix of the balanced and combined data,
+referred to as balanced combined.
 ## 🧹 4. Text Cleaning, Data Augmentation & Dataset Balancing
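The combining and balancing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names `combine` and `balance`, the `(text, label)` row shape, the label names in the usage note, and the de-duplication step are all assumptions. It also oversamples by duplicating existing rows, whereas the card states that the project uses GPT-2-generated synthetic data for oversampling.

```python
import random
from collections import defaultdict


def combine(relabeled, cleaned):
    """Concatenate two lists of (text, label) rows.

    Dropping exact duplicates is an assumption made for this sketch;
    the model card does not say how overlaps are handled.
    """
    seen = set()
    combined = []
    for row in relabeled + cleaned:
        if row not in seen:
            seen.add(row)
            combined.append(row)
    return combined


def balance(samples, target_per_class, seed=0):
    """Return exactly target_per_class rows for every label.

    Classes above the target are downsampled without replacement.
    Classes below the target are oversampled by duplicating rows
    (a stand-in for the GPT-2 synthetic data mentioned in the card).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    balanced = []
    for rows in by_label.values():
        if len(rows) >= target_per_class:
            # Downsample: pick a random subset of the majority class.
            balanced.extend(rng.sample(rows, target_per_class))
        else:
            # Oversample: keep all rows and draw extra copies with replacement.
            extra = rng.choices(rows, k=target_per_class - len(rows))
            balanced.extend(rows + extra)
    rng.shuffle(balanced)
    return balanced
```

For example, with hypothetical labels, `balance(combine(relabeled, cleaned), 2)` returns two rows per label regardless of how skewed the inputs were. Fixing the seed keeps the sampled subset reproducible between runs.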