Update README.md
README.md (CHANGED)

@@ -50,18 +50,18 @@ For each topic, the model assigns one of four sentiment labels: **unknown, negat
 
 The training data was aggregated from multiple sources:
 
-| Data Source                           | N     |
-|---------------------------------------|-------|
-| English Newspaper                     | 5912  |
-| English Newspaper Comments (Facebook) | 8471  |
-| Malay Newspaper                       | 5254  |
-| Chinese Newspaper                     | 2480  |
-| Tamil Newspaper                       | 1512  |
-| Reddit                                | 20000 |
-| Manifesto BN                          | 98    |
-| Manifesto PH                          | 180   |
-| Manifesto PN                          | 15    |
-| Synthetic Data                        | 4124  |
+| Data Source                           | N     | Labeling Method                                    |
+|---------------------------------------|-------|----------------------------------------------------|
+| English Newspaper                     | 5912  | BERT (MyPoliBERT-ver03 was used)                   |
+| English Newspaper Comments (Facebook) | 8471  | BERT                                               |
+| Malay Newspaper                       | 5254  | OpenAI API (translated to English then classified) |
+| Chinese Newspaper                     | 2480  | OpenAI API (translated to English then classified) |
+| Tamil Newspaper                       | 1512  | OpenAI API (translated to English then classified) |
+| Reddit                                | 20000 | BERT (MyPoliBERT-ver03 was used)                   |
+| Manifesto BN                          | 98    | OpenAI API                                         |
+| Manifesto PH                          | 180   | OpenAI API                                         |
+| Manifesto PN                          | 15    | OpenAI API                                         |
+| Synthetic Data                        | 4124  | OpenAI API                                         |
 
 **NOTE**: The originally aggregated dataset, which included data from various sources (such as English newspapers, Facebook comments, Malay, Chinese, and Tamil newspapers, Reddit, manifestos, and synthetic data), contained some noise and misclassifications; after removing these noisy entries, 47,966 clean data points were used for training.
 
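The "translated to English then classified" labeling step in the new table can be sketched as below. This is an illustrative sketch only: the helper names, prompt wording, label list, and parsing rule are assumptions, not the repository's actual pipeline, and the real prompt would be sent through the OpenAI API rather than handled locally.

```python
# Illustrative sketch of the "translated to English then classified" labeling
# step from the table above. All names, the prompt wording, and the parsing
# rule are assumptions for illustration; the actual pipeline sends the prompt
# through the OpenAI API.

def build_label_prompt(topic: str, text: str, labels: list[str]) -> str:
    """Build a single prompt asking the model to translate the text to
    English (if needed) and then pick exactly one sentiment label."""
    return (
        "Translate the following text to English if it is not already, "
        f"then classify the sentiment it expresses toward '{topic}'. "
        f"Reply with exactly one of: {', '.join(labels)}.\n\n"
        f"Text: {text}"
    )

def parse_label(reply: str, labels: list[str], default: str = "unknown") -> str:
    """Map a raw model reply onto one of the allowed labels; any reply that
    matches no label falls back to the default."""
    cleaned = reply.strip().lower()
    for label in labels:
        if label.lower() in cleaned:
            return label
    return default

# Where the real pipeline would call the OpenAI API with `prompt`, this
# sketch only shows the surrounding plumbing:
labels = ["unknown", "negative"]  # two of the four labels named in the README
prompt = build_label_prompt("the economy", "sample non-English text", labels)
print(parse_label("Label: negative", labels))  # -> negative
```

Folding translation and classification into one prompt (rather than two separate API calls) is one plausible reading of the table's description; the parsing fallback to a default label mirrors having an explicit **unknown** class.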