Update README.md
README.md
CHANGED
@@ -82,112 +82,6 @@ The training data was aggregated from multiple sources:
OpenAI API labeling was performed by combining Human-in-the-loop machine learning—where prompt engineering was applied to select the most accurate prompt—with the OpenAI API (gpt-4o-mini) to generate labels.


-#### Translation Prompt
-```
-You are a professional translator.
-Your task is to translate the given text into English if it is not already in English.
-The text originates from Malaysian news articles and public commentary. Therefore, please pay extra attention to local language expressions, proper names, abbreviations, and cultural nuances that are specific to Malaysia.
-The possible source languages are Malay, Chinese, and Tamil.
-If the text is already in English, simply return None.
-
-Text: {text}
-
-Output the result in the following JSON format:
-{
-"translated_text": "<Translated text or None>"
-}
-```
-
-#### Classification Prompt
-```
-INSTRUCTION
-You are a classifier focusing on Malaysian news articles.
-Classify the following text according to 12 topics, and for each topic,
-assign exactly one of [unknown, negative, neutral, positive].
-
-The 12 topics are:
-1. democracy
-2. economy
-3. race
-4. leadership
-5. development
-6. corruption
-7. political instability
-8. safety
-9. administration
-10. education
-11. religion
-12. environment
-
-GUIDELINES:
-- If the article does not mention or imply anything about the topic, label it as "unknown".
-- If the article mentions the topic in a negative or critical way, label it as "negative".
-- If the article mentions the topic without clear negativity or positivity, label it as "neutral".
-- If the article mentions the topic in a clearly positive or supportive way, label it as "positive".
-- Return only the JSON object with the 12 keys, no extra explanation or text.
-
-EXAMPLES:
-[Example1]
-1. Race: Malaysia's rich tapestry of cultures and ethnicities fosters a vibrant society where diversity is celebrated, promoting unity and mutual respect among its people.
-2. Economy: The Malaysian economy is showing resilience and adaptability, with innovative sectors emerging that promise sustainable growth and increased opportunities for all citizens.
-3. Development: Malaysia's commitment to sustainable development is evident in its investment in green technologies and infrastructure, paving the way for a brighter and more sustainable future for generations to come.
-Classify as:
-{{
-"democracy": "unknown",
-"economy": "positive"
-"race" "positive",
-"leadership": "unknown",
-"development: "positive",
-"corruption: "unknown",
-"political instability: "unknown",
-"safety: "unknown",
-"administration: "unknown",
-"education: "unknown",
-"religion: "unknown",
-"environment: "unknown"
-}}
-
-[Example2]
-Corruption remains a significant challenge in Malaysia, influencing various sectors and prompting ongoing discussions about governance and accountability. Addressing this issue is vital for fostering trust in public institutions and promoting sustainable development. #Malaysia #Corruption
-Classify as:
-{{
-"democracy": "unknown",
-"economy": "unknown",
-"race" "unknown",
-"leadership": "unknown",
-"development: "unknown",
-"corruption: "neutral",
-"political instability: "unknown",
-"safety: "unknown",
-"administration: "unknown",
-"education: "unknown",
-"religion: "unknown",
-"environment: "unknown"
-}}
-
-[Example3]
-The 13 May incident was an episode of Sino-Malay sectarian violence that took place in Kuala Lumpur, the capital of Malaysia, on 13 May 1969. The riot occurred in the aftermath of the 1969 Malaysian general election when opposition parties such as the Democratic Action Party and Gerakan made gains at the expense of the ruling coalition, the Alliance Party.
-Classify as:
-{{
-"democracy": "unknown",
-"economy": "unknown",
-"race" "negative",
-"leadership": "unknown",
-"development: "unknown",
-"corruption: "unknown",
-"political instability: "negative",
-"safety: "unknown",
-"administration: "unknown",
-"education: "unknown",
-"religion: "unknown",
-"environment: "unknown"
-}}
-
-
-TEXT:
-{text}
-```
-
 #### Synthetic Data via Data Augmentation
 - **Method**: Synthetic data was generated to balance the dataset by augmenting underrepresented labels or sentiments.
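Note on the removed classification prompt: the doubled braces (`{{`, `}}`) next to a single `{text}` placeholder suggest the template was filled with Python's `str.format`, although the diff does not confirm this. The few-shot examples are also not valid JSON as written (missing colons, commas, and closing quotes on several keys), so a strict parser would reject output that copied them. The sketch below is a minimal illustration under those assumptions. `PROMPT_TEMPLATE`, `build_prompt`, and `parse_labels` are hypothetical names, and the template is a shortened stand-in rather than the repository's prompt.

```python
import json

# The 12 topics and 4 labels listed in the removed classification prompt.
TOPICS = [
    "democracy", "economy", "race", "leadership", "development", "corruption",
    "political instability", "safety", "administration", "education",
    "religion", "environment",
]
LABELS = {"unknown", "negative", "neutral", "positive"}

# Hypothetical, shortened stand-in for the prompt. Literal braces are doubled
# so that str.format leaves them in place and only fills {text}.
PROMPT_TEMPLATE = (
    "Classify the following text according to 12 topics.\n"
    'Return only a JSON object such as {{"democracy": "unknown"}}.\n'
    "TEXT:\n{text}\n"
)


def build_prompt(text: str) -> str:
    """Fill the {text} placeholder; doubled braces become single braces."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_labels(raw: str) -> dict[str, str]:
    """Parse a model reply and check it has exactly the 12 keys and valid labels.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a subclass)
    or on missing keys, unexpected keys, or unknown label values.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [t for t in TOPICS if t not in data]
    extra = [k for k in data if k not in TOPICS]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}, unexpected keys: {extra}")
    bad = {k: v for k, v in data.items()
           if not isinstance(v, str) or v not in LABELS}
    if bad:
        raise ValueError(f"invalid labels: {bad}")
    return data


if __name__ == "__main__":
    print(build_prompt("Corruption remains a significant challenge."))
    reply = json.dumps({t: "unknown" for t in TOPICS})
    print(parse_labels(reply)["corruption"])
```

Validating the reply this way would catch a response that mirrors the malformed example objects instead of silently storing partial labels.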