Update README.md
README.md CHANGED
@@ -158,16 +158,26 @@ In English, our model is 46% as good as Llama-2-13b-chat, even though it did not

 Compared with ChatGPT-3.5, our SeaLLM-13b model performs 45% as well as ChatGPT for Thai.
 For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across the languages.
-Note that **GPT-4
-Using GPT-4 to evaluate ChatGPT-3.5 can also be tricky not only for safety aspects because they likely follow a similar training strategy with similar data.
-Meanwhile, most of the safety-related questions and expected responses in this test set are globally acceptable,
-whereas we leave those with conflicting and controversial opinions, as well as more comprehensive human evaluation for future update.
+Note that using **GPT-4** to evaluate ChatGPT-3.5 can also be tricky, and not only for safety aspects, because both models likely follow a similar training strategy with similar data.

 <div class="row" style="display: flex; clear: both;">
   <img src="seallm_vs_chatgpt_by_lang.png" alt="SeaLLM vs ChatGPT by language" style="float: left; width: 49.5%">
   <img src="seallm_vs_chatgpt_by_cat_sea.png" alt="SeaLLM vs ChatGPT by category" style="float: left; width: 49.5%">
 </div>

+**GPT-4**, which was built for global use, may not consider certain safety-related responses harmful or sensitive in the local context,
+and certain sensitive topics may entail conflicting and controversial opinions across cultures.
+We therefore engage native linguists to rate and compare SeaLLM's and ChatGPT's responses on a natural and locally aware safety test set.
+The linguists choose a winner or a tie in a fully randomized, double-blind manner: neither we nor the linguists know which model produced which response.
+
+As shown in the human evaluation below, SeaLLM ties with ChatGPT in most cases, while outperforming ChatGPT for Vi and Th.
+
+| Safety Human Eval | Id     | Th     | Vi     | Avg    |
+|-------------------|--------|--------|--------|--------|
+| SeaLLM-13b Win    | 12.09% | 23.40% | 8.42%  | 14.64% |
+| Tie               | 65.93% | 67.02% | 89.47% | 74.29% |
+| ChatGPT Win       | 21.98% | 9.57%  | 2.11%  | 11.07% |
+
 ### M3Exam - World Knowledge in Regional Languages
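The double-blind comparison added in this commit can be sketched in code. This is a hypothetical illustration, not SeaLLM's actual evaluation tooling: the function names, rating labels, and toy data below are all assumptions. Only the core idea comes from the text above: shuffle the order of the two responses per item, hide which model wrote which, and reveal the key only after the linguists have rated.

```python
import random

# Hypothetical sketch of a double-blind pairwise evaluation.
# Raters see options "A" and "B" per question; the mapping back to
# models is kept in a hidden key and applied only during tallying.

def blind_pairs(items, seed=0):
    """Shuffle (SeaLLM, ChatGPT) response order per item; keep a hidden key."""
    rng = random.Random(seed)
    blinded, key = [], []
    for question, seallm_resp, chatgpt_resp in items:
        pair = [("seallm", seallm_resp), ("chatgpt", chatgpt_resp)]
        rng.shuffle(pair)  # randomize which model appears as option A
        blinded.append((question, pair[0][1], pair[1][1]))
        key.append((pair[0][0], pair[1][0]))  # revealed only after rating
    return blinded, key

def tally(ratings, key):
    """ratings[i] is 'A', 'B', or 'tie' for blinded item i."""
    counts = {"seallm": 0, "chatgpt": 0, "tie": 0}
    for rating, (model_a, model_b) in zip(ratings, key):
        if rating == "tie":
            counts["tie"] += 1
        else:
            counts[model_a if rating == "A" else model_b] += 1
    return counts

# Toy data standing in for (question, SeaLLM response, ChatGPT response).
items = [("Q1", "s1", "c1"), ("Q2", "s2", "c2"), ("Q3", "s3", "c3")]
blinded, key = blind_pairs(items, seed=42)
counts = tally(["A", "tie", "B"], key)
print(counts)
```

Because the per-item shuffle is what removes position and identity bias, the hidden key must stay out of the raters' view until all ratings are collected; per-language tables like the one above are then just this tally computed per language subset.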