Compared with ChatGPT-3.5, our SeaLLM-13b model performs 45% as well as ChatGPT for Thai.
For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across the languages.
Note that **GPT-4**, as it is built for global use, may not consider certain safety-related responses from ChatGPT harmful or sensitive in the local context.
Using GPT-4 to evaluate ChatGPT-3.5 can also be tricky, and not only for safety aspects, because the two models likely follow similar training strategies with similar data.
Meanwhile, most of the safety-related questions and expected responses in this test set are globally acceptable, whereas we leave those with conflicting and controversial opinions, as well as a more comprehensive human evaluation, for a future update.

<div class="row" style="display: flex; clear: both;">
  <img src="seallm_vs_chatgpt_by_lang.png" alt="Snow" style="float: left; width: 49.5%">