Update README.md
README.md
CHANGED
```diff
@@ -218,9 +218,9 @@ In this section, we report the evaluation results of SmolLM3 model. All evaluati
 
 We highlight the best score in bold and underline the second-best score.
 
-
+### Instruction Model
 
-
+#### No Extended Thinking
 Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.
 | Category | Metric | SmoLLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
 |---------|--------|------------|------------|-------------|------------|----------|
@@ -235,7 +235,7 @@ Evaluation results of non reasoning models and reasoning models in no thinking m
 
 (*): this is a tool calling finetune
 
-
+#### Extended Thinking
 Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
 | Category | Metric | SmoLLM3-3B | Qwen3-1.7B | Qwen3-4B |
 |---------|--------|------------|------------|----------|
@@ -249,10 +249,10 @@ Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
 | Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
 
 
-
+### Base Pre-Trained Model
 For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
 
-
+#### English benchmarks
 Note: All evaluations are zero-shot unless stated otherwise.
 
 | Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3-3.2B | Qwen3-1.7B-Base | Qwen3-4B-Base |
@@ -276,7 +276,7 @@ Note: All evaluations are zero-shot unless stated otherwise.
 | | Ruler 32k context | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
 | | Ruler 64k context | 67.85 | 64.90 | **72.93** | 57.18 | 60.29 |
 
-
+#### Multilingual benchmarks
 
 
 | Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
```
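The YaRN step mentioned for the Ruler 64k evaluation (extrapolating the 32k-context Qwen models to 64k) can be sketched as a RoPE-scaling override in the Hugging Face `transformers` config style. This is a minimal sketch under stated assumptions, not the authors' actual evaluation harness: the native 32k window, the 64k target, and the `Qwen/Qwen3-4B-Base` checkpoint name are illustrative choices.

```python
# Hedged sketch: YaRN context extrapolation expressed as a
# transformers-style rope_scaling dict. Assumption: extrapolating a
# model with a native 32k window to the 64k Ruler length means a
# YaRN factor of 64k / 32k.
ORIGINAL_CTX = 32_768  # assumed native context of the 32k Qwen models
TARGET_CTX = 65_536    # assumed Ruler 64k evaluation length

rope_scaling = {
    "rope_type": "yarn",
    "factor": TARGET_CTX / ORIGINAL_CTX,  # 2.0
    "original_max_position_embeddings": ORIGINAL_CTX,
}

# The dict would be applied to the model config before loading, e.g.
# (model name is a hypothetical example, not from the README):
#   config = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
#   config.rope_scaling = rope_scaling
#   model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Base", config=config)
print(rope_scaling["factor"])
```

Because YaRN rescales rotary position embeddings at inference time, this kind of config override lets a 32k-trained model be scored at 64k without retraining, which is why it only appears for the long-context row of the table.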