eliebak (HF Staff) committed · verified
Commit 9d66358 · Parent(s): 2593281

Update README.md

Files changed (1): README.md (+6 −6)
README.md CHANGED
@@ -218,9 +218,9 @@ In this section, we report the evaluation results of SmolLM3 model. All evaluati
 
 We highlight the best score in bold and underline the second-best score.
 
-## Instruction Model
+### Instruction Model
 
-### No Extended Thinking
+#### No Extended Thinking
 Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.
 | Category | Metric | SmoLLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
 |---------|--------|------------|------------|-------------|------------|----------|
@@ -235,7 +235,7 @@ Evaluation results of non reasoning models and reasoning models in no thinking m
 
 (*): this is a tool calling finetune
 
-### Extended Thinking
+#### Extended Thinking
 Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
 | Category | Metric | SmoLLM3-3B | Qwen3-1.7B | Qwen3-4B |
 |---------|--------|------------|------------|----------|
@@ -249,10 +249,10 @@ Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
 | Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
 
 
-## Base Pre-Trained Model
+### Base Pre-Trained Model
 For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
 
-### English benchmarks
+#### English benchmarks
 Note: All evaluations are zero-shot unless stated otherwise.
 
 | Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3-3.2B | Qwen3-1.7B-Base | Qwen3-4B-Base |
@@ -276,7 +276,7 @@ Note: All evaluations are zero-shot unless stated otherwise.
 | | Ruler 32k context | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
 | | Ruler 64k context | 67.85 | 64.90 | **72.93** | 57.18 | 60.29 |
 
-### Multilingual benchmarks
+#### Multilingual benchmarks
 
 
 | Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
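The README's note about applying YaRN to the 32k-context Qwen models for the Ruler 64k evaluation refers to RoPE scaling at inference time. A minimal sketch of how YaRN is typically enabled via a model's `config.json` (the values below are assumptions for illustration, not taken from this commit):

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768
  }
}
```

A factor of 2.0 over a 32k base corresponds to the 64k evaluation length; the exact settings used for the reported Ruler scores are not specified in this diff.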