loubnabnl (HF Staff) committed f5c1855 · verified · Parent(s): d5bc3db

Update README.md

Files changed (1): README.md (+61, −59)
In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with a 32k context window to extrapolate the context length.
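
As a concrete illustration of the long-context setup, the snippet below shows one way to apply YaRN by overriding a model's `rope_scaling` configuration in `transformers`. This is a minimal sketch rather than the exact evaluation command: the loading code is illustrative, and the `rope_type` key assumes a recent `transformers` release (older versions use `type`). The scaling factor follows from target / original context = 65536 / 32768 = 2.0.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative sketch: extend a 32k-context Qwen model to 64k with YaRN.
model_id = "Qwen/Qwen2.5-3B"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",  # recent transformers; older releases use the key "type"
    "factor": 2.0,        # 65536 / 32768 = 2.0
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 65536

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```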
 
We highlight the best score in bold and underline the second-best score.

## Base Pre-Trained Model

### English benchmarks

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Reasoning & Commonsense | HellaSwag | **76.15** | 74.19 | <u>75.52</u> | 60.52 | 74.37 |
| | ARC-CF (Average) | **65.61** | 59.81 | 58.58 | 55.88 | <u>62.11</u> |
| | Winogrande | 58.88 | **61.41** | 58.72 | 57.06 | <u>59.59</u> |
| | CommonsenseQA | <u>55.28</u> | 49.14 | **60.60** | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | <u>44.13</u> | 42.93 | 41.32 | 39.11 | **47.65** |
| | MMLU Pro CF | <u>19.61</u> | 16.66 | 16.42 | 18.04 | **24.92** |
| | MMLU Pro MCF | <u>32.70</u> | 31.32 | 25.07 | 30.39 | **41.07** |
| | PIQA | **78.89** | 78.35 | <u>78.51</u> | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | <u>42.00</u> | 36.40 | **42.40** |
| | BoolQ | **78.99** | 73.61 | <u>75.33</u> | 74.46 | 74.28 |
| **Math & Code** | | | | | | |
| Coding & math | HumanEval+ | 30.48 | 34.14 | 25.00 | <u>43.29</u> | **54.87** |
| | MBPP+ | 52.91 | 52.11 | 38.88 | <u>59.25</u> | **63.75** |
| | MATH (4-shot) | <u>46.10</u> | 40.10 | 7.44 | 41.64 | **51.20** |
| | GSM8k (5-shot) | 67.63 | <u>70.13</u> | 25.92 | 65.88 | **74.14** |
| **Long context** | | | | | | |
| | Ruler 32k context | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
| | Ruler 64k context | <u>67.85</u> | 64.90 | **72.93** | 57.18 | 60.29 |

### Multilingual benchmarks

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| Main supported languages | | | | | | |
| French | MLMM HellaSwag | **63.94** | 57.47 | 57.66 | 51.26 | <u>61.00</u> |
| | Belebele | 51.00 | <u>51.55</u> | 49.22 | 49.44 | **55.00** |
| | Global MMLU (CF) | <u>38.37</u> | 34.22 | 33.71 | 34.94 | **41.80** |
| | Flores-200 (5-shot) | 62.85 | 61.38 | <u>62.89</u> | 58.68 | **65.76** |
| Spanish | MLMM HellaSwag | **65.85** | 58.25 | 59.39 | 52.40 | <u>61.85</u> |
| | Belebele | 47.00 | <u>48.88</u> | 47.00 | 47.56 | **50.33** |
| | Global MMLU (CF) | <u>38.51</u> | 35.84 | 35.60 | 34.79 | **41.22** |
| | Flores-200 (5-shot) | 48.25 | <u>50.00</u> | 44.45 | 46.93 | **50.16** |
| German | MLMM HellaSwag | **59.56** | 49.99 | 53.19 | 46.10 | <u>56.43</u> |
| | Belebele | <u>48.44</u> | 47.88 | 46.22 | 48.00 | **53.44** |
| | Global MMLU (CF) | <u>35.10</u> | 33.19 | 32.60 | 32.73 | **38.70** |
| | Flores-200 (5-shot) | **56.60** | 50.63 | <u>54.95</u> | 52.58 | 50.48 |
| Italian | MLMM HellaSwag | **62.49** | 53.21 | 54.96 | 48.72 | <u>58.76</u> |
| | Belebele | <u>46.44</u> | 44.77 | 43.88 | 44.00 | **48.78** |
| | Global MMLU (CF) | <u>36.99</u> | 33.91 | 32.79 | 35.37 | **39.26** |
| | Flores-200 (5-shot) | <u>52.65</u> | **54.87** | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM HellaSwag | **63.22** | 57.38 | 56.84 | 50.73 | <u>59.89</u> |
| | Belebele | 47.67 | <u>49.22</u> | 45.00 | 44.00 | **50.00** |
| | Global MMLU (CF) | <u>36.88</u> | 34.72 | 33.05 | 35.26 | **40.66** |
| | Flores-200 (5-shot) | <u>60.93</u> | 57.68 | 54.28 | 56.58 | **63.43** |

The model has also been trained on Arabic (standard), Chinese and Russian data, but it has seen fewer tokens in these languages than in the six languages above. We report the performance on these languages for reference.

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | <u>45.33</u> | 42.33 | **51.78** |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | <u>29.37</u> | **31.85** |
| | Flores-200 (5-shot) | <u>40.22</u> | 39.44 | **44.43** | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | <u>49.56</u> | 48.78 | **53.22** |
| | Global MMLU (CF) | 36.16 | 33.79 | <u>39.57</u> | 38.56 | **44.55** |
| | Flores-200 (5-shot) | 29.17 | **33.21** | 31.89 | 25.70 | <u>32.50</u> |
| Russian | Belebele | <u>47.44</u> | 45.89 | <u>47.44</u> | 45.22 | **51.44** |
| | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |

## Instruction Model

Evaluation results for non-reasoning models, and for reasoning models evaluated in no-thinking mode. We highlight the best score in bold and underline the second-best score.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |

### Extended Thinking

Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
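
For reference, the two modes can be toggled through the chat template. The snippet below is a minimal sketch that assumes the `enable_thinking` flag described in the SmolLM3 model card; the prompt and generation settings are illustrative, not the harness used for the scores above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]

# enable_thinking=True makes the template emit a reasoning trace before
# the final answer; set it to False to reproduce the no-thinking rows.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=True,
)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```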

## Training