drdraq committed on
Commit f138eaa · verified · 1 Parent(s): c5f5c5d

Update README.md

Files changed (1)
  1. README.md +102 -0
README.md CHANGED
@@ -160,6 +160,108 @@ SmolLM3-3B is a decoder-only transformer with several advanced features:
160
  | **Activation** | SiLU (Sigmoid Linear Unit) |
161
  | **Embeddings** | Tied (input = output projection) |
162
163
+
164
+ ### Instruction Model
165
+
166
+ #### No Extended Thinking
167
+ Evaluation results for non-reasoning models and for reasoning models with extended thinking disabled. The best score is shown in **bold** and the second-best is <u>underlined</u>.
168
+ | Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
169
+ |---------|--------|------------|------------|-------------|------------|----------|
170
+ | High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
171
+ | Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
172
+ | Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
173
+ | Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
174
+ | Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
175
+ | Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
176
+ | Tool Calling | BFCL | <u>92.3</u> | - | <u>92.3</u>* | 89.5 | **95.0** |
177
+ | Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
178
+
179
+ (*) This score is from a tool-calling finetune.
180
+
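The bold/underline convention used in these tables can be reproduced with a small helper. This is a hypothetical illustration, not part of any evaluation tooling: it marks the best score in a row bold and the second-best underlined.

```python
def highlight(scores):
    """Format a row of scores: best in **bold**, second-best <u>underlined</u>."""
    ranked = sorted(set(scores), reverse=True)
    out = []
    for s in scores:
        if s == ranked[0]:
            out.append(f"**{s}**")
        elif len(ranked) > 1 and s == ranked[1]:
            out.append(f"<u>{s}</u>")
        else:
            out.append(str(s))
    return out

# The AIME 2025 row above, for example:
print(highlight([9.3, 2.9, 0.3, 8.0, 17.1]))
# ['<u>9.3</u>', '2.9', '0.3', '8.0', '**17.1**']
```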
181
+ #### Extended Thinking
182
+ Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
183
+ | Category | Metric | QORA-LLM-3B | Qwen3-1.7B | Qwen3-4B |
184
+ |---------|--------|------------|------------|----------|
185
+ | High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
186
+ | Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
187
+ | Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
188
+ | Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
189
+ | Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
190
+ | Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
191
+ | Tool Calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
192
+ | Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
193
+
194
+
195
+ ### Base Pre-Trained Model
196
+
197
+ #### English benchmarks
198
+ Note: All evaluations are zero-shot unless stated otherwise. For the Ruler 64k evaluation, we apply YaRN to the Qwen models (32k native context) to extrapolate the context length.
199
+
200
+ | Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
201
+ |---------|--------|---------------------|------------|--------------|------------------|---------------|
202
+ | Reasoning & Commonsense| HellaSwag | **76.15** | 74.19 |<u>75.52</u> | 60.52 | 74.37 |
203
+ | | ARC-CF (Average) | **65.61** | 59.81 | 58.58 | 55.88 | <u>62.11</u> |
204
+ | | Winogrande | 58.88 | **61.41** | 58.72 | 57.06 | <u>59.59</u> |
205
+ | | CommonsenseQA | <u>55.28</u> | 49.14 | **60.60** | 48.98 | 52.99 |
206
+ | Knowledge & Understanding | MMLU-CF (Average) | <u>44.13</u> | 42.93 | 41.32 | 39.11 | **47.65** |
207
+ | | MMLU Pro CF | <u>19.61</u> | 16.66 | 16.42 | 18.04 | **24.92** |
208
+ | | MMLU Pro MCF | <u>32.70</u> | 31.32 | 25.07 | 30.39 | **41.07** |
209
+ | | PIQA | **78.89** | 78.35 | <u>78.51</u> | 75.35 | 77.58 |
210
+ | | OpenBookQA | 40.60 | 40.20 | <u>42.00</u> | 36.40 | **42.40** |
211
+ | | BoolQ | **78.99** | 73.61 | <u>75.33</u> | 74.46 | 74.28 |
212
+ | **Math & Code** | | | | | | |
213
+ | | HumanEval+ | 30.48 | 34.14 | 25.00 | <u>43.29</u> | **54.87** |
214
+ | | MBPP+ | 52.91 | 52.11 | 38.88| <u>59.25</u> | **63.75** |
215
+ | | MATH (4-shot) | <u>46.10</u> | 40.10 | 7.44 | 41.64 | **51.20** |
216
+ | | GSM8k (5-shot) | 67.63 | <u>70.13</u> | 25.92 | 65.88 | **74.14** |
217
+ | **Long context** | | | | | | |
218
+ | | Ruler 32k | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
219
+ | | Ruler 64k | <u>67.85</u> | 64.90 | **72.93** | 57.18 | 60.29 |
220
+ | | Ruler 128k | 61.03 | <u>62.23</u> | **71.30** | 43.03 | 47.23 |
221
+
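The YaRN extrapolation mentioned in the note above amounts to a RoPE-scaling configuration. The sketch below follows the `rope_scaling` dictionary convention used by `transformers`; the scaling factor here is an assumption derived from the 64k target over the 32k native window:

```python
# Hypothetical YaRN rope-scaling config for extrapolating a 32k-context
# model to the Ruler 64k evaluation length.
original_ctx = 32_768  # Qwen models' native context window
target_ctx = 65_536    # Ruler 64k evaluation length

rope_scaling = {
    "rope_type": "yarn",
    "factor": target_ctx / original_ctx,  # 2.0
    "original_max_position_embeddings": original_ctx,
}
print(rope_scaling)
```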
222
+
223
+ #### Multilingual benchmarks
224
+
225
+
226
+ | Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
227
+ |---------|--------|---------------------|------------|--------------|------------------|---------------|
228
+ | Main supported languages | | | | | | |
229
+ | French| MLMM Hellaswag | **63.94** | 57.47 | 57.66 | 51.26 | <u>61.00</u> |
230
+ | | Belebele | 51.00 | <u>51.55</u> | 49.22 |49.44| **55.00** |
231
+ | | Global MMLU (CF) | <u>38.37</u> | 34.22 | 33.71 | 34.94 |**41.80** |
232
+ | | Flores-200 (5-shot) | 62.85| 61.38| <u>62.89</u> | 58.68 | **65.76** |
233
+ | Spanish| MLMM Hellaswag | **65.85** | 58.25 | 59.39 | 52.40 | <u>61.85</u> |
234
+ | | Belebele | 47.00 | <u>48.88</u> | 47.00 | 47.56 | **50.33** |
235
+ | | Global MMLU (CF) | <u>38.51</u> | 35.84 | 35.60 | 34.79 |**41.22** |
236
+ | | Flores-200 (5-shot) | <u>48.25</u>| 50.00| 44.45 | 46.93 | **50.16** |
237
+ | German| MLMM Hellaswag | **59.56** | 49.99| 53.19|46.10| <u>56.43</u>|
238
+ | | Belebele | <u>48.44</u> | 47.88 | 46.22 | 48.00 | **53.44**|
239
+ | | Global MMLU (CF) | <u>35.10</u> | 33.19 | 32.60 | 32.73 |**38.70** |
240
+ | | Flores-200 (5-shot) | **56.60**| 50.63| <u>54.95</u> | 52.58 | 50.48 |
241
+ | Italian| MLMM Hellaswag | **62.49** | 53.21 | 54.96 | 48.72 | <u>58.76</u> |
242
+ | | Belebele | <u>46.44</u> | 44.77 | 43.88 | 44.00 | **48.78** |
243
+ | | Global MMLU (CF) | <u>36.99</u> | 33.91 | 32.79 | 35.37 |**39.26** |
244
+ | | Flores-200 (5-shot) | <u>52.65</u> | **54.87** | 48.83 | 48.37 | 49.11 |
245
+ | Portuguese| MLMM Hellaswag | **63.22** | 57.38 | 56.84 | 50.73 | <u>59.89</u> |
246
+ | | Belebele | 47.67 | **49.22** | 45.00 | 44.00 | <u>49.00</u> |
247
+ | | Global MMLU (CF) | <u>36.88</u> | 34.72 | 33.05 | 35.26 |**40.66** |
248
+ | | Flores-200 (5-shot) | <u>60.93</u> |57.68| 54.28 | 56.58 | **63.43** |
249
+
250
+
251
+ The model has also been trained on Arabic (standard), Chinese, and Russian data, but it has seen fewer tokens in these languages than in the six above. We report performance on them for reference.
252
+ | Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
253
+ |---------|--------|---------------------|------------|--------------|------------------|---------------|
254
+ | Other supported languages | | | | | | |
255
+ | Arabic| Belebele | 40.22 | 44.22 | <u>45.33</u> | 42.33 | **51.78** |
256
+ | | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | <u>29.37</u> | **31.85** |
257
+ | | Flores-200 (5-shot) | <u>40.22</u> | 39.44 | **44.43** | 35.82 | 39.76 |
258
+ | Chinese| Belebele | 43.78 | 44.56 | <u>49.56</u> | 48.78 | **53.22** |
259
+ | | Global MMLU (CF) | 36.16 | 33.79 | <u>39.57</u> | 38.56 | **44.55** |
260
+ | | Flores-200 (5-shot) | 29.17 | **33.21** | 31.89 | 25.70 | <u>32.50</u> |
261
+ | Russian| Belebele | <u>47.44</u> | 45.89 | <u>47.44</u> | 45.22 | **51.44** |
262
+ | | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
263
+ | | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |
264
+
265
  ### Key Architectural Innovation: NoPE (No Position Encoding)
266
 
267
  SmolLM3 uses a 3:1 NoPE ratio — 75% of layers have **no positional encoding** at all. Only layers 3, 7, 11, 15, 19, 23, 27, 31, 35 apply RoPE. This reduces computational overhead and enables better long-context generalization.
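The layer pattern above can be sketched in a few lines. This is a minimal illustration, assuming 36 decoder layers with 0-indexed numbering (the layer count is an assumption consistent with the layer indices listed):

```python
# Sketch of the 3:1 NoPE pattern: every 4th layer (0-indexed 3, 7, ..., 35)
# applies RoPE; the remaining 75% of layers use no positional encoding.
NUM_LAYERS = 36  # assumed decoder layer count

rope_layers = [i for i in range(NUM_LAYERS) if i % 4 == 3]
nope_fraction = 1 - len(rope_layers) / NUM_LAYERS

print(rope_layers)    # [3, 7, 11, 15, 19, 23, 27, 31, 35]
print(nope_fraction)  # 0.75
```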