Update README.md

SmolLM3-3B is a decoder-only transformer with several advanced features:

| **Activation** | SiLU (Sigmoid Linear Unit) |
| **Embeddings** | Tied (input = output projection) |
### Instruction Model
#### No Extended Thinking
Evaluation results for non-reasoning models and reasoning models in no-thinking mode. The best score is in bold and the second-best is underlined.

| Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
| Tool Calling | BFCL | <u>92.3</u> | - | <u>92.3</u>* | 89.5 | **95.0** |
| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.5 | 46.8 | 49.5 | **65.1** |

(*): this score comes from a tool-calling finetune
#### Extended Thinking
Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

| Category | Metric | QORA-LLM-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
| Tool Calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
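
The reasoning-mode scores above were collected with extended thinking switched on. A minimal sketch of the toggle, assuming SmolLM3-style `/think` and `/no_think` system-prompt flags (the exact flag names are an assumption, not confirmed for this model):

```python
# Sketch: toggle extended thinking via a system-prompt flag.
# Assumption: the model follows the SmolLM3 convention of putting
# "/think" or "/no_think" at the top of the system prompt.
def build_system_prompt(base_prompt: str, thinking: bool) -> str:
    flag = "/think" if thinking else "/no_think"
    return f"{flag}\n{base_prompt}"

prompt = build_system_prompt("You are a helpful assistant.", thinking=True)
```

The same conversation can then be scored in both modes by flipping the flag, which is how this table differs from the no-thinking table above.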
### Base Pre-Trained Model
#### English benchmarks
Note: all evaluations are zero-shot unless stated otherwise. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
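
For reference, the YaRN extrapolation mentioned in the note boils down to a RoPE-scaling override with factor = target context / native context. A sketch of the 64k setting (the dictionary keys follow the Hugging Face `transformers` `rope_scaling` convention; treat them as an assumption for the exact Qwen config):

```python
# Sketch: YaRN rope_scaling override to stretch a 32k-context model to 64k.
# factor = target_context / native_context = 65536 / 32768 = 2.0
native_ctx = 32_768
target_ctx = 65_536
rope_scaling = {
    "rope_type": "yarn",
    "factor": target_ctx / native_ctx,
    "original_max_position_embeddings": native_ctx,
}
```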
| Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Reasoning & Commonsense | HellaSwag | **76.15** | 74.19 | <u>75.52</u> | 60.52 | 74.37 |
| | ARC-CF (Average) | **65.61** | 59.81 | 58.58 | 55.88 | <u>62.11</u> |
| | Winogrande | 58.88 | **61.41** | 58.72 | 57.06 | <u>59.59</u> |
| | CommonsenseQA | <u>55.28</u> | 49.14 | **60.60** | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | <u>44.13</u> | 42.93 | 41.32 | 39.11 | **47.65** |
| | MMLU Pro CF | <u>19.61</u> | 16.66 | 16.42 | 18.04 | **24.92** |
| | MMLU Pro MCF | <u>32.70</u> | 31.32 | 25.07 | 30.39 | **41.07** |
| | PIQA | **78.89** | 78.35 | <u>78.51</u> | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | <u>42.00</u> | 36.40 | **42.40** |
| | BoolQ | **78.99** | 73.61 | <u>75.33</u> | 74.46 | 74.28 |
| Math & Code | HumanEval+ | 30.48 | 34.14 | 25.00 | <u>43.29</u> | **54.87** |
| | MBPP+ | 52.91 | 52.11 | 38.88 | <u>59.25</u> | **63.75** |
| | MATH (4-shot) | <u>46.10</u> | 40.10 | 7.44 | 41.64 | **51.20** |
| | GSM8k (5-shot) | 67.63 | <u>70.13</u> | 25.92 | 65.88 | **74.14** |
| Long context | Ruler 32k | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
| | Ruler 64k | <u>67.85</u> | 64.90 | **72.93** | 57.18 | 60.29 |
| | Ruler 128k | 61.03 | <u>62.23</u> | **71.30** | 43.03 | 47.23 |
#### Multilingual benchmarks
| Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Main supported languages | | | | | | |
| French | MLMM Hellaswag | **63.94** | 57.47 | 57.66 | 51.26 | <u>61.00</u> |
| | Belebele | 51.00 | <u>51.55</u> | 49.22 | 49.44 | **55.00** |
| | Global MMLU (CF) | <u>38.37</u> | 34.22 | 33.71 | 34.94 | **41.80** |
| | Flores-200 (5-shot) | 62.85 | 61.38 | <u>62.89</u> | 58.68 | **65.76** |
| Spanish | MLMM Hellaswag | **65.85** | 58.25 | 59.39 | 52.40 | <u>61.85</u> |
| | Belebele | 47.00 | <u>48.88</u> | 47.00 | 47.56 | **50.33** |
| | Global MMLU (CF) | <u>38.51</u> | 35.84 | 35.60 | 34.79 | **41.22** |
| | Flores-200 (5-shot) | <u>48.25</u> | 50.00 | 44.45 | 46.93 | **50.16** |
| German | MLMM Hellaswag | **59.56** | 49.99 | 53.19 | 46.10 | <u>56.43</u> |
| | Belebele | <u>48.44</u> | 47.88 | 46.22 | 48.00 | **53.44** |
| | Global MMLU (CF) | <u>35.10</u> | 33.19 | 32.60 | 32.73 | **38.70** |
| | Flores-200 (5-shot) | **56.60** | 50.63 | <u>54.95</u> | 52.58 | 50.48 |
| Italian | MLMM Hellaswag | **62.49** | 53.21 | 54.96 | 48.72 | <u>58.76</u> |
| | Belebele | <u>46.44</u> | 44.77 | 43.88 | 44.00 | **48.78** |
| | Global MMLU (CF) | <u>36.99</u> | 33.91 | 32.79 | 35.37 | **39.26** |
| | Flores-200 (5-shot) | <u>52.65</u> | **54.87** | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM Hellaswag | **63.22** | 57.38 | 56.84 | 50.73 | <u>59.89</u> |
| | Belebele | 47.67 | **49.22** | 45.00 | 44.00 | <u>49.00</u> |
| | Global MMLU (CF) | <u>36.88</u> | 34.72 | 33.05 | 35.26 | **40.66** |
| | Flores-200 (5-shot) | <u>60.93</u> | 57.68 | 54.28 | 56.58 | **63.43** |
The model has also been trained on Arabic (standard), Chinese, and Russian data, but has seen fewer tokens in these languages compared to the six above. We report performance on these languages for information.
| Category | Metric | QORA-LLM-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | <u>45.33</u> | 42.33 | **51.78** |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | <u>29.37</u> | **31.85** |
| | Flores-200 (5-shot) | <u>40.22</u> | 39.44 | **44.43** | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | <u>49.56</u> | 48.78 | **53.22** |
| | Global MMLU (CF) | 36.16 | 33.79 | <u>39.57</u> | 38.56 | **44.55** |
| | Flores-200 (5-shot) | 29.17 | **33.21** | 31.89 | 25.70 | <u>32.50</u> |
| Russian | Belebele | <u>47.44</u> | 45.89 | <u>47.44</u> | 45.22 | **51.44** |
| | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |
### Key Architectural Innovation: NoPE (No Position Encoding)
SmolLM3 uses a 3:1 NoPE ratio — 75% of layers have **no positional encoding** at all. Only layers 3, 7, 11, 15, 19, 23, 27, 31, 35 apply RoPE. This reduces computational overhead and enables better long-context generalization.
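
The layer pattern above can be sketched in a few lines: every fourth layer applies RoPE and the rest use no positional encoding. Note the 36-layer depth is inferred from the layer list, not stated here:

```python
# Sketch of the 3:1 NoPE ratio: only every 4th layer (0-indexed) applies RoPE.
NUM_LAYERS = 36  # assumed depth, implied by the layer list above
rope_layers = [i for i in range(NUM_LAYERS) if (i + 1) % 4 == 0]
# rope_layers == [3, 7, 11, 15, 19, 23, 27, 31, 35]
nope_fraction = 1 - len(rope_layers) / NUM_LAYERS  # 0.75 of layers are NoPE
```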