In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.

We highlight the best score in bold and underline the second-best score.
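
As a concrete illustration of the YaRN setup mentioned above, here is a minimal sketch that extends a 32k-context Qwen model to 64k by overriding its RoPE scaling configuration. The `rope_scaling` keys follow the Qwen model cards and should be treated as assumptions; the exact evaluation setup may differ.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Minimal sketch: extrapolate Qwen2.5-3B from its native 32k context to 64k
# with YaRN. Key names follow the Qwen model cards and may vary across
# transformers versions (assumption, not the exact evaluation harness).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B")
config.rope_scaling = {
    "type": "yarn",                             # YaRN RoPE extrapolation
    "factor": 2.0,                              # 32k * 2 -> 64k target context
    "original_max_position_embeddings": 32768,  # the model's native context
}
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", config=config)
```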

## Base Pre-Trained Model

### English benchmarks

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Reasoning & Commonsense | HellaSwag | **76.15** | 74.19 | <u>75.52</u> | 60.52 | 74.37 |
| | ARC-CF (Average) | **65.61** | 59.81 | 58.58 | 55.88 | <u>62.11</u> |
| | Winogrande | 58.88 | **61.41** | 58.72 | 57.06 | <u>59.59</u> |
| | CommonsenseQA | <u>55.28</u> | 49.14 | **60.60** | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | <u>44.13</u> | 42.93 | 41.32 | 39.11 | **47.65** |
| | MMLU Pro CF | <u>19.61</u> | 16.66 | 16.42 | 18.04 | **24.92** |
| | MMLU Pro MCF | <u>32.70</u> | 31.32 | 25.07 | 30.39 | **41.07** |
| | PIQA | **78.89** | 78.35 | <u>78.51</u> | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | <u>42.00</u> | 36.40 | **42.40** |
| | BoolQ | **78.99** | 73.61 | <u>75.33</u> | 74.46 | 74.28 |
| **Math & Code** | | | | | | |
| Coding & math | HumanEval+ | 30.48 | 34.14 | 25.00 | <u>43.29</u> | **54.87** |
| | MBPP+ | 52.91 | 52.11 | 38.88 | <u>59.25</u> | **63.75** |
| | MATH (4-shot) | <u>46.10</u> | 40.10 | 7.44 | 41.64 | **51.20** |
| | GSM8k (5-shot) | 67.63 | <u>70.13</u> | 25.92 | 65.88 | **74.14** |
| **Long context** | | | | | | |
| | Ruler 32k context | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
| | Ruler 64k context | <u>67.85</u> | 64.90 | **72.93** | 57.18 | 60.29 |

### Multilingual benchmarks

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Main supported languages | | | | | | |
| French | MLMM HellaSwag | **63.94** | 57.47 | 57.66 | 51.26 | <u>61.00</u> |
| | Belebele | 51.00 | <u>51.55</u> | 49.22 | 49.44 | **55.00** |
| | Global MMLU (CF) | <u>38.37</u> | 34.22 | 33.71 | 34.94 | **41.80** |
| | Flores-200 (5-shot) | 62.85 | 61.38 | <u>62.89</u> | 58.68 | **65.76** |
| Spanish | MLMM HellaSwag | **65.85** | 58.25 | 59.39 | 52.40 | <u>61.85</u> |
| | Belebele | 47.00 | <u>48.88</u> | 47.00 | 47.56 | **50.33** |
| | Global MMLU (CF) | <u>38.51</u> | 35.84 | 35.60 | 34.79 | **41.22** |
| | Flores-200 (5-shot) | 48.25 | <u>50.00</u> | 44.45 | 46.93 | **50.16** |
| German | MLMM HellaSwag | **59.56** | 49.99 | 53.19 | 46.10 | <u>56.43</u> |
| | Belebele | <u>48.44</u> | 47.88 | 46.22 | 48.00 | **53.44** |
| | Global MMLU (CF) | <u>35.10</u> | 33.19 | 32.60 | 32.73 | **38.70** |
| | Flores-200 (5-shot) | **56.60** | 50.63 | <u>54.95</u> | 52.58 | 50.48 |
| Italian | MLMM HellaSwag | **62.49** | 53.21 | 54.96 | 48.72 | <u>58.76</u> |
| | Belebele | <u>46.44</u> | 44.77 | 43.88 | 44.00 | **48.78** |
| | Global MMLU (CF) | <u>36.99</u> | 33.91 | 32.79 | 35.37 | **39.26** |
| | Flores-200 (5-shot) | <u>52.65</u> | **54.87** | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM HellaSwag | **63.22** | 57.38 | 56.84 | 50.73 | <u>59.89</u> |
| | Belebele | 47.67 | <u>49.22</u> | 45.00 | 44.00 | **50.00** |
| | Global MMLU (CF) | <u>36.88</u> | 34.72 | 33.05 | 35.26 | **40.66** |
| | Flores-200 (5-shot) | <u>60.93</u> | 57.68 | 54.28 | 56.58 | **63.43** |

The model has also been trained on Arabic (Modern Standard), Chinese, and Russian data, but it has seen fewer tokens in these languages than in the six above. We report the performance on these languages for reference.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | <u>45.33</u> | 42.33 | **51.78** |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | <u>29.37</u> | **31.85** |
| | Flores-200 (5-shot) | <u>40.22</u> | 39.44 | **44.43** | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | <u>49.56</u> | 48.78 | **53.22** |
| | Global MMLU (CF) | 36.16 | 33.79 | <u>39.57</u> | 38.56 | **44.55** |
| | Flores-200 (5-shot) | 29.17 | **33.21** | 31.89 | 25.70 | <u>32.50</u> |
| Russian | Belebele | <u>47.44</u> | 45.89 | <u>47.44</u> | 45.22 | **51.44** |
| | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |

## Instruction Model

Evaluation results for non-reasoning models and for reasoning models in no-thinking mode. We highlight the best score in bold and underline the second-best score.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.5 | 46.8 | 49.5 | **65.1** |
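
The table above and the one that follows correspond to the two inference modes. As a minimal sketch of how they are toggled, assuming the `enable_thinking` chat-template flag described in the SmolLM3 model card (the flag name is an assumption, not verified against this exact release):

```python
from transformers import AutoTokenizer

# Minimal sketch of toggling SmolLM3's reasoning mode, assuming the chat
# template exposes an `enable_thinking` switch (assumption from the model card).
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [{"role": "user", "content": "Solve 12 * 17 step by step."}]

# No-thinking mode: the model answers directly (table above).
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Extended-thinking mode: the model first emits a reasoning trace (table below).
prompt_reasoning = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
```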

### Extended Thinking

Evaluation results in reasoning mode for SmolLM3 and the Qwen3 models:

| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |

## Training