loubnabnl (HF Staff) committed f5c1855 · verified · Parent(s): d5bc3db

Update README.md

Files changed (1): README.md (+61, −59)
In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with a 32k context window to extrapolate the context length.
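
As a concrete illustration of the long-context setup, the snippet below shows one way to apply YaRN by overriding a model's `rope_scaling` configuration in `transformers`. This is a minimal sketch rather than the exact evaluation command: the loading code is illustrative, and the `rope_type` key assumes a recent `transformers` release (older versions use `type`). The scaling factor follows from target / original context = 65536 / 32768 = 2.0.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Illustrative sketch: extend a 32k-context Qwen model to 64k with YaRN.
model_id = "Qwen/Qwen2.5-3B"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",  # recent transformers; older releases use the key "type"
    "factor": 2.0,        # 65536 / 32768 = 2.0
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 65536

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```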
 
We highlight the best score in bold and underline the second-best score.

## Base Pre-Trained Model

### English benchmarks

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Reasoning & Commonsense | HellaSwag | **76.15** | 74.19 | <u>75.52</u> | 60.52 | 74.37 |
| | ARC-CF (Average) | **65.61** | 59.81 | 58.58 | 55.88 | <u>62.11</u> |
| | Winogrande | 58.88 | **61.41** | 58.72 | 57.06 | <u>59.59</u> |
| | CommonsenseQA | <u>55.28</u> | 49.14 | **60.60** | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | <u>44.13</u> | 42.93 | 41.32 | 39.11 | **47.65** |
| | MMLU Pro CF | <u>19.61</u> | 16.66 | 16.42 | 18.04 | **24.92** |
| | MMLU Pro MCF | <u>32.70</u> | 31.32 | 25.07 | 30.39 | **41.07** |
| | PIQA | **78.89** | 78.35 | <u>78.51</u> | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | <u>42.00</u> | 36.40 | **42.40** |
| | BoolQ | **78.99** | 73.61 | <u>75.33</u> | 74.46 | 74.28 |
| **Math & Code** | | | | | | |
| Coding & math | HumanEval+ | 30.48 | 34.14 | 25.00 | <u>43.29</u> | **54.87** |
| | MBPP+ | 52.91 | 52.11 | 38.88 | <u>59.25</u> | **63.75** |
| | MATH (4-shot) | <u>46.10</u> | 40.10 | 7.44 | 41.64 | **51.20** |
| | GSM8k (5-shot) | 67.63 | <u>70.13</u> | 25.92 | 65.88 | **74.14** |
| **Long context** | | | | | | |
| | Ruler 32k context | 76.35 | 75.93 | <u>77.58</u> | 70.63 | **83.98** |
| | Ruler 64k context | <u>67.85</u> | 64.90 | **72.93** | 57.18 | 60.29 |

### Multilingual benchmarks

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| Main supported languages | | | | | | |
| French | MLMM HellaSwag | **63.94** | 57.47 | 57.66 | 51.26 | <u>61.00</u> |
| | Belebele | 51.00 | <u>51.55</u> | 49.22 | 49.44 | **55.00** |
| | Global MMLU (CF) | <u>38.37</u> | 34.22 | 33.71 | 34.94 | **41.80** |
| | Flores-200 (5-shot) | 62.85 | 61.38 | <u>62.89</u> | 58.68 | **65.76** |
| Spanish | MLMM HellaSwag | **65.85** | 58.25 | 59.39 | 52.40 | <u>61.85</u> |
| | Belebele | 47.00 | <u>48.88</u> | 47.00 | 47.56 | **50.33** |
| | Global MMLU (CF) | <u>38.51</u> | 35.84 | 35.60 | 34.79 | **41.22** |
| | Flores-200 (5-shot) | 48.25 | <u>50.00</u> | 44.45 | 46.93 | **50.16** |
| German | MLMM HellaSwag | **59.56** | 49.99 | 53.19 | 46.10 | <u>56.43</u> |
| | Belebele | <u>48.44</u> | 47.88 | 46.22 | 48.00 | **53.44** |
| | Global MMLU (CF) | <u>35.10</u> | 33.19 | 32.60 | 32.73 | **38.70** |
| | Flores-200 (5-shot) | **56.60** | 50.63 | <u>54.95</u> | 52.58 | 50.48 |
| Italian | MLMM HellaSwag | **62.49** | 53.21 | 54.96 | 48.72 | <u>58.76</u> |
| | Belebele | <u>46.44</u> | 44.77 | 43.88 | 44.00 | **48.78** |
| | Global MMLU (CF) | <u>36.99</u> | 33.91 | 32.79 | 35.37 | **39.26** |
| | Flores-200 (5-shot) | <u>52.65</u> | **54.87** | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM HellaSwag | **63.22** | 57.38 | 56.84 | 50.73 | <u>59.89</u> |
| | Belebele | 47.67 | <u>49.22</u> | 45.00 | 44.00 | **50.00** |
| | Global MMLU (CF) | <u>36.88</u> | 34.72 | 33.05 | 35.26 | **40.66** |
| | Flores-200 (5-shot) | <u>60.93</u> | 57.68 | 54.28 | 56.58 | **63.43** |

The model has also been trained on Arabic (standard), Chinese and Russian data, but it has seen fewer tokens in these languages than in the six languages above. We report the performance on these languages for reference.

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | <u>45.33</u> | 42.33 | **51.78** |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | <u>29.37</u> | **31.85** |
| | Flores-200 (5-shot) | <u>40.22</u> | 39.44 | **44.43** | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | <u>49.56</u> | 48.78 | **53.22** |
| | Global MMLU (CF) | 36.16 | 33.79 | <u>39.57</u> | 38.56 | **44.55** |
| | Flores-200 (5-shot) | 29.17 | **33.21** | 31.89 | 25.70 | <u>32.50</u> |
| Russian | Belebele | <u>47.44</u> | 45.89 | <u>47.44</u> | 45.22 | **51.44** |
| | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |

## Instruction Model

Evaluation results for non-reasoning models, and for reasoning models evaluated in no-thinking mode. We highlight the best score in bold and underline the second-best score.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |

### Extended Thinking

Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
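
For reference, the two modes can be toggled through the chat template. The snippet below is a minimal sketch that assumes the `enable_thinking` flag described in the SmolLM3 model card; the prompt and generation settings are illustrative, not the harness used for the scores above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]

# enable_thinking=True makes the template emit a reasoning trace before
# the final answer; set it to False to reproduce the no-thinking rows.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=True,
)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```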

## Training