AngelRaychev committed on
Commit 8c25236 · verified · 1 Parent(s): 38ac93a

Upload folder using huggingface_hub

README.md ADDED
---
library_name: transformers
license: apache-2.0
language:
- en
---

# SmolLM2

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/XtSR4NkriicR6fGiWGowZ.png)

## Table of Contents

1. [Model Summary](#model-summary)
2. [Limitations](#limitations)
3. [Training](#training)
4. [License](#license)
5. [Citation](#citation)

## Model Summary

SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device. More details in our paper: https://arxiv.org/abs/2502.02737

SmolLM2 demonstrates significant advances over its predecessor SmolLM1, particularly in instruction following, knowledge, and reasoning. The 135M model was trained on 2 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, and The Stack, along with new filtered datasets we curated and will release soon. We developed the instruct version through supervised fine-tuning (SFT) using a combination of public datasets and our own curated datasets. We then applied Direct Preference Optimization (DPO) using [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).

The instruct model additionally supports tasks such as text rewriting, summarization, and function calling (for the 1.7B) thanks to datasets developed by [Argilla](https://huggingface.co/argilla) such as [Synth-APIGen-v0.1](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1).
You can find the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk and the finetuning code at https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2
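For the instruct variants, the canonical way to build prompts is `tokenizer.apply_chat_template`, which applies the template shipped with the tokenizer. As a rough illustration of what that template produces, here is a minimal hand-built sketch of the ChatML-style layout implied by the `<|im_start|>`/`<|im_end|>` special tokens in this repo (the tokenizer's own template remains authoritative):

```python
# Hand-built sketch of a ChatML-style prompt, for illustration only.
# In practice, use tokenizer.apply_chat_template on the instruct checkpoints.
def build_chatml_prompt(messages):
    """Render a list of {role, content} dicts with <|im_start|>/<|im_end|> tags."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave the prompt open for the assistant's reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "user", "content": "What is gravity?"},
])
print(prompt)
```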
### How to use

```bash
pip install transformers
```

#### Running the model on CPU/GPU/multi GPU
* _Using full precision_
```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-135M"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

* _Using `torch.bfloat16`_
```python
# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
checkpoint = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for fp16 use `torch_dtype=torch.float16` instead
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
```python
>>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Memory footprint: 723.56 MB
```
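As a rule of thumb, the weight memory of a dense model is roughly parameter count × bytes per parameter; activations, the KV cache, and framework overhead come on top, so `model.get_memory_footprint()` can report somewhat different numbers. A quick sketch, using the nominal 135M parameter count of this checkpoint as an assumption:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
n_params = 135_000_000  # nominal size of this checkpoint (assumption)

bytes_per_param = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {n_params * nbytes / 1e6:.0f} MB")
```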

## Evaluation

In this section, we report the evaluation results of SmolLM2. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.

### Base pre-trained model

| Metric | SmolLM2-135M-8k | SmolLM-135M |
|:-------------------|:----------------:|:------------:|
| HellaSwag | **42.1** | 41.2 |
| ARC (Average) | **43.9** | 42.4 |
| PIQA | 68.4 | 68.4 |
| MMLU (cloze) | **31.5** | 30.2 |
| CommonsenseQA | **33.9** | 32.7 |
| TriviaQA | 4.1 | **4.3** |
| Winogrande | 51.3 | 51.3 |
| OpenBookQA | **34.6** | 34.0 |
| GSM8K (5-shot) | **1.4** | 1.0 |

### Instruction model

| Metric | SmolLM2-135M-Instruct | SmolLM-135M-Instruct |
|:-----------------------------|:---------------------:|:--------------------:|
| IFEval (Average prompt/inst) | **29.9** | 17.2 |
| MT-Bench | **1.98** | 1.68 |
| HellaSwag | **40.9** | 38.9 |
| ARC (Average) | **37.3** | 33.9 |
| PIQA | **66.3** | 64.0 |
| MMLU (cloze) | **29.3** | 28.3 |
| BBH (3-shot) | **28.2** | 25.2 |
| GSM8K (5-shot) | 1.4 | 1.4 |

## Limitations

SmolLM2 models primarily understand and generate content in English. They can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

## Training

### Model

- **Architecture:** Transformer decoder
- **Pretraining tokens:** 2T
- **Precision:** bfloat16

### Hardware

- **GPUs:** 64 H100

### Software

- **Training Framework:** [nanotron](https://github.com/huggingface/nanotron/tree/main)

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation
```bibtex
@misc{allal2025smollm2smolgoesbig,
      title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
      author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
      year={2025},
      eprint={2502.02737},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02737},
}
```
config.json ADDED
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.041666666666666664,
  "intermediate_size": 1536,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.1",
  "use_cache": true,
  "vocab_size": 49152
}
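The config above fully determines the parameter count. A quick sanity check in plain arithmetic, assuming the standard Llama layout the config indicates (grouped-query attention, tied embeddings counted once):

```python
# Parameter count of SmolLM2-135M derived from config.json values
# (standard Llama layout; tie_word_embeddings=true, so the embedding
# matrix is counted once).
hidden, layers, heads, kv_heads = 576, 30, 9, 3
inter, vocab = 1536, 49152
head_dim = hidden // heads    # 64
kv_dim = kv_heads * head_dim  # 192: 3 query heads share each KV head (GQA)

attn = hidden * hidden * 2 + hidden * kv_dim * 2  # q,o + k,v projections
mlp = hidden * inter * 3                          # gate, up, down projections
norms = 2 * hidden                                # two RMSNorms per layer
per_layer = attn + mlp + norms

total = vocab * hidden + layers * per_layer + hidden  # embeddings + blocks + final norm
print(f"{total:,} parameters")  # 134,515,008, i.e. ~135M
```

In bf16 this works out to about 269 MB of weights, which matches the size of the `model.safetensors` file in this repo.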
generation_config.json ADDED
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.40.1"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:80521b40281d6ce74e35c9282c22539e75aa0ac8578892b2a59955ef78d55da1
size 269060552
special_tokens_map.json ADDED
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<repo_name>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<reponame>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<file_sep>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<filename>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<gh_stars>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "8": {
      "content": "<issue_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "9": {
      "content": "<issue_comment>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "10": {
      "content": "<issue_closed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "11": {
      "content": "<jupyter_start>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "12": {
      "content": "<jupyter_text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "13": {
      "content": "<jupyter_code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "14": {
      "content": "<jupyter_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<jupyter_script>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "16": {
      "content": "<empty_output>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "model_max_length": 8192,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff