JonathanMiddleton committed on
Commit f119a93 · verified · 1 Parent(s): d4a5488

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +131 -0
  2. chat_template.jinja +50 -0
  3. special_tokens_map.json +164 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +199 -0
README.md ADDED
@@ -0,0 +1,131 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - tokenizer
+ - bpe
+ - byte-level
+ - chatml
+ - tool-use
+ - code
+ - python
+ pipeline_tag: text-generation
+ datasets:
+ - nvidia/Nemotron-CC-HQ
+ - HuggingFaceTB/smoltalk
+ - sahil2801/CodeAlpaca-20k
+ ---
+
+ # Daisy Tokenizer v2
+
+ A custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
+
+ ## Details
+
+ | Property | Value |
+ |----------|-------|
+ | **Vocabulary size** | 49,152 |
+ | **Algorithm** | Byte-level BPE |
+ | **Pre-tokenizer** | Llama-3 style regex |
+ | **Chat format** | ChatML |
+ | **Max length** | 131,072 tokens |
+ | **Training date** | 2026-01-14 |
+
+ ## Features
+
+ - **Python-optimized**: Trained on Python code for efficient tokenization
+ - **Tool calling**: Native support for `<|tool_call|>` / `<|tool_result|>` patterns
+ - **Inline computation**: Support for `<|python|>` / `<|output|>` for calculator-style reasoning
+ - **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
+ - **No UNK tokens**: Byte-level fallback handles any Unicode input
+
+ ## Special Tokens
+
+ | Token | ID | Purpose |
+ |----------------------|-------|----------------------------|
+ | `<\|endoftext\|>` | 49131 | End of sequence / BOS |
+ | `<\|pad\|>` | 49132 | Padding token |
+ | `<\|im_start\|>` | 49133 | Start of message (ChatML) |
+ | `<\|im_end\|>` | 49134 | End of message (ChatML) |
+ | `<\|tool_call\|>` | 49135 | Start of tool call |
+ | `<\|/tool_call\|>` | 49136 | End of tool call |
+ | `<\|tool_result\|>` | 49137 | Start of tool result |
+ | `<\|/tool_result\|>` | 49138 | End of tool result |
+ | `<\|python\|>` | 49139 | Start of Python expression |
+ | `<\|/python\|>` | 49140 | End of Python expression |
+ | `<\|output\|>` | 49141 | Start of computed output |
+ | `<\|/output\|>` | 49142 | End of computed output |
+ | `<\|think\|>` | 49143 | Start of reasoning block |
+ | `<\|/think\|>` | 49144 | End of reasoning block |
+ | `<\|system\|>` | 49145 | System role marker |
+ | `<\|user\|>` | 49146 | User role marker |
+ | `<\|assistant\|>` | 49147 | Assistant role marker |
+ | `<\|reserved_0\|>` | 49148 | Reserved |
+ | `<\|reserved_1\|>` | 49149 | Reserved |
+ | `<\|reserved_2\|>` | 49150 | Reserved |
+ | `<\|reserved_3\|>` | 49151 | Reserved |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("jonathanmiddleton/daisy-tokenizer-v2")
+
+ # Basic encoding
+ tokens = tokenizer.encode("Hello, world!")
+
+ # Chat formatting
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Hello!"},
+     {"role": "assistant", "content": "Hi there! How can I help you?"},
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False)
+ ```
+
+ ## Chat Template Format
+
+ ```
+ <|im_start|>system
+ {system_message}<|im_end|>
+ <|im_start|>user
+ {user_message}<|im_end|>
+ <|im_start|>assistant
+ {assistant_message}<|im_end|>
+ ```
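As a rough illustration of the layout above, the ChatML string can be assembled by hand without loading the tokenizer. This is a minimal sketch mirroring the documented format; the `to_chatml` helper is hypothetical, and the shipped Jinja template remains authoritative (including its whitespace handling).

```python
# Hypothetical helper: assemble the documented ChatML layout by hand.
# Assumes one "\n" between role marker and content, and one between messages.
def to_chatml(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts) + "\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
rendered = to_chatml(messages)
```

With the real tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False)` should produce an equivalent string.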
+
+ ### Tool Calling Example
+
+ ```
+ <|im_start|>assistant
+ Let me calculate that for you.
+ <|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
+ <|tool_result|>4<|/tool_result|>
+ The answer is 4.<|im_end|>
+ ```
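On the consuming side, the JSON payload between the tool-call delimiters can be extracted with a regex. The marker strings come from this tokenizer's special tokens; the parsing code itself is an illustrative sketch, not part of the released artifacts.

```python
import json
import re

# The delimiters match this tokenizer's <|tool_call|> / <|/tool_call|> tokens;
# the pipes must be escaped in the regex. DOTALL lets payloads span lines.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(text):
    """Return every tool-call payload in `text` as a parsed dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

sample = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
calls = extract_tool_calls(sample)
```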
+
+ ## Compression Ratios
+
+ | Content Type | Chars/Token | vs GPT-2 |
+ |--------------|-------------|----------|
+ | English prose | ~4.0 | baseline |
+ | Python code | ~3.8 | +15% better |
+
+ Run validation to see detailed compression ratios:
+ ```bash
+ python tools/validate_tokenizer.py --tokenizer tokenizer/daisy-v2
+ ```
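The chars-per-token figures above are presumably total characters divided by total tokens over a sample corpus. A minimal sketch of that metric, using a toy whitespace "encoder" as a stand-in so it runs without the tokenizer:

```python
# Sketch of the chars-per-token metric: total characters / total tokens.
def chars_per_token(texts, encode):
    chars = sum(len(t) for t in texts)
    tokens = sum(len(encode(t)) for t in texts)
    return chars / tokens

# Toy stand-in encoder (whitespace split) just to make the sketch runnable;
# with the real tokenizer you would pass tokenizer.encode instead.
ratio = chars_per_token(["Hello, world!"], lambda t: t.split())
```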
+
+ ## Training Data
+
+ - **General text**: nvidia/Nemotron-CC-HQ (~60%)
+ - **Python code**: bigcode/the-stack-dedup, CodeAlpaca (~25%)
+ - **Instructions**: HuggingFaceTB/smoltalk, OpenHermes (~15%)
+
+ ## License
+
+ Apache 2.0
+
chat_template.jinja ADDED
@@ -0,0 +1,50 @@
+ {#- Daisy Chat Template v2 -#}
+ {#- Supports: ChatML format, tool calling, multipart content -#}
+
+ {#- Macro to render content (string or multipart) -#}
+ {%- macro render_content(content) -%}
+ {%- if content is string -%}
+ {{ content }}
+ {%- elif content is iterable -%}
+ {%- for part in content -%}
+ {%- if part.type == 'text' -%}
+ {{ part.text }}
+ {%- elif part.type == 'tool_call' -%}
+ <|tool_call|>{{ part.text }}<|/tool_call|>
+ {%- elif part.type == 'tool_result' -%}
+ <|tool_result|>{{ part.text }}<|/tool_result|>
+ {%- elif part.type == 'python' -%}
+ <|python|>{{ part.text }}<|/python|>
+ {%- elif part.type == 'output' -%}
+ <|output|>{{ part.text }}<|/output|>
+ {%- elif part.type == 'think' -%}
+ <|think|>{{ part.text }}<|/think|>
+ {%- endif -%}
+ {%- endfor -%}
+ {%- else -%}
+ {{ content }}
+ {%- endif -%}
+ {%- endmacro -%}
+
+ {#- Main message loop -#}
+ {%- for message in messages -%}
+ {%- if message.role == 'system' -%}
+ <|im_start|>system
+ {{ message.content }}<|im_end|>
+ {% elif message.role == 'user' -%}
+ <|im_start|>user
+ {{ message.content }}<|im_end|>
+ {% elif message.role == 'assistant' -%}
+ <|im_start|>assistant
+ {% generation %}{{ render_content(message.content) }}{% endgeneration %}<|im_end|>
+ {% elif message.role == 'tool' -%}
+ <|tool_result|>{{ message.content }}<|/tool_result|>
+ {%- endif -%}
+ {%- endfor -%}
+
+ {#- Generation prompt -#}
+ {%- if add_generation_prompt -%}
+ <|im_start|>assistant
+ {% generation %}{% endgeneration %}
+ {%- endif -%}
+
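The `render_content` macro's multipart branch can be mirrored in plain Python, which makes its behavior easy to check. The tag names below match the template's `part.type` values; the function is illustrative only and is not shipped with the tokenizer.

```python
# Pure-Python mirror of the template's render_content macro.
# Types in WRAPPED are emitted between <|type|> ... <|/type|> markers,
# 'text' parts pass through, and plain strings are returned unchanged.
WRAPPED = {"tool_call", "tool_result", "python", "output", "think"}

def render_content(content):
    if isinstance(content, str):
        return content
    out = []
    for part in content:
        if part["type"] == "text":
            out.append(part["text"])
        elif part["type"] in WRAPPED:
            t = part["type"]
            out.append(f"<|{t}|>{part['text']}<|/{t}|>")
    return "".join(out)

parts = [
    {"type": "think", "text": "2 + 2 = 4"},
    {"type": "text", "text": "The answer is 4."},
]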
special_tokens_map.json ADDED
@@ -0,0 +1,164 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "pad_token": {
+     "content": "<|pad|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false,
+     "special": true
+   },
+   "additional_special_tokens": [
+     {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|/think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_0|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     {
+       "content": "<|reserved_3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   ]
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,199 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "49131": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49132": {
+       "content": "<|pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49133": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49134": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49135": {
+       "content": "<|tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49136": {
+       "content": "<|/tool_call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49137": {
+       "content": "<|tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49138": {
+       "content": "<|/tool_result|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49139": {
+       "content": "<|python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49140": {
+       "content": "<|/python|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49141": {
+       "content": "<|output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49142": {
+       "content": "<|/output|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49143": {
+       "content": "<|think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49144": {
+       "content": "<|/think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49145": {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49146": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49147": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49148": {
+       "content": "<|reserved_0|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49149": {
+       "content": "<|reserved_1|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49150": {
+       "content": "<|reserved_2|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49151": {
+       "content": "<|reserved_3|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|tool_call|>",
+     "<|/tool_call|>",
+     "<|tool_result|>",
+     "<|/tool_result|>",
+     "<|python|>",
+     "<|/python|>",
+     "<|output|>",
+     "<|/output|>",
+     "<|think|>",
+     "<|/think|>",
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>",
+     "<|reserved_0|>",
+     "<|reserved_1|>",
+     "<|reserved_2|>",
+     "<|reserved_3|>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "<|pad|>",
+   "unk_token": null,
+   "clean_up_tokenization_spaces": false,
+   "model_max_length": 131072,
+   "tokenizer_class": "PreTrainedTokenizerFast"
+ }