Morton-Li committed · Commit ffd1e26 · 1 Parent(s): 00bfa26

Version iteration
README.md CHANGED
@@ -79,6 +79,49 @@ batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
 print(batch_tokens["input_ids"])
 ```
 
+### 💬 Chat Template (`apply_chat_template`)
+
+For chat-style data, you can format a list of messages using `apply_chat_template`:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Medium", trust_remote_code=True)
+
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "你好,介绍一下 QiTianTokenizer。"},
+]
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False,
+)
+print(text)
+
+# If you need token ids directly:
+inputs = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    enable_thinking=False,
+    return_tensors="pt",
+)
+print(inputs["input_ids"])
+```
+
+**Parameters**
+
+- `add_generation_prompt`
+  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
+  - `False`: do not append a generation prompt (useful for evaluating full dialogues).
+
+- `enable_thinking`
+  - `True`: wrap the assistant part in a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses it.
+  - `False`: keep plain assistant content without the thinking wrapper.
+
 ---
 
 ## 📦 Files Included
@@ -87,25 +130,24 @@ print(batch_tokens["input_ids"])
 |---------------------------|------------------------------------------------|
 | `tokenizer.json`          | Serialized fast tokenizer definition           |
 | `tokenizer_config.json`   | Configuration (max length, padding side, etc.) |
-| `special_tokens_map.json` | Special token mappings                         |
 | `tokenizer.py`            | Tokenizer implementation                       |
 
 ---
 
 ## 🔍 Special Tokens
 
-| Token             | Example           | Purpose                                                                                                      |
-|-------------------|-------------------|--------------------------------------------------------------------------------------------------------------|
-| `<\|bos\|>`       | `<\|bos\|>`       | Beginning of sequence (BOS)                                                                                  |
-| `<\|eos\|>`       | `<\|eos\|>`       | End of sequence (EOS)                                                                                        |
-| `<\|pad\|>`       | `<\|pad\|>`       | Padding token for batch alignment                                                                            |
-| `<\|mask\|>`      | `<\|mask\|>`      | Masked token for MLM-style objectives                                                                        |
-| `<\|user\|>`      | `<\|user\|>`      | Marks user message boundary in conversational data                                                           |
-| `<\|assistant\|>` | `<\|assistant\|>` | Marks assistant message boundary                                                                             |
-| `<\|system\|>`    | `<\|system\|>`    | Defines system or meta-instruction context                                                                   |
-| `<\|think\|>`     | `<\|think\|>`     | Reasoning-phase delimiter — marks model’s internal reasoning or structured thinking segment during inference |
-
-> All tokens are integrated into the tokenizer vocabulary and appear in `additional_special_tokens`.
+| Token                  | Purpose                                             |
+|------------------------|-----------------------------------------------------|
+| `<\|bos\|>`            | Beginning of sequence                               |
+| `<\|eos\|>`            | End of sequence                                     |
+| `<\|eot\|>`            | End of turn (marks message boundary)                |
+| `<\|pad\|>`            | Padding token for batch alignment                   |
+| `<\|mask\|>`           | Masked token for MLM-style objectives               |
+| `<\|system\|>`         | Defines system or meta-instruction context          |
+| `<\|user\|>`           | Marks user message boundary in conversational data  |
+| `<\|assistant\|>`      | Marks assistant message boundary                    |
+| `<\|begin_of_think\|>` | Begin internal reasoning span                       |
+| `<\|end_of_think\|>`   | End internal reasoning span                         |
 
 ---
 
@@ -124,6 +166,6 @@ If you use **QiTianTokenizer** in your research or project, please cite it as:
 @misc{QiTianTokenizer,
   title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
   author = {Morton Li},
-  year   = {2025},
+  year   = {2026},
 }
 ```
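The two README flags combine into four prompt shapes. As a rough illustration, here is a hypothetical pure-Python mirror of the documented behaviour (`build_prompt` is an invented helper, not repository code; the real string is produced by `tokenizer.apply_chat_template`, and exact whitespace may differ):

```python
# Hypothetical sketch of how the chat template assembles a prompt.
def build_prompt(messages, add_generation_prompt=True, enable_thinking=False):
    parts = ["<|bos|>"]  # the template starts with the BOS token
    for m in messages:
        # each turn is rendered as <|role|>content<|eot|>
        parts.append(f"<|{m['role']}|>{m['content']}<|eot|>")
    if add_generation_prompt:
        # open an assistant turn for the model to complete
        tail = "<|assistant|>"
        if enable_thinking:
            # optionally open a reasoning span as well
            tail += "\n<|begin_of_think|>"
        parts.append(tail)
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(build_prompt(messages))
```

With `add_generation_prompt=False` the prompt ends at the last `<|eot|>`; with `enable_thinking=True` it ends with an opened `<|begin_of_think|>` span.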
chat_template.jinja CHANGED
@@ -1,8 +1,9 @@
-{% for message in messages %}<|{{ message['role'] }}|>:
-{{ bos_token }}{{ message['content'] }}{{ eos_token }}{% if not loop.last %}
-
-{% endif %}
-{% endfor %}{% if add_generation_prompt %}
-
-<|assistant|>:
-{{ bos_token }}{% endif %}
+{{ bos_token }}
+{% for message in messages -%}
+<|{{ message.role }}|>{{ message.content }}<|eot|>
+{%- if not loop.last -%}{{ '\n' }}{% endif %}
+{% endfor %}
+{% if add_generation_prompt -%}
+{{ '\n' }}<|assistant|>
+{%- if enable_thinking %}{{ '\n' }}<|begin_of_think|>{% endif %}
+{% endif %}
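The new template can be sanity-checked outside of `transformers` by rendering it directly with `jinja2` (a sketch under the assumption that only `bos_token`, `messages`, `add_generation_prompt`, and `enable_thinking` are passed; the real `apply_chat_template` path may supply additional variables):

```python
from jinja2 import Template

# The new chat template, copied verbatim from the diff above.
CHAT_TEMPLATE = (
    "{{ bos_token }}\n"
    "{% for message in messages -%}\n"
    "<|{{ message.role }}|>{{ message.content }}<|eot|>\n"
    "{%- if not loop.last -%}{{ '\\n' }}{% endif %}\n"
    "{% endfor %}\n"
    "{% if add_generation_prompt -%}\n"
    "{{ '\\n' }}<|assistant|>\n"
    "{%- if enable_thinking %}{{ '\\n' }}<|begin_of_think|>{% endif %}\n"
    "{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Render with the generation prompt and the thinking span both enabled.
out = Template(CHAT_TEMPLATE).render(
    bos_token="<|bos|>",
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(out)
```

Note the `-%}` / `{%-` markers: Jinja whitespace control keeps the tag lines themselves out of the rendered prompt, so only the explicit `{{ '\n' }}` separators and the literal newlines inside the loop body survive.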
special_tokens_map.json DELETED
@@ -1,36 +0,0 @@
-{
-  "additional_special_tokens": [
-    "<|user|>",
-    "<|assistant|>",
-    "<|think|>",
-    "<|system|>"
-  ],
-  "bos_token": {
-    "content": "<|bos|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "<|eos|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "<|mask|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<|pad|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
-}
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,86 +1,25 @@
 {
-  "added_tokens_decoder": {
-    "0": {
-      "content": "<|bos|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "<|eos|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "<|pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "3": {
-      "content": "<|mask|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "4": {
-      "content": "<|user|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "5": {
-      "content": "<|assistant|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "6": {
-      "content": "<|think|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "7": {
-      "content": "<|system|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "additional_special_tokens": [
+  "backend": "tokenizers",
+  "bos_token": "<|bos|>",
+  "eos_token": "<|eos|>",
+  "extra_special_tokens": [
+    "<|eot|>",
+    "<|system|>",
     "<|user|>",
     "<|assistant|>",
-    "<|think|>",
-    "<|system|>"
+    "<|begin_of_think|>",
+    "<|end_of_think|>",
+    "<|placeholder_0|>",
+    "<|placeholder_1|>",
+    "<|placeholder_2|>",
+    "<|placeholder_3|>",
+    "<|placeholder_4|>",
+    "<|placeholder_5|>",
+    "<|placeholder_6|>",
+    "<|placeholder_7|>",
+    "<|placeholder_8|>",
+    "<|placeholder_9|>"
   ],
-  "auto_map": {
-    "AutoTokenizer": [
-      null,
-      "tokenizer.QiTianTokenizerFast"
-    ]
-  },
-  "bos_token": "<|bos|>",
-  "clean_up_tokenization_spaces": false,
-  "eos_token": "<|eos|>",
-  "extra_special_tokens": {},
   "mask_token": "<|mask|>",
   "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|pad|>",