Version iteration

Changed files:
- README.md (+56 -14)
- chat_template.jinja (+8 -7)
- special_tokens_map.json (deleted, -36)
- tokenizer.json (+0 -0)
- tokenizer_config.json (+18 -79)
README.md
CHANGED

````diff
@@ -79,6 +79,49 @@ batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
 print(batch_tokens["input_ids"])
 ```
 
+### 💬 Chat Template (`apply_chat_template`)
+
+For chat-style data, you can format a list of messages using `apply_chat_template`:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Medium", trust_remote_code=True)
+
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "你好,介绍一下 QiTianTokenizer。"},
+]
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False,
+)
+print(text)
+
+# If you need token ids directly:
+inputs = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    enable_thinking=False,
+    return_tensors="pt",
+)
+print(inputs["input_ids"])
+```
+
+**Parameters**
+
+- `add_generation_prompt`
+  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
+  - `False`: do not append a generation prompt (useful for evaluating full dialogues).
+
+- `enable_thinking`
+  - `True`: wrap the assistant turn in a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses one.
+  - `False`: keep plain assistant content without the thinking wrapper.
+
 ---
 
 ## 📦 Files Included
@@ -87,25 +130,24 @@
 |---------------------------|------------------------------------------------|
 | `tokenizer.json`          | Serialized fast tokenizer definition           |
 | `tokenizer_config.json`   | Configuration (max length, padding side, etc.) |
-| `special_tokens_map.json` | Special token mappings                         |
 | `tokenizer.py`            | Tokenizer implementation                       |
 
 ---
 
 ## 🔍 Special Tokens
 
-| Token
-
-| `<\|bos\|>`
-| `<\|eos\|>`
-| `<\|
-| `<\|
-| `<\|
-| `<\|
-| `<\|
-| `<\|
-
-
+| Token                  | Purpose                                             |
+|------------------------|-----------------------------------------------------|
+| `<\|bos\|>`            | Beginning of sequence                               |
+| `<\|eos\|>`            | End of sequence                                     |
+| `<\|eot\|>`            | End of turn (marks message boundary)                |
+| `<\|pad\|>`            | Padding token for batch alignment                   |
+| `<\|mask\|>`           | Masked token for MLM-style objectives               |
+| `<\|system\|>`         | Defines system or meta-instruction context          |
+| `<\|user\|>`           | Marks user message boundary in conversational data  |
+| `<\|assistant\|>`      | Marks assistant message boundary                    |
+| `<\|begin_of_think\|>` | Begin internal reasoning span                       |
+| `<\|end_of_think\|>`   | End internal reasoning span                         |
 
 ---
 
@@ -124,6 +166,6 @@ If you use **QiTianTokenizer** in your research or project, please cite it as:
 @misc{QiTianTokenizer,
   title = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
   author = {Morton Li},
-  year = {
+  year = {2026},
 }
 ```
````
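The new special-tokens table pairs each token with its purpose, but after this commit the config no longer pins explicit ids. A quick way to check that every listed token is known to the tokenizer is the snippet below; it is an illustration added here, not part of the repo, and reuses the repo id and `trust_remote_code=True` flag from the README example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Morton-Li/QiTianTokenizer-Medium", trust_remote_code=True
)

# Every token from the "Special Tokens" table should resolve to a vocabulary id.
for tok in [
    "<|bos|>", "<|eos|>", "<|eot|>", "<|pad|>", "<|mask|>",
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|begin_of_think|>", "<|end_of_think|>",
]:
    print(f"{tok}\t-> id {tokenizer.convert_tokens_to_ids(tok)}")
```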
chat_template.jinja
CHANGED

````diff
@@ -1,8 +1,9 @@
-{
-{
-
+{{ bos_token }}
+{% for message in messages -%}
+<|{{ message.role }}|>{{ message.content }}<|eot|>
+{%- if not loop.last -%}{{ '\n' }}{% endif %}
+{% endfor %}
+{% if add_generation_prompt -%}
+{{ '\n' }}<|assistant|>
+{%- if enable_thinking %}{{ '\n' }}<|begin_of_think|>{% endif %}
 {% endif %}
-{% endfor %}{% if add_generation_prompt %}
-
-<|assistant|>:
-{{ bos_token }}{% endif %}
````
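For reference, hand-tracing the new template (assuming the `trim_blocks`/`lstrip_blocks` Jinja settings that `transformers` uses for chat templates), the two-message README example with `add_generation_prompt=True` and `enable_thinking=False` should render to roughly:

```
<|bos|>
<|system|>You are a helpful assistant.<|eot|>
<|user|>你好,介绍一下 QiTianTokenizer。<|eot|>
<|assistant|>
```

With `enable_thinking=True`, a trailing `<|begin_of_think|>` line would follow `<|assistant|>`. This trace is derived by hand from the template, not captured from the repo.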
special_tokens_map.json
DELETED

````diff
@@ -1,36 +0,0 @@
-{
-  "additional_special_tokens": [
-    "<|user|>",
-    "<|assistant|>",
-    "<|think|>",
-    "<|system|>"
-  ],
-  "bos_token": {
-    "content": "<|bos|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "<|eos|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "<|mask|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<|pad|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
-}
````
tokenizer.json
CHANGED

The diff for this file is too large to render; see the raw diff.
tokenizer_config.json
CHANGED

````diff
@@ -1,86 +1,25 @@
 {
-  "added_tokens_decoder": {
-    "0": {
-      "content": "<|bos|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "<|eos|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "<|pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "3": {
-      "content": "<|mask|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "4": {
-      "content": "<|user|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "5": {
-      "content": "<|assistant|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "6": {
-      "content": "<|think|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "7": {
-      "content": "<|system|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "additional_special_tokens": [
+  "backend": "tokenizers",
+  "bos_token": "<|bos|>",
+  "eos_token": "<|eos|>",
+  "extra_special_tokens": [
+    "<|eot|>",
+    "<|system|>",
     "<|user|>",
     "<|assistant|>",
-    "<|think|>",
-    "<|system|>"
+    "<|begin_of_think|>",
+    "<|end_of_think|>",
+    "<|placeholder_0|>",
+    "<|placeholder_1|>",
+    "<|placeholder_2|>",
+    "<|placeholder_3|>",
+    "<|placeholder_4|>",
+    "<|placeholder_5|>",
+    "<|placeholder_6|>",
+    "<|placeholder_7|>",
+    "<|placeholder_8|>",
+    "<|placeholder_9|>"
   ],
-  "auto_map": {
-    "AutoTokenizer": [
-      null,
-      "tokenizer.QiTianTokenizerFast"
-    ]
-  },
-  "bos_token": "<|bos|>",
-  "clean_up_tokenization_spaces": false,
-  "eos_token": "<|eos|>",
-  "extra_special_tokens": {},
   "mask_token": "<|mask|>",
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<|pad|>",
````
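Note the removed `auto_map` entry pointed at `tokenizer.QiTianTokenizerFast`, and the new config declares `"backend": "tokenizers"`. That suggests, though the diff alone does not prove it, that the tokenizer can now load from the serialized `tokenizer.json` without executing `tokenizer.py`. A minimal check under that assumption (the README example still passes `trust_remote_code=True`, which should remain harmless either way):

```python
from transformers import AutoTokenizer

# Assumption: with "auto_map" removed and "backend": "tokenizers" declared,
# loading no longer requires trust_remote_code to run tokenizer.py from the repo.
tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Medium")
print(type(tokenizer).__name__)  # expect a fast (tokenizers-backed) class
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)
```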