Morton-Li committed
Commit 108175f · 1 Parent(s): 180d3b9

initial commit

Files changed (6)
  1. README.md +141 -0
  2. tokenizer.json +0 -0
  3. tokenizer.model +3 -0
  4. tokenizer.py +12 -0
  5. tokenizer.vocab +0 -0
  6. tokenizer_config.json +23 -0
README.md ADDED
@@ -0,0 +1,141 @@
+ ---
+ language:
+ - zh
+ - en
+ - multilingual
+ tags:
+ - tokenizer
+ - bilingual
+ - chinese
+ - english
+ - multilingual
+ license: apache-2.0
+ ---
+ # QiYuanTokenizer-Base
+
+ **QiYuanTokenizer** is a *universal multilingual tokenizer* primarily optimized for **Chinese–English mixed text**,
+ offering compact and efficient tokenization across diverse languages and scripts.
+ It is designed as a **general-purpose tokenizer**, not tied to any specific model family,
+ and is especially suitable for **encoder** and **encoder-decoder** architectures.
+
+ ---
+
+ ## ✨ Overview
+
+ | Property | Value |
+ |-------------------------|-------------------------------------|
+ | **Name** | QiYuanTokenizer-Base |
+ | **Type** | Tokenizer-only repository |
+ | **Purpose** | General multilingual tokenization |
+ | **Primary Languages** | Chinese, English |
+ | **Extended Support** | Multilingual (Unicode-complete) |
+ | **Architecture** | Unigram |
+ | **Vocabulary Size** | 32,000 tokens |
+ | **Fast Implementation** | ✅ Available (`QiYuanTokenizerFast`) |
+ | **Framework** | 🤗 `transformers` |
+ | **License** | Apache 2.0 |
+
+ ---
+
+ ## 🧩 QiYuan Tokenizer Series
+
+ | Variant | Vocabulary Size | Description | Recommended Use |
+ |---------|-----------------|-------------|-----------------|
+ | [**QiYuanTokenizer-Tiny**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Tiny) | 12k | Extremely compact vocabulary for highly constrained settings. Efficient, but may become limiting for more demanding multilingual scenarios. | **Use with caution** |
+ | [**QiYuanTokenizer-Small**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Small) | 24k | A lightweight tokenizer with improved coverage over Tiny while still keeping vocabulary size modest. | Compact models and efficiency-oriented experiments |
+ | [**QiYuanTokenizer-Base**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Base) | 32k | A balanced baseline vocabulary suitable for general bilingual and multilingual tokenization tasks. | **Recommended for general use** |
+ | [**QiYuanTokenizer-Medium**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Medium) | 48k | The best-balanced variant in the series, providing strong coverage and good compression while keeping model complexity reasonable. | **Recommended balanced choice** |
+ | [**QiYuanTokenizer-Large**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Large) | 64k | A larger vocabulary designed for quality-oriented training, offering better coverage and stronger tokenization fidelity. | **Recommended when quality is prioritized** |
+
+ > All variants share the same core token definitions and compatible special-token settings.
+
+ ---
+
+ ## ⚙️ Usage
+
+ You can load this tokenizer directly with `AutoTokenizer`:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Base", trust_remote_code=True)
+
+ # Example
+ text = "你好,QiYuan!"
+ tokens = tokenizer(text)
+ print(tokens["input_ids"])
+ ```
+
+ ### ➕ Batch Example
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Base", trust_remote_code=True)
+
+ # Example
+ texts = [
+     "Hello, 世界!",
+     "QiYuanTokenizer is designed for multilingual tokenization."
+ ]
+ batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
+ print(batch_tokens["input_ids"])
+ ```
+
+ ---
+
+ ## 🧠 Design Notes
+
+ QiYuanTokenizer adopts the **Unigram** algorithm and is intended as a practical tokenizer for **general text understanding and sequence transformation tasks**.
+
+ In practice, it is generally more suitable for:
+
+ * **Encoder models**, for tasks such as text classification, embedding, retrieval, and sequence labeling
+ * **Encoder-decoder models**, for tasks such as translation, summarization, and text transformation
+
+ It can still be used in broader settings, but its design is not primarily oriented toward chat-format tokenization or decoder-only conversational templates.
+
+ ---
+
+ ## 📦 Files Included
+
+ | File | Description |
+ |-------------------------|------------------------------------------------------------------|
+ | `tokenizer.json` | Serialized fast tokenizer definition |
+ | `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
+ | `tokenizer.py` | Tokenizer implementation |
+ | `tokenizer.model` | SentencePiece model file trained with the Unigram algorithm |
+ | `tokenizer.vocab` | SentencePiece vocabulary file corresponding to `tokenizer.model` |
+
+ ---
+
+ ## 🔍 Special Tokens
+
+ | Token | Purpose |
+ |--------------|---------------------------------------|
+ | `<\|unk\|>` | Unknown token |
+ | `<\|bos\|>` | Beginning of sequence |
+ | `<\|eos\|>` | End of sequence |
+ | `<\|pad\|>` | Padding token for batch alignment |
+ | `<\|mask\|>` | Masked token for MLM-style objectives |
+
+ ---
+
+ ## 🔖 License
+
+ This tokenizer and vocabulary are released under the **Apache License 2.0**.
+ You are free to use, modify, and redistribute them under the same license terms.
+
+ ---
+
+ ## 📚 Citation
+
+ If you use **QiYuanTokenizer** in your research or project, please cite it as:
+
+ ```bibtex
+ @misc{QiYuanTokenizer,
+     title = {QiYuanTokenizer: A Universal Multilingual Unigram Tokenizer with Chinese-English Optimization},
+     author = {Morton Li},
+     year = {2026},
+ }
+ ```
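The README's Design Notes describe the tokenizer as Unigram-based. As a toy sketch of that idea (the vocabulary and probabilities below are made up for illustration and are not the actual QiYuanTokenizer model), a Unigram tokenizer picks the segmentation whose tokens have the highest total probability, found with a Viterbi search:

```python
import math

# Toy unigram vocabulary with made-up piece probabilities --
# NOT the real QiYuanTokenizer vocabulary, illustration only.
vocab = {"hello": 0.05, "hell": 0.01, "he": 0.02, "llo": 0.01,
         "h": 0.005, "e": 0.02, "l": 0.02, "o": 0.02}

def unigram_segment(text: str) -> list[str]:
    """Viterbi search for the highest log-probability segmentation."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Backtrack from the end to recover the token boundaries.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

print(unigram_segment("hello"))  # -> ['hello']
```

A real SentencePiece Unigram model works the same way at heart, just over tens of thousands of learned pieces with trained log-probabilities.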
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec688de3235a6a429f5ac7aec900386a0bd7c8f7faa9ab3e6a7294d325c3459b
+ size 747702
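`tokenizer.model` is checked in as a Git LFS pointer rather than raw bytes; the three lines above follow the `git-lfs` spec/v1 pointer format (`version`, `oid`, `size`), and the real file is fetched from LFS storage on clone or download. As a small sketch, the pointer can be parsed with plain string handling:

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse a git-lfs spec/v1 pointer file into a key/value dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by one space.
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ec688de3235a6a429f5ac7aec900386a0bd7c8f7faa9ab3e6a7294d325c3459b
size 747702"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # -> 747702 (size in bytes of the actual model file)
```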
tokenizer.py ADDED
@@ -0,0 +1,12 @@
+ from transformers import PreTrainedTokenizerFast
+
+ class QiYuanTokenizerFast(PreTrainedTokenizerFast):
+     """ QiYuanTokenizerFast """
+     model_input_names: list[str] = ["input_ids", "attention_mask"]
+     SPECIAL_TOKENS_ATTRIBUTES = [
+         "bos_token",
+         "eos_token",
+         "unk_token",
+         "pad_token",
+         "mask_token",
+     ]
tokenizer.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+     "backend": "tokenizers",
+     "bos_token": "<|bos|>",
+     "clean_up_tokenization_spaces": false,
+     "eos_token": "<|eos|>",
+     "extra_special_tokens": [
+         "<|placeholder_0|>",
+         "<|placeholder_1|>",
+         "<|placeholder_2|>",
+         "<|placeholder_3|>",
+         "<|placeholder_4|>",
+         "<|placeholder_5|>",
+         "<|placeholder_6|>",
+         "<|placeholder_7|>",
+         "<|placeholder_8|>",
+         "<|placeholder_9|>"
+     ],
+     "mask_token": "<|mask|>",
+     "model_max_length": 1000000000000000019884624838656,
+     "pad_token": "<|pad|>",
+     "tokenizer_class": "QiYuanTokenizer",
+     "unk_token": "<|unk|>"
+ }
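As a quick sanity check (a standalone sketch, not a file in this repository), the config above can be loaded and validated with the standard `json` module. Note that `model_max_length` is the `int(1e30)` sentinel `transformers` uses for "effectively unlimited":

```python
import json

# The contents of tokenizer_config.json from this commit.
config_text = '''{
  "backend": "tokenizers",
  "bos_token": "<|bos|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|eos|>",
  "extra_special_tokens": ["<|placeholder_0|>", "<|placeholder_1|>",
                           "<|placeholder_2|>", "<|placeholder_3|>",
                           "<|placeholder_4|>", "<|placeholder_5|>",
                           "<|placeholder_6|>", "<|placeholder_7|>",
                           "<|placeholder_8|>", "<|placeholder_9|>"],
  "mask_token": "<|mask|>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|pad|>",
  "tokenizer_class": "QiYuanTokenizer",
  "unk_token": "<|unk|>"
}'''

config = json.loads(config_text)

# Every named special token uses the <|...|> delimiter convention.
named = [config[k] for k in ("bos_token", "eos_token", "pad_token",
                             "unk_token", "mask_token")]
assert all(t.startswith("<|") and t.endswith("|>") for t in named)

# Ten reserved placeholder slots are available for future special tokens.
assert len(config["extra_special_tokens"]) == 10

print(config["tokenizer_class"])  # -> QiYuanTokenizer
```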