initial commit
- README.md +141 -0
- tokenizer.json +0 -0
- tokenizer.model +3 -0
- tokenizer.py +12 -0
- tokenizer.vocab +0 -0
- tokenizer_config.json +23 -0
README.md
ADDED
---
languages:
- zh
- en
- multilingual
tags:
- tokenizer
- bilingual
- chinese
- english
- multilingual
license: apache-2.0
---
# QiYuanTokenizer-Base

**QiYuanTokenizer** is a *universal multilingual tokenizer* primarily optimized for **Chinese–English mixed text**, offering compact and efficient tokenization across diverse languages and scripts. It is designed as a **general-purpose tokenizer**, not tied to any specific model family, and is especially suitable for **encoder** and **encoder-decoder** architectures.

---

## ✨ Overview

| Property                | Value                               |
|-------------------------|-------------------------------------|
| **Name**                | QiYuanTokenizer-Base                |
| **Type**                | Tokenizer-only repository           |
| **Purpose**             | General multilingual tokenization   |
| **Primary Languages**   | Chinese, English                    |
| **Extended Support**    | Multilingual (Unicode-complete)     |
| **Architecture**        | Unigram                             |
| **Vocabulary Size**     | 32,000 tokens                       |
| **Fast Implementation** | ✅ Available (`QiYuanTokenizerFast`) |
| **Framework**           | 🤗 `transformers`                   |
| **License**             | Apache 2.0                          |

---

## 🧩 QiYuan Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---------|-----------------|-------------|-----------------|
| [**QiYuanTokenizer-Tiny**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Tiny) | 12k | Extremely compact vocabulary for highly constrained settings. Efficient, but may become limiting in more demanding multilingual scenarios. | **Use with caution** |
| [**QiYuanTokenizer-Small**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Small) | 24k | A lightweight tokenizer with better coverage than Tiny while keeping the vocabulary modest. | Compact models and efficiency-oriented experiments |
| [**QiYuanTokenizer-Base**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Base) | 32k | A balanced baseline vocabulary suitable for general bilingual and multilingual tokenization tasks. | **Recommended for general use** |
| [**QiYuanTokenizer-Medium**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Medium) | 48k | The best-balanced variant in the series, providing strong coverage and good compression while keeping model complexity reasonable. | **Recommended balanced choice** |
| [**QiYuanTokenizer-Large**](https://huggingface.co/Morton-Li/QiYuanTokenizer-Large) | 64k | A larger vocabulary designed for quality-oriented training, offering better coverage and stronger tokenization fidelity. | **Recommended when quality is prioritized** |

> All variants share the same core token definitions and compatible special token settings.

---

## ⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Base", trust_remote_code=True)

# Example
text = "你好,QiYuan!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

### ➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Base", trust_remote_code=True)

# Example
texts = [
    "Hello, 世界!",
    "QiYuanTokenizer is designed for multilingual tokenization.",
]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
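With `padding=True`, shorter sequences are right-padded to the longest one in the batch, and an `attention_mask` marks which positions hold real tokens. A minimal sketch of that mechanic, using invented token ids (the actual pad id comes from `tokenizer.pad_token_id`):

```python
# Hedged illustration of what padding=True produces; the token ids
# below are invented and do not come from QiYuanTokenizer.
pad_id = 0  # assumption: the real value is tokenizer.pad_token_id

batch = [[5, 9, 2], [7, 2]]
max_len = max(len(seq) for seq in batch)

# Right-pad every sequence; mark real tokens with 1 and padding with 0.
input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # [[5, 9, 2], [7, 2, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```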

---

## 🧠 Design Notes

QiYuanTokenizer adopts the **Unigram** algorithm and is intended as a practical tokenizer for **general text understanding and sequence transformation tasks**.

In practice, it is generally more suitable for:

* **Encoder models**, such as text classification, embedding, retrieval, and sequence labeling
* **Encoder-decoder models**, such as translation, summarization, and text transformation

It can still be used in broader settings, but its design is not primarily oriented toward chat-format tokenization or decoder-only conversational templates.
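For intuition, a Unigram model assigns each vocabulary piece a probability and picks the segmentation maximizing the product of piece probabilities, typically via Viterbi search. A self-contained sketch of that search over a toy vocabulary (the pieces and probabilities here are invented for illustration; they are not QiYuan's):

```python
import math

# Toy Unigram vocabulary: piece -> probability (invented values).
vocab = {"你": 0.05, "好": 0.05, "你好": 0.2, "世": 0.02, "界": 0.02, "世界": 0.1}

def unigram_segment(text: str, vocab: dict[str, float]) -> list[str]:
    """Viterbi search for the most probable segmentation of `text`."""
    n = len(text)
    # best[i] = (negative log-prob of best segmentation of text[:i], backpointer)
    best = [(0.0, 0)] + [(math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):  # cap piece length for efficiency
            piece = text[j:i]
            if piece in vocab:
                score = best[j][0] - math.log(vocab[piece])
                if score < best[i][0]:
                    best[i] = (score, j)
    # Walk the backpointers to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

print(unigram_segment("你好世界", vocab))  # ['你好', '世界']
```

Note that "你好" wins over "你" + "好" because 0.2 > 0.05 × 0.05; real Unigram tokenizers additionally handle out-of-vocabulary characters (e.g. via `<|unk|>` or character fallback), which this sketch omits.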

---

## 📦 Files Included

| File                    | Description                                                      |
|-------------------------|------------------------------------------------------------------|
| `tokenizer.json`        | Serialized fast tokenizer definition                             |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.)                   |
| `tokenizer.py`          | Tokenizer implementation                                         |
| `tokenizer.model`       | SentencePiece model file trained with the Unigram algorithm      |
| `tokenizer.vocab`       | SentencePiece vocabulary file corresponding to `tokenizer.model` |

---

## 🔍 Special Tokens

| Token        | Purpose                               |
|--------------|---------------------------------------|
| `<\|unk\|>`  | Unknown token                         |
| `<\|bos\|>`  | Beginning of sequence                 |
| `<\|eos\|>`  | End of sequence                       |
| `<\|pad\|>`  | Padding token for batch alignment     |
| `<\|mask\|>` | Masked token for MLM-style objectives |
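At the string level these behave like ordinary pieces of text; a hedged, string-level sketch of BOS/EOS wrapping and MLM-style masking (real pipelines operate on token ids and sample mask positions randomly):

```python
# Illustration only: the token strings match this tokenizer, but the
# surrounding pieces are invented and masking here is done by hand.
bos, eos, mask = "<|bos|>", "<|eos|>", "<|mask|>"

pieces = ["你好", ",", "Qi", "Yuan", "!"]
wrapped = [bos] + pieces + [eos]  # mark sequence boundaries
masked = wrapped.copy()
masked[3] = mask                  # hide one position for an MLM objective

print(masked)
```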

---

## 🔖 License

This tokenizer and vocabulary are released under the **Apache License 2.0**.
You are free to use, modify, and redistribute it under the same license terms.

---

## 📚 Citation

If you use **QiYuanTokenizer** in your research or project, please cite it as:

```bibtex
@misc{QiYuanTokenizer,
  title  = {QiYuanTokenizer: A Universal Multilingual Unigram Tokenizer with Chinese-English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```
tokenizer.json
ADDED

The diff for this file is too large to render.

tokenizer.model
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:ec688de3235a6a429f5ac7aec900386a0bd7c8f7faa9ab3e6a7294d325c3459b
size 747702

(This is a Git LFS pointer; the SentencePiece model binary itself, 747,702 bytes, is stored via LFS.)
tokenizer.py
ADDED

from transformers import PreTrainedTokenizerFast

class QiYuanTokenizerFast(PreTrainedTokenizerFast):
    """ QiYuanTokenizerFast """
    model_input_names: list[str] = ["input_ids", "attention_mask"]
    SPECIAL_TOKENS_ATTRIBUTES = [
        "bos_token",
        "eos_token",
        "unk_token",
        "pad_token",
        "mask_token",
    ]
tokenizer.vocab
ADDED

The diff for this file is too large to render.
tokenizer_config.json
ADDED

{
    "backend": "tokenizers",
    "bos_token": "<|bos|>",
    "clean_up_tokenization_spaces": false,
    "eos_token": "<|eos|>",
    "extra_special_tokens": [
        "<|placeholder_0|>",
        "<|placeholder_1|>",
        "<|placeholder_2|>",
        "<|placeholder_3|>",
        "<|placeholder_4|>",
        "<|placeholder_5|>",
        "<|placeholder_6|>",
        "<|placeholder_7|>",
        "<|placeholder_8|>",
        "<|placeholder_9|>"
    ],
    "mask_token": "<|mask|>",
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": "<|pad|>",
    "tokenizer_class": "QiYuanTokenizer",
    "unk_token": "<|unk|>"
}
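The conspicuous `model_max_length` value is not arbitrary: it is `int(1e30)` after floating-point rounding, which `transformers` uses as a "no limit configured" sentinel (`VERY_LARGE_INTEGER` in `transformers.tokenization_utils_base`):

```python
# 1e30 is not exactly representable as a double; the nearest value,
# converted to int, yields the odd trailing digits seen in the config.
sentinel = int(1e30)
print(sentinel)  # 1000000000000000019884624838656
```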