QiYuanTokenizer-Small

QiYuanTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering compact and efficient tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer, not tied to any specific model family,
and is especially suitable for encoder and encoder-decoder architectures.


✨ Overview

| Property | Value |
|---|---|
| Name | QiYuanTokenizer-Small |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Unigram |
| Vocabulary Size | 24,000 tokens |
| Fast Implementation | ✅ Available (`QiYuanTokenizerFast`) |
| Framework | 🤗 transformers |
| License | Apache 2.0 |

🧩 QiYuan Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---|---|---|---|
| QiYuanTokenizer-Tiny | 12k | Extremely compact vocabulary for highly constrained settings. Efficient, but may become limiting in more demanding multilingual scenarios. | Use with caution |
| QiYuanTokenizer-Small | 24k | A lightweight tokenizer with improved coverage over Tiny while keeping the vocabulary modest. | Compact models and efficiency-oriented experiments |
| QiYuanTokenizer-Base | 32k | A balanced baseline vocabulary suitable for general bilingual and multilingual tokenization tasks. | Recommended for general use |
| QiYuanTokenizer-Medium | 48k | The best-balanced variant in the series, providing strong coverage and good compression while keeping model complexity reasonable. | Recommended balanced choice |
| QiYuanTokenizer-Large | 64k | A larger vocabulary designed for quality-oriented training, offering better coverage and stronger tokenization fidelity. | Recommended when quality is prioritized |

All variants share the same core token definitions and compatible special token settings.


⚙️ Usage

You can load this tokenizer directly with AutoTokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Small", trust_remote_code=True)

# Encode a Chinese–English mixed string
text = "你好,QiYuan!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiYuanTokenizer-Small", trust_remote_code=True)

# Encode a batch with padding, returning PyTorch tensors
texts = [
    "Hello, 世界!",
    "QiYuanTokenizer is designed for multilingual tokenization."
]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```

🧠 Design Notes

QiYuanTokenizer adopts the Unigram algorithm and is intended as a practical tokenizer for general text understanding and sequence transformation tasks.
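To illustrate how a Unigram tokenizer chooses a segmentation, the sketch below runs a Viterbi search over a toy vocabulary with hand-picked log-probabilities. The vocabulary and scores are invented for illustration only; QiYuanTokenizer's actual pieces and probabilities are learned by SentencePiece and stored in `tokenizer.model`.

```python
import math

# Toy Unigram vocabulary: token -> log-probability (illustrative values only;
# a real SentencePiece Unigram model learns these from data).
vocab = {
    "Qi": math.log(0.05),
    "Yuan": math.log(0.05),
    "QiYuan": math.log(0.08),
    "Token": math.log(0.04),
    "izer": math.log(0.03),
    "Tokenizer": math.log(0.06),
}

def unigram_segment(text: str) -> list[str]:
    """Viterbi search: pick the segmentation with the highest total log-prob."""
    n = len(text)
    best = [(-math.inf, -1)] * (n + 1)  # (best score, start of last token)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] + vocab[piece] > best[end][0]:
                best[end] = (best[start][0] + vocab[piece], start)
    # Backtrack from the end of the string to recover the token sequence.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

print(unigram_segment("QiYuanTokenizer"))  # → ['QiYuan', 'Tokenizer']
```

Note how the single piece `QiYuan` wins over `Qi` + `Yuan` because its log-probability exceeds the sum of the two shorter pieces; this preference for probable multi-character pieces is what gives Unigram tokenizers their compression.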

In practice, it is generally more suitable for:

  • Encoder models, for tasks such as text classification, embedding, retrieval, and sequence labeling
  • Encoder-decoder models, for tasks such as translation, summarization, and text transformation

It can still be used in broader settings, but its design is not primarily oriented toward chat-format tokenization or decoder-only conversational templates.


📦 Files Included

| File | Description |
|---|---|
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `tokenizer.py` | Tokenizer implementation |
| `tokenizer.model` | SentencePiece model file trained with the Unigram algorithm |
| `tokenizer.vocab` | SentencePiece vocabulary file corresponding to `tokenizer.model` |

🔍 Special Tokens

| Token | Purpose |
|---|---|
| `<\|unk\|>` | Unknown token |
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|pad\|>` | Padding token for batch alignment |
| `<\|mask\|>` | Masked token for MLM-style objectives |
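As a sketch of how the mask token is used in an MLM-style objective, the snippet below replaces tokens in an id sequence while leaving special tokens untouched. The ids are placeholders invented for illustration; the real ids come from the tokenizer's configuration (e.g. `tokenizer.mask_token_id`), and real MLM training masks a random subset (typically ~15%) rather than every k-th token.

```python
# Hypothetical special-token ids for illustration only; the actual ids are
# defined by the tokenizer's configuration.
BOS, EOS, MASK = 1, 2, 4

def mask_every_kth(input_ids: list[int], k: int = 3) -> list[int]:
    """Replace every k-th non-special token with the mask id
    (a deterministic stand-in for the usual random masking)."""
    out, count = [], 0
    for tid in input_ids:
        if tid in (BOS, EOS):
            out.append(tid)  # never mask special tokens
            continue
        count += 1
        out.append(MASK if count % k == 0 else tid)
    return out

ids = [BOS, 101, 102, 103, 104, 105, 106, EOS]
print(mask_every_kth(ids))  # → [1, 101, 102, 4, 104, 105, 4, 2]
```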

🔖 License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute it under the same license terms.


📚 Citation

If you use QiYuanTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiYuanTokenizer,
  title  = {QiYuanTokenizer: A Universal Multilingual Unigram Tokenizer with Chinese-English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```