QiTianTokenizer-XLarge

QiTianTokenizer is a universal multilingual tokenizer primarily optimized for Chinese–English mixed text,
offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a general-purpose tokenizer, not tied to any specific model,
and fully compatible with the 🤗 Transformers ecosystem.


✨ Overview

| Property | Value |
|----------|-------|
| Name | QiTianTokenizer-XLarge |
| Type | Tokenizer-only repository |
| Purpose | General multilingual tokenization |
| Primary Languages | Chinese, English |
| Extended Support | Multilingual (Unicode-complete) |
| Architecture | Byte-level BPE |
| Vocabulary Size | 128,000 tokens |
| Fast Implementation | ✅ Available (`QiTianTokenizerFast`) |
| Framework | 🤗 transformers |
| License | Apache 2.0 |
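Because the vocabulary is byte-level, its base alphabet is the 256 possible byte values, which is what makes it Unicode-complete and reversible: any string decomposes into known symbols and round-trips exactly. A minimal sketch of that base layer (not QiTian's actual implementation, which additionally applies BPE merges on top):

```python
# Sketch of the byte-level base layer behind byte-level BPE: every string is
# first decomposed into UTF-8 bytes, so no input is ever out-of-vocabulary.

def to_byte_tokens(text: str) -> list[int]:
    """Decompose text into UTF-8 byte IDs -- the base units of byte-level BPE."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids: list[int]) -> str:
    """Reassemble byte IDs into the original string (lossless)."""
    return bytes(ids).decode("utf-8")

mixed = "你好, QiTian! 🌍"
ids = to_byte_tokens(mixed)
assert all(0 <= i < 256 for i in ids)   # only base-alphabet symbols
assert from_byte_tokens(ids) == mixed   # fully reversible
```

BPE merges then fuse frequent byte sequences into single tokens for compression; reversibility is preserved because every merged token still expands back to its constituent bytes.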

🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---------|-----------------|-------------|-----------------|
| QiTianTokenizer-Tiny | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| QiTianTokenizer-Base | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | Recommended for general use |
| QiTianTokenizer-Medium | 64k | Broad enough to capture fine-grained linguistic diversity while keeping model complexity reasonable. | Recommended for multilingual and high-quality general-purpose models |
| QiTianTokenizer-Large | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| QiTianTokenizer-XLarge | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

All variants share consistent token definitions, special tokens, and compatible configurations.


⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Example
text = "你好,QiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

# Example
texts = ["Hello, 世界!", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```
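With `padding=True`, shorter sequences are padded with the `<|pad|>` token up to the batch maximum, and the returned `attention_mask` marks real tokens (1) versus padding (0). A minimal sketch of that mechanic, using hypothetical token IDs and a pad id of 0 for illustration (the real pad id comes from the tokenizer, and the configured padding side is set in `tokenizer_config.json`):

```python
# Sketch of what padding=True produces under the hood (right padding shown).

def pad_batch(batch: list[list[int]], pad_id: int = 0):
    """Pad every sequence to the batch maximum and build the attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[11, 12, 13, 14], [21, 22]])
print(ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```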

💬 Chat Template (apply_chat_template)

For chat-style data, you can format a list of messages using apply_chat_template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-XLarge", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好,介绍一下 QiTianTokenizer。"},  # "Hello, please introduce QiTianTokenizer."
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(text)

# If you need token ids directly, request a dict so "input_ids" can be indexed:
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_dict=True,
    return_tensors="pt",
)
print(inputs["input_ids"])
```

Parameters

- `add_generation_prompt`
  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
  - `False`: do not append a generation prompt (useful when evaluating complete dialogues).
- `enable_thinking`
  - `True`: wrap the assistant part in a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference setup uses one.
  - `False`: keep plain assistant content without the thinking wrapper.
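To make the effect of `add_generation_prompt` concrete, here is an illustrative approximation of how a chat template might render messages using this repository's special tokens. The authoritative template is defined in `tokenizer_config.json`; the exact layout below (newlines, token order) is an assumption, not the tokenizer's actual output:

```python
# Hypothetical renderer showing what add_generation_prompt changes.
# The real template lives in tokenizer_config.json and may differ.

def render(messages: list[dict], add_generation_prompt: bool = False) -> str:
    out = ""
    for m in messages:
        out += f"<|{m['role']}|>{m['content']}<|eot|>"
    if add_generation_prompt:
        out += "<|assistant|>"  # open an assistant turn for the model to fill
    return out

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]
print(render(msgs, add_generation_prompt=True))
# <|system|>You are a helpful assistant.<|eot|><|user|>Hi<|eot|><|assistant|>
```

With `add_generation_prompt=False`, the string ends after the final `<|eot|>`, which is what you want when scoring or evaluating an already-complete dialogue.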

📦 Files Included

| File | Description |
|------|-------------|
| `tokenizer.json` | Serialized fast tokenizer definition |
| `tokenizer_config.json` | Configuration (max length, padding side, etc.) |
| `tokenizer.py` | Tokenizer implementation |

🔍 Special Tokens

| Token | Purpose |
|-------|---------|
| `<\|bos\|>` | Beginning of sequence |
| `<\|eos\|>` | End of sequence |
| `<\|eot\|>` | End of turn (marks message boundary) |
| `<\|pad\|>` | Padding token for batch alignment |
| `<\|mask\|>` | Masked token for MLM-style objectives |
| `<\|system\|>` | Defines system or meta-instruction context |
| `<\|user\|>` | Marks user message boundary in conversational data |
| `<\|assistant\|>` | Marks assistant message boundary |
| `<\|begin_of_think\|>` | Begin internal reasoning span |
| `<\|end_of_think\|>` | End internal reasoning span |
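Because `<|eot|>` closes every message, a rendered conversation string can be split back into individual turns. A small sketch (the token names come from the table above; the rendered layout is illustrative, not the tokenizer's exact template):

```python
# Recover individual turns from a rendered conversation by splitting on the
# end-of-turn marker. The conversation string here is a hypothetical example.

rendered = "<|user|>你好<|eot|><|assistant|>Hello!<|eot|>"
turns = [t for t in rendered.split("<|eot|>") if t]
print(turns)  # ['<|user|>你好', '<|assistant|>Hello!']
```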

🔖 License

This tokenizer and vocabulary are released under the Apache License 2.0. You are free to use, modify, and redistribute it under the same license terms.


📚 Citation

If you use QiTianTokenizer in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```