# Qwen3-32B-KoTokenizer

Qwen3-32B with 3,682 Korean colloquial tokens added to the vocabulary.

Qwen3-32B's BPE tokenizer over-segments common Korean endings (어미) and particles (조사) into 2-4 sub-tokens. This model adds them as single tokens and is trained via QLoRA so the model uses them natively during generation.
## What was done

| Example | Before | After |
|---|---|---|
| 하셔서 | 하 + 셔 + 서 (3 tokens) | 하셔서 (1 token) |
| 봤는데 | 봤 + 는데 (2 tokens) | 봤는데 (1 token) |
| 죄송하지만 | 죄 + 송 + 하지만 (3 tokens) | 죄송하지만 (1 token) |
| Vocab size | 151,669 | 155,351 (+3,682) |
The 3,682 tokens were extracted from HyperCLOVA's Korean-optimized vocabulary, specifically endings (어미) and particles (조사) that Qwen's BPE consistently fragments.
## Training

- Method: QLoRA (r=64, alpha=128) on a Colab A100
- Key technique: old-embedding freeze. A gradient hook zeros out the gradients of the original 151K token embeddings, forcing the optimizer to update only the 3,682 new token rows
- Data: ~77K Korean samples filtered for high new-token density (≥5 target tokens per sample), sourced from KoAlpaca, alpaca-gpt4-korean, and KULLM-v2
- Epochs: 1 (with high-density data and the embedding freeze, convergence is fast)
- New token initialization: mean pooling of the constituent sub-token embeddings
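The initialization and freeze steps above can be sketched together on a toy embedding table. This is a minimal illustration, not the actual training script: the vocab sizes and sub-token groupings are made up, but the mechanics (mean-pool new rows, then mask gradients of old rows with a hook) follow the description above.

```python
import torch
import torch.nn as nn

OLD_VOCAB, NEW_TOKENS, DIM = 10, 3, 4  # toy stand-ins for 151,669 / 3,682 / hidden dim
emb = nn.Embedding(OLD_VOCAB + NEW_TOKENS, DIM)

# 1) Mean-pooling init: each new token row starts at the mean of its
#    constituent sub-token embeddings (groupings here are hypothetical).
with torch.no_grad():
    sub_ids = [[1, 2, 3], [4, 5], [6, 7, 8]]
    for i, ids in enumerate(sub_ids):
        emb.weight[OLD_VOCAB + i] = emb.weight[ids].mean(dim=0)

# 2) Old-embedding freeze: a gradient hook zeros the gradient of the
#    original rows, so the optimizer only ever updates the new rows.
def freeze_old_rows(grad):
    grad = grad.clone()
    grad[:OLD_VOCAB] = 0
    return grad

emb.weight.register_hook(freeze_old_rows)

# One dummy step touching both old and new token IDs
out = emb(torch.tensor([0, 1, OLD_VOCAB, OLD_VOCAB + 2]))
out.sum().backward()

print(emb.weight.grad[:OLD_VOCAB].abs().sum().item())  # 0.0 -> old rows frozen
```

An equivalent effect can be had by splitting the embedding into two parameters, but the hook keeps the model graph untouched, which matters when merging LoRA weights back.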
### Training curve
| Step | Loss | Accuracy |
|---|---|---|
| 50 | 1.649 | 65.6% |
| 500 | 1.186 | 70.9% |
| 1000 | 1.140 | 71.8% |
## Results
New token adoption rate: 92.9%. When the model generates text containing a string that matches a new token, it emits the single new token ID 92.9% of the time rather than falling back to the old fragmented sub-tokens.
| Prompt | Adoption | New tokens used |
|---|---|---|
| 어제 친구를 만났는데 걔가 갑자기... | 4/4 = 100% | 났지만, 달라고, 하지만, 같습니다 |
| 솔직히 그건 좀 아닌 것 같거든? | 1/1 = 100% | 좋아하는 |
| 한국의 경제 성장에 대해 설명해주세요 | 1/1 = 100% | 때문 |
| 이거 진짜 맛있거든? 너도 한번 먹어봐 | 3/3 = 100% | 보세요, 있으며, 한다고 |
| Write a Python function... | 0/0 = N/A | (no Korean tokens expected) |
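The card does not spell out the counting procedure behind these adoption numbers. A plausible reconstruction, assuming new tokens occupy the ID range at and above the original vocab size (the function name and counting rule are illustrative, not taken from the repo):

```python
OLD_VOCAB_SIZE = 151_669  # original Qwen3-32B vocab; new Korean tokens sit at IDs >= this

def adoption_rate(output_ids, decoded_text, new_token_strings):
    """Share of new-token opportunities realized as a single new token ID.

    An "opportunity" is an occurrence of a new token's surface string in the
    decoded output; a "use" is an emitted ID in the new-token range.
    """
    uses = sum(1 for t in output_ids if t >= OLD_VOCAB_SIZE)
    opportunities = sum(decoded_text.count(s) for s in new_token_strings)
    return uses / opportunities if opportunities else None

# Toy example: two opportunities in the text, both emitted as single new IDs
print(adoption_rate([42, 151_700, 7, 152_001], "하지만 때문", ["하지만", "때문"]))
# 1.0
```

The `0/0 = N/A` row for the English prompt corresponds to the `None` branch: no opportunities, so the rate is undefined rather than zero.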
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "2264K/Qwen3-32B-KoTokenizer"

# NF4 quantization (fits in 24GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Verify new tokens work
print(tokenizer.encode("하셔서", add_special_tokens=False))
# [155305] <- single token (was 3 tokens before)

# Generate
messages = [{"role": "user", "content": "어제 친구를 만났는데 걔가 갑자기 이상한 얘기를 하더라고."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Important notes
- This is a merged model (not an adapter). Load it directly like any HuggingFace model.
- The tokenizer is included. No need to load the base Qwen3-32B tokenizer separately.
- The model's generation style is unchanged from Qwen3-32B: this modification only affects tokenization efficiency, not the model's personality or capabilities.
- English and code generation are unaffected (0 new tokens in English outputs, as expected).
## Files

- `model-*.safetensors`: merged model weights (bf16)
- `tokenizer.json`, `tokenizer_config.json`: expanded tokenizer
- `token_expansion_metadata.json`: metadata for all 3,682 added tokens (token strings, IDs, source sub-token IDs used for mean pooling init)
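One way to sanity-check the expansion metadata after download. The field names below are assumptions inferred from the description above, not a documented schema, and the inline JSON stands in for `json.load` on the real file:

```python
import json

# Stand-in for: records = json.load(open("token_expansion_metadata.json"))
# Field names ("token", "id", "source_sub_token_ids") are guesses.
records = json.loads(
    '[{"token": "하지만", "id": 151900, "source_sub_token_ids": [3, 14, 15]}]'
)

BASE_VOCAB_SIZE = 151_669
for r in records:
    # Every added token should sit above the original vocab and carry the
    # sub-token IDs that seeded its mean-pooling initialization.
    assert r["id"] >= BASE_VOCAB_SIZE
    assert len(r["source_sub_token_ids"]) >= 1

print(f"{len(records)} records OK")
```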
## License
Apache 2.0 (same as Qwen3-32B)
Base model: [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)