---
license: llama2
library_name: transformers
---
# CJK Tokenizer
A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. This tokenizer is designed for ablation experiments to evaluate the impact of using **character-level** tokenization instead of subword-level tokenization on the semantic understanding of CJK text.
## Features
- **Extended Vocabulary**
- Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
- Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
- **Character-Level**
- Each CJK character is treated as its own token
- Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
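The repository does not publish its build script, but the vocabulary extension described above can be sketched roughly as follows. This is a hypothetical reconstruction: it enumerates only the basic CJK Unified Ideographs block (U+4E00–U+9FFF), whereas the published tokenizer draws its 16,689 characters from a presumably filtered selection of the Unicode CJK blocks.

```python
# Sketch: enumerate the basic CJK Unified Ideographs block and add each
# character as its own token. Hypothetical reconstruction -- the actual
# tokenizer's character selection is not documented here.
def cjk_unified_ideographs():
    """All code points in the basic CJK Unified Ideographs block."""
    return [chr(cp) for cp in range(0x4E00, 0xA000)]

chars = cjk_unified_ideographs()
print(len(chars))  # 20992 code points in the basic block

# With a transformers tokenizer loaded, the extension step itself would be:
# num_added = tokenizer.add_tokens(chars)
```

Note that `add_tokens` only grows the vocabulary; any model using the tokenizer would also need its embedding matrix resized (e.g. `model.resize_token_embeddings(len(tokenizer))`).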
## Usage
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
tokens = tokenizer(text)
print(tokens.tokens)
```
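To verify the character-level property on your own text, you can check that every token containing a CJK character is exactly one character long. The `is_cjk` predicate below is a deliberate simplification (it covers only the basic Unified Ideographs block, not all CJK blocks), and the helper name is ours, not part of the tokenizer's API:

```python
def is_cjk(ch):
    # Simplified check: basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"

def all_cjk_tokens_single_char(tokens):
    """True if every token containing a CJK character is a single character."""
    return all(len(t) == 1 for t in tokens if any(is_cjk(c) for c in t))

# A character-level segmentation passes; a subword segmentation need not:
print(all_cjk_tokens_single_char(["天", "地", "玄", "黃"]))  # True
print(all_cjk_tokens_single_char(["天地", "玄黃"]))          # False
```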
## Tokenizer Details

| Property | Value |
|---|---|
| Base tokenizer | [LLaMA tokenizer](https://github.com/meta-llama/llama) |
| Added tokens | 16,689 unique CJK characters (Unicode CJK blocks) |
| Special tokens | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.) |
## Experimental Purpose

This tokenizer was created for ablation studies in CJK language modeling:

1. **Hypothesis:** Character-level tokenization of CJK scripts may improve or alter a model's semantic understanding compared to subword tokenization.
2. **Method:**
   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer and (b) this CJKCharTokenizer.
   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
3. **Analysis:** Evaluate metrics such as perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.
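One caveat for the analysis step: raw (per-token) perplexities from the two tokenizers are not directly comparable, because they segment the same text into different numbers of tokens. A common remedy is to normalize the total negative log-likelihood by character count instead. A minimal sketch (the helper name and example numbers are illustrative, not from this repository):

```python
import math

def char_level_perplexity(total_nll, num_chars):
    """Perplexity normalized per character rather than per token, so that
    models using different tokenizers are comparable on the same text."""
    return math.exp(total_nll / num_chars)

# Same text and same total NLL give the same per-character perplexity,
# regardless of how many tokens each tokenizer produced.
print(char_level_perplexity(total_nll=120.0, num_chars=100))  # exp(1.2) ≈ 3.32
```

Bits-per-character (`total_nll / (num_chars * math.log(2))`) is an equivalent, equally tokenizer-agnostic alternative.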
## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.
## License

This project is licensed under the MIT License.
## Citation

If you use this tokenizer in your research or projects, please cite:

```bibtex
@misc{CJKCharTokenizer2025,
  author       = {Your Name},
  title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
}
```