---
license: llama2
library_name: transformers
---

# CJK Tokenizer

A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. It is designed for ablation experiments that measure how **character-level** tokenization, compared with subword-level tokenization, affects a model's semantic understanding of CJK text.

## Features

- **Extended Vocabulary**
  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
- **Character-Level**
  - Each CJK character is treated as its own token
  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme

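To make the character-level rule concrete, the sketch below splits text the way the list above describes: every CJK ideograph becomes its own piece, and everything else stays in runs that would be passed to the original LLaMA subword scheme. It is an illustration only, not the tokenizer's actual implementation. The helper names `is_cjk_ideograph` and `split_cjk` are hypothetical, and the range U+4E00 to U+9FFF (the basic CJK Unified Ideographs block) is an assumption that may cover only part of what the tokenizer includes.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Return True if ch lies in the basic CJK Unified Ideographs block.

    Assumption: only U+4E00..U+9FFF is checked; extension blocks are ignored.
    """
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_cjk(text: str) -> list[str]:
    """Split text into single CJK characters and runs of everything else."""
    pieces: list[str] = []
    run = ""  # accumulates consecutive non-CJK characters
    for ch in text:
        if is_cjk_ideograph(ch):
            if run:
                pieces.append(run)
                run = ""
            pieces.append(ch)  # each ideograph stands alone
        else:
            run += ch
    if run:
        pieces.append(run)
    return pieces


print(split_cjk("LLaMA 天地玄黃 tokenizer"))
# ['LLaMA ', '天', '地', '玄', '黃', ' tokenizer']
```

In this sketch the non-CJK runs are left untouched; in the real tokenizer they would be segmented further by the LLaMA subword model.
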
## Installation

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
encoding = tokenizer(text)

# Map the token IDs back to token strings for inspection.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```

## Tokenizer Details

| Property | Value |
|---|---|
| Base tokenizer | LLaMA tokenizer |
| Vocabulary size | Base LLaMA vocabulary plus 16,689 added CJK characters |
| Added tokens | All CJK Unified Ideographs (Unicode) |
| Special tokens | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.) |

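The table lists CJK Unified Ideographs as the added tokens but does not say how the character list was built. The following is a minimal sketch of one way such an extension could be reproduced, not a description of how this tokenizer was actually made. The helper names `cjk_unified_ideographs` and `extend_vocabulary` are hypothetical, the default range U+4E00 to U+9FFF is an assumption, and the resulting count depends on the Unicode version bundled with Python, so it will not necessarily match 16,689. `get_vocab` and `add_tokens` are standard `transformers` tokenizer methods.

```python
import unicodedata


def cjk_unified_ideographs(start: int = 0x4E00, end: int = 0x9FFF) -> list[str]:
    """List assigned CJK Unified Ideographs in [start, end].

    Assumption: the default range is the basic block only.
    """
    chars = []
    for codepoint in range(start, end + 1):
        ch = chr(codepoint)
        # Unassigned code points have no name, so the "" default filters them out.
        if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
            chars.append(ch)
    return chars


def extend_vocabulary(tokenizer, chars: list[str]) -> int:
    """Add characters missing from the vocabulary; return how many were added."""
    existing = tokenizer.get_vocab()
    missing = [ch for ch in chars if ch not in existing]
    return tokenizer.add_tokens(missing)
```

With a loaded base tokenizer, a call such as `extend_vocabulary(tokenizer, cjk_unified_ideographs())` would register the missing characters. Any model paired with the extended tokenizer would then need its embedding matrix resized, for example with `model.resize_token_embeddings(len(tokenizer))`.
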
## Experimental Purpose

This tokenizer was created for ablation studies in CJK language modeling:

1. Hypothesis: Character-level tokenization of CJK scripts may improve or alter model semantic understanding compared to subword tokenization.
2. Method:
   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer, and (b) this CJKCharTokenizer.
   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
3. Analysis: Evaluate metrics like perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.

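One caveat when comparing perplexity across the two setups: per-token perplexity depends on how many tokens a text is split into, so it is not directly comparable between tokenizers. A common workaround is to normalize the total negative log-likelihood by the number of characters instead. The sketch below shows that normalization; the function names and the numbers in the example are hypothetical and chosen only to illustrate the effect.

```python
import math


def token_perplexity(token_nlls: list[float]) -> float:
    """Per-token perplexity from per-token negative log-likelihoods (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def bits_per_character(token_nlls: list[float], text: str) -> float:
    """Total negative log-likelihood (nats) converted to bits per character."""
    if not text:
        raise ValueError("text must be non-empty")
    return sum(token_nlls) / (math.log(2) * len(text))


text = "天地玄黃"
subword_nlls = [2.0, 2.0]           # hypothetical: 2 subword tokens
char_nlls = [1.0, 1.0, 1.0, 1.0]    # hypothetical: 4 character tokens

# Same total likelihood, but per-token perplexity differs with token count.
print(token_perplexity(subword_nlls), token_perplexity(char_nlls))
# Bits per character is identical because the total likelihood is identical.
print(bits_per_character(subword_nlls, text), bits_per_character(char_nlls, text))
```

Both hypothetical models assign the text the same total likelihood, yet the character-level one appears better on per-token perplexity simply because it spreads that likelihood over more tokens. The character-normalized metric removes that artifact.
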
## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

## License

This project is licensed under the MIT License.

## Citation

If you use this tokenizer in your research or projects, please cite:

    @misc{CJKCharTokenizer2025,
      author       = {Your Name},
      title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
      year         = {2025},
      howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
    }