---
license: llama2
library_name: transformers
---

# CJK Tokenizer

A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. This tokenizer is designed for ablation experiments to evaluate the impact of using **character-level** tokenization instead of subword-level tokenization on the semantic understanding of CJK text.

## Features

- **Extended Vocabulary**  
  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)  
  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
- **Character-Level**  
  - Each CJK character is treated as its own token  
  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
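
Which characters count as "CJK" here follows the standard Unicode block ranges. The sketch below is illustrative only: it checks membership in the base CJK Unified Ideographs block and Extension A, and the exact set of blocks this tokenizer covers is an assumption.

```python
# Sketch: identify characters in the main CJK Unified Ideographs blocks.
# The precise block coverage of this tokenizer is an assumption; these are
# the standard Unicode ranges for the base block and Extension A.
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
]

def is_cjk(ch: str) -> bool:
    """Return True if the character falls in one of the listed CJK blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

text = "天地玄黃 ABC"
cjk_chars = [ch for ch in text if is_cjk(ch)]
print(cjk_chars)  # ['天', '地', '玄', '黃']
```

Characters outside these ranges (Latin letters, punctuation, whitespace) fall through to the base LLaMA subword scheme.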

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
tokens = tokenizer(text)
print(tokens.tokens())
```


## Tokenizer Details

| Property | Value |
|---|---|
| Base tokenizer | LLaMA tokenizer |
| Added CJK characters | 16,689 |
| Added tokens | All CJK Unified Ideographs (Unicode) |
| Special tokens | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.) |

## Experimental Purpose

This tokenizer was created for ablation studies in CJK language modeling:

1. **Hypothesis:** Character-level tokenization of CJK scripts may improve or alter a model's semantic understanding compared to subword tokenization.
2. **Method:**
   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer and (b) this CJKCharTokenizer.
   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
3. **Analysis:** Evaluate metrics such as perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.
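
One simple quantity to compare alongside the metrics above is tokenizer fertility, the number of tokens emitted per input character. The sketch below is a minimal illustration under assumed token lists: the subword split shown is a hypothetical byte-fallback example, not actual output of either tokenizer.

```python
# Sketch: tokenizer "fertility" = tokens emitted per character of input.
# A character-level CJK tokenizer should approach 1.0 on pure-CJK text,
# while a subword tokenizer may emit several byte-fallback tokens per
# out-of-vocabulary character. Token lists below are hypothetical.
def fertility(tokens: list[str], text: str) -> float:
    """Average number of tokens per character of the input text."""
    return len(tokens) / len(text)

text = "天地玄黃"
char_level = ["天", "地", "玄", "黃"]                       # 1 token per character
subword = ["<0xE5>", "<0xA4>", "<0xA9>", "地", "玄", "黃"]  # hypothetical byte fallback for 天

print(fertility(char_level, text))  # 1.0
print(fertility(subword, text))     # 1.5
```

Lower fertility on CJK text is the intended effect of the extended vocabulary; whether it translates into better downstream accuracy is exactly what the ablation is meant to test.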

## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

## License

This project is licensed under the MIT License.

## Citation

If you use this tokenizer in your research or projects, please cite:

```bibtex
@misc{CJKCharTokenizer2025,
  author       = {Your Name},
  title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
}
```