---
license: apache-2.0
library_name: transformers
---

# CJK Tokenizer

A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. This tokenizer is designed for ablation experiments that evaluate the impact of **character-level** tokenization, as opposed to subword-level tokenization, on the semantic understanding of CJK text.

## Features

- **Extended Vocabulary**
  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
- **Character-Level**
  - Each CJK character is treated as its own token
  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
- **Hugging Face Integration**
  - Fully compatible with the `transformers` library
  - Supports `from_pretrained`, `save_pretrained`, and standard tokenizer methods

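Because the added characters are drawn from the Unicode CJK blocks, whether a given character falls under the character-level scheme can be checked directly against code-point ranges. A minimal stdlib sketch; the range list below covers the main CJK Unified Ideographs blocks and is illustrative, not the tokenizer's actual implementation:

```python
# Code-point ranges of the main CJK Unified Ideographs blocks (Unicode).
# Illustrative subset; the tokenizer's full coverage spans all CJK blocks.
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]

def is_cjk(ch: str) -> bool:
    """Return True if the character lies in one of the listed CJK blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

# Each CJK character becomes its own token; other text stays subword.
text = "天地玄黃 abc"
print([ch for ch in text if is_cjk(ch)])  # → ['天', '地', '玄', '黃']
```
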
## Installation

```bash
pip install transformers
```

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer by repository name or local path
tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

# Tokenize some CJK text
text = "天地玄黃, 宇宙洪荒。 日月盈昃, 辰宿列張。"
tokens = tokenizer.tokenize(text)
print(tokens)
```

## Tokenizer Details

| Property         | Value                                                 |
| ---------------- | ----------------------------------------------------- |
| Base tokenizer   | LLaMA tokenizer                                       |
| Added CJK tokens | 16,689 (all CJK Unified Ideographs, Unicode)          |
| Special tokens   | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.)   |

## Experimental Purpose

This tokenizer was created for ablation studies in CJK language modeling:

1. **Hypothesis:** Character-level tokenization of CJK scripts may improve or alter a model's semantic understanding compared to subword tokenization.
2. **Method:**
   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer and (b) this CJKCharTokenizer.
   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
3. **Analysis:** Evaluate metrics such as perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.

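As a reminder of the perplexity metric used in the analysis step, perplexity is the exponentiated mean negative log-likelihood over the token stream. A minimal stdlib sketch with hypothetical per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model might assign to four tokens
probs = [0.25, 0.5, 0.125, 0.25]
print(round(perplexity(probs), 4))  # → 4.0
```

Note that perplexities computed under different tokenizers are not directly comparable, since character-level tokenization yields more tokens per sentence; for the ablation above, normalizing to a common unit (e.g. per character or per byte) is advisable.
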
## Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

## License

This project is licensed under the Apache License 2.0.

## Citation

If you use this tokenizer in your research or projects, please cite:

```bibtex
@misc{CJKCharTokenizer2025,
  author       = {Your Name},
  title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
}
```