AgentBull/CJK-Tokenizer

0xDing committed on Apr 20, 2025

Commit 2e9d2fe · verified · 1 Parent(s): f040328

Update README.md

Files changed (1)

README.md +72 -3

@@ -1,3 +1,72 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+library_name: transformers
+---
+# CJK Tokenizer
+A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. This tokenizer is designed for ablation experiments to evaluate the impact of using **character-level** tokenization instead of subword-level tokenization on the semantic understanding of CJK text.
+## Features
+- **Extended Vocabulary**
+  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
+  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
+- **Character-Level**
+  - Each CJK character is treated as its own token
+  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
+- **Hugging Face Integration**
+  - Fully compatible with the `transformers` library
+  - Supports `from_pretrained`, `save_pretrained`, and standard tokenizer methods
+## Usage
+```python
+from transformers import AutoTokenizer
+
+# Load the tokenizer by repository name or local path
+tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")
+
+# Tokenize some CJK text
+text = "天地玄黃， 宇宙洪荒。 日月盈昃， 辰宿列張。"
+encoding = tokenizer(text)
+
+# tokens() is a method on the returned BatchEncoding (fast tokenizers only)
+print(encoding.tokens())
+```
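The character-level scheme listed under Features can be pictured with a small standalone sketch: each CJK ideograph becomes its own piece, while any other run of text is left whole for the base subword tokenizer to handle. This is not this tokenizer's implementation. `is_cjk_ideograph` and `split_cjk` are hypothetical helpers, and the range used here (only the basic CJK Unified Ideographs block, U+4E00 to U+9FFF) is an assumption about what counts as CJK.

```python
from typing import List


def is_cjk_ideograph(ch: str) -> bool:
    """Return True if ch falls in the basic CJK Unified Ideographs block.

    Assumption: only U+4E00..U+9FFF is checked; extension blocks are ignored.
    """
    return 0x4E00 <= ord(ch) <= 0x9FFF


def split_cjk(text: str) -> List[str]:
    """Split text so each CJK ideograph stands alone; other runs stay intact."""
    pieces: List[str] = []
    run = ""  # accumulates consecutive non-CJK characters
    for ch in text:
        if is_cjk_ideograph(ch):
            if run:
                pieces.append(run)
                run = ""
            pieces.append(ch)
        else:
            run += ch
    if run:
        pieces.append(run)
    return pieces


print(split_cjk("LLaMA 天地玄黃"))
```

In the real tokenizer the non-CJK runs would then be split further by the LLaMA subword scheme; the sketch stops before that step.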
+## Tokenizer Details
+| Property | Value |
+|---|---|
+| Base tokenizer | LLaMA tokenizer |
+| Vocabulary size | 16,689 |
+| Added tokens | All CJK Unified Ideographs (Unicode) |
+| Special tokens | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.) |
+## Experimental Purpose
+This tokenizer was created for ablation studies in CJK language modeling:
+1. **Hypothesis:** Character-level tokenization of CJK scripts may improve or alter model semantic understanding compared to subword tokenization.
+2. **Method:**
+   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer, and (b) this CJKCharTokenizer.
+   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
+3. **Analysis:** Evaluate metrics like perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.
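One caveat for the perplexity comparison: per-token perplexity is not directly comparable between two tokenizers, because they split the same text into different numbers of tokens. A common workaround is to normalize the total negative log-likelihood by the number of characters instead. The sketch below assumes the summed NLL (in nats) over an evaluation text is already available from each model. `bits_per_character` and `token_perplexity` are hypothetical helpers, and the numbers in the usage lines are made up purely for illustration.

```python
import math


def token_perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity: exp of the mean negative log-likelihood."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll_nats / num_tokens)


def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Total NLL converted to bits and divided by the character count.

    The denominator does not depend on the tokenizer, so the value can be
    compared across models that tokenize the same text differently.
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    return total_nll_nats / (num_chars * math.log(2))


# Illustrative, made-up totals for the same 1,000-character text.
subword_bpc = bits_per_character(total_nll_nats=2100.0, num_chars=1000)
charlevel_bpc = bits_per_character(total_nll_nats=2050.0, num_chars=1000)
print(f"subword: {subword_bpc:.3f} bpc, character-level: {charlevel_bpc:.3f} bpc")
```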
+## Contributing
+Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.
+## License
+This project is licensed under the Apache License 2.0.
+## Citation
+If you use this tokenizer in your research or projects, please cite:
+```bibtex
+@misc{CJKCharTokenizer2025,
+  author       = {Your Name},
+  title        = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
+  year         = {2025},
+  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
+}
+```