
	---
	license: llama2
	library_name: transformers
	---

	# CJK Tokenizer

	A BPE tokenizer that extends the LLaMA tokenizer vocabulary with all Unicode CJK characters. This tokenizer is designed for ablation experiments to evaluate the impact of using character-level tokenization instead of subword-level tokenization on the semantic understanding of CJK text.

	## Features

	- Extended Vocabulary
	  - Base: [LLaMA tokenizer](https://github.com/meta-llama/llama)
	  - Added: 16,689 unique CJK characters covering the full Unicode CJK blocks
	- Character-Level
	  - Each CJK character is treated as its own token
	  - Other scripts (Latin, punctuation, etc.) follow the original LLaMA subword scheme
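
The split described above (one token per CJK character, subword handling for everything else) can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's implementation: the `CJK_RANGES` table, `is_cjk`, and `split_cjk` are hypothetical names, and the code point ranges are an assumption about which Unicode blocks count as CJK here.

```python
# Illustrative only: these ranges are an assumption, not the tokenizer's actual character list.
CJK_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_cjk(ch: str) -> bool:
    """Return True if ch falls in one of the assumed CJK ideograph ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def split_cjk(text: str) -> list[str]:
    """Split text into single CJK characters and runs of everything else."""
    pieces: list[str] = []
    run = ""  # current run of non-CJK characters
    for ch in text:
        if is_cjk(ch):
            if run:
                pieces.append(run)
                run = ""
            pieces.append(ch)
        else:
            run += ch
    if run:
        pieces.append(run)
    return pieces


print(split_cjk("LLaMA 天地玄黃, ok"))
# ['LLaMA ', '天', '地', '玄', '黃', ', ok']
```

In the real tokenizer the non-CJK runs would then be handled by the original LLaMA subword scheme; here they are simply left as whole strings.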

	## Usage

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("AgentBull/CJK-Tokenizer")

	text = "天地玄黃，宇宙洪荒。日月盈昃，辰宿列張。"
	# tokenize() returns the token strings; tokenizer(text) would return input IDs instead
	tokens = tokenizer.tokenize(text)
	print(tokens)
	```
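
To check that CJK characters really come out as single tokens, a small helper can scan a token list for characters that did not survive as standalone tokens. The sketch below is an assumption-laden illustration: `multi_token_cjk` is a hypothetical helper, it only checks the main CJK Unified Ideographs block, and the sample token list is written by hand in the SentencePiece style (`▁` word marker, `<0xNN>` byte-fallback pieces) rather than produced by this tokenizer.

```python
def multi_token_cjk(text: str, tokens: list[str]) -> list[str]:
    """Return the CJK characters in text that do not appear as a standalone token."""
    # SentencePiece-style tokenizers mark word starts with "▁"; strip it before comparing.
    standalone = {tok.lstrip("▁") for tok in tokens}
    missing: list[str] = []
    for ch in text:
        # Assumption: only the main CJK Unified Ideographs block (U+4E00..U+9FFF) is checked.
        if "\u4e00" <= ch <= "\u9fff" and ch not in standalone and ch not in missing:
            missing.append(ch)
    return missing


# Hand-written, hypothetical token list: "玄" has been split into three byte-fallback pieces.
sample_tokens = ["▁", "天", "地", "<0xE7>", "<0x8E>", "<0x84>"]
print(multi_token_cjk("天地玄", sample_tokens))
# ['玄']
```

With a loaded tokenizer, the same check could be run as `multi_token_cjk(text, tokenizer.tokenize(text))`; an empty list would mean every checked character mapped to its own token.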


	## Tokenizer Details

	| Property | Value |
	| --- | --- |
	| Base tokenizer | LLaMA tokenizer |
	| CJK characters added | 16,689 |
	| Added tokens | All CJK Unified Ideographs (Unicode) |
	| Special tokens | Inherited from LLaMA (`<unk>`, `<s>`, `</s>`, etc.) |

	## Experimental Purpose

	This tokenizer was created for ablation studies in CJK language modeling:
	1. Hypothesis: Character-level tokenization of CJK scripts may improve or alter model semantic understanding compared to subword tokenization.
	2. Method:
	   - Train or fine-tune identical LLM architectures using (a) the standard LLaMA tokenizer and (b) this CJKCharTokenizer.
	   - Compare downstream performance on tasks such as language modeling, classification, and machine translation.
	3. Analysis: Evaluate metrics like perplexity, downstream task accuracy, and qualitative behavior on CJK-specific phenomena.
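
One caveat when comparing perplexity across the two tokenizers: per-token perplexity depends on how many tokens a text is split into, so it is not directly comparable between vocabularies. A common workaround is to normalize the total loss by character count instead. The sketch below shows that conversion; the function names and all numbers are hypothetical and are not results from this project.

```python
import math


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity from the summed negative log-likelihood (in nats)."""
    return math.exp(total_nll_nats / num_tokens)


def bits_per_char(total_nll_nats: float, num_chars: int) -> float:
    """Tokenizer-independent loss: total NLL in bits divided by character count."""
    return total_nll_nats / (math.log(2) * num_chars)


# Hypothetical numbers: the same 100-character text, scored with the same total loss,
# but split into 150 tokens by one tokenizer and 100 tokens by the other.
total_nll = 300.0
num_chars = 100

print(perplexity(total_nll, 150))         # lower per-token perplexity
print(perplexity(total_nll, 100))         # higher per-token perplexity
print(bits_per_char(total_nll, num_chars))  # identical for both tokenizers
```

The two perplexities differ even though the model assigned the text the same total probability, which is why a per-character measure is the safer basis for the comparison in step 3.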

	## Contributing

	Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

	## License

	This project is licensed under the MIT License.

	## Citation

	If you use this tokenizer in your research or projects, please cite:

	@misc{CJKCharTokenizer2025,
	  author = {Your Name},
	  title = {CJKCharTokenizer: A Character-Level CJK Tokenizer for LLaMA},
	  year = {2025},
	  howpublished = {\url{https://huggingface.co/AgentBull/CJK-Tokenizer}},
	}