KotodamaLM Japanese Tokenizer
KotodamaLM is a Japanese-first SentencePiece tokenizer for LLM experiments.
The name comes from kotodama, the Japanese idea that words carry spirit or
power. Here the concept is deliberately practical: a compact Japanese tokenizer
with explicit benchmark-source separation.
This tokenizer is grouped with JPN-Bench in the Hugging Face collection "KotodamaLM Japanese Language Infrastructure".
Files
- tokenizer.model: SentencePiece unigram model.
- tokenizer.vocab: SentencePiece vocabulary.
- tokenizer_config.json: lightweight metadata for downstream tools.
- special_tokens_map.json: special-token metadata.
- sentencepiece_config.json: training/runtime notes for direct SentencePiece use.
- evaluation/jpn_bench_literacy.json: JPN-Bench tokenizer-literacy report.
- evaluation/jpn_bench_seed.json: seed-set report.
- training/tokenizer_benchmark_exclusion.json: tokenizer-training exclusion audit from released benchmark surfaces.
- training/corpus_benchmark_exclusion_audit.json: strict corpus-level benchmark exclusion audit.
Usage
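import sentencepiece as spm

# Load the SentencePiece unigram model shipped in this repository.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "日本語の文章をトークン化します。"
pieces = sp.encode(text, out_type=str)  # subword pieces as strings
ids = sp.encode(text, out_type=int)     # the same pieces as vocabulary ids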
Special tokens:
- <unk>
- <s>
- </s>
- <JA>
- <EN>
- <TRANSLATE>
The model also uses byte fallback, so unknown Japanese surfaces should remain representable even when they are fragmented.
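To make the byte-fallback behavior concrete, the following is a minimal sketch that recognizes SentencePiece byte pieces (written as <0xNN>, one piece per UTF-8 byte) and reassembles them into text. The helper names is_byte_piece and decode_byte_pieces are illustrative and not part of this repository, and the sample piece list is hand-written rather than actual output from tokenizer.model.

```python
import re

# SentencePiece byte-fallback pieces are written as <0xNN>, one per UTF-8 byte.
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def is_byte_piece(piece: str) -> bool:
    """Return True if the piece is a byte-fallback piece such as <0xE7>."""
    return BYTE_PIECE.match(piece) is not None


def decode_byte_pieces(pieces: list[str]) -> str:
    """Decode a run of byte pieces as UTF-8 (hypothetical helper)."""
    raw = bytes(int(BYTE_PIECE.match(p).group(1), 16) for p in pieces)
    return raw.decode("utf-8")


# Hand-written example: the three UTF-8 bytes of 私 expressed as byte pieces.
sample = ["<0xE7>", "<0xA7>", "<0x81>"]
recovered = decode_byte_pieces(sample)
```

A character split this way costs three tokens instead of one, but it still round-trips to the original text, which is why it does not surface as <unk>.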
Training Separation
This tokenizer was trained from the tokenizer-training source lane, not the JPN-Bench benchmark lane. Before training, the corpus was filtered against released JPN-Bench target surfaces:
- benchmark exclusion terms: 265
- input lines: 91,826
- kept lines: 65,394
- excluded lines: 26,432
This makes the tokenizer useful as a clean baseline trained without public JPN-Bench material, though the strict filter also removes many common Japanese words and characters from the training data.
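The filtering step can be sketched as follows. This is an illustration only: it assumes a line is excluded whenever it contains any exclusion term as a substring, which may differ from the rule actually used for the audit. The function name filter_corpus and the toy lines and terms are hypothetical; only the counts in reported come from the figures above.

```python
def filter_corpus(lines, exclusion_terms):
    """Split lines into kept and excluded by substring match (assumed rule)."""
    kept, excluded = [], []
    for line in lines:
        if any(term in line for term in exclusion_terms):
            excluded.append(line)
        else:
            kept.append(line)
    return kept, excluded


# Toy data for illustration; not taken from the real corpus or term list.
toy_lines = ["今日は晴れです。", "駅まで歩きます。", "水を飲みます。", "映画を見ました。"]
toy_terms = ["駅", "水"]
kept, excluded = filter_corpus(toy_lines, toy_terms)

# Counts reported above; kept + excluded should equal the input line count.
reported = {"input": 91_826, "kept": 65_394, "excluded": 26_432}
kept_share = reported["kept"] / reported["input"]
```

Under this assumed rule, a single-character term such as 駅 drops every line containing that character, which is one plausible reason a list of 265 terms removes roughly 29% of the input lines.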
JPN-Bench Result
Current public JPN-Bench tokenizer-literacy result:
{
  "items": 60,
  "passed_items": 33,
  "pass_rate": 0.55,
  "targets": 209,
  "target_passes": 178,
  "target_pass_rate": 0.8517,
  "total_unk": 0
}
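The two rates in this report are measured at different granularities: pass_rate is per item, while target_pass_rate is per target. A short sketch recomputing both from the raw counts is shown here. The values are embedded in the snippet rather than read from evaluation/jpn_bench_literacy.json, since that file's exact layout is not shown in this card, and the interpretation of each rate as passes divided by total is inferred from the numbers.

```python
import json

# Values copied from the report above.
report_text = """
{
  "items": 60,
  "passed_items": 33,
  "pass_rate": 0.55,
  "targets": 209,
  "target_passes": 178,
  "target_pass_rate": 0.8517,
  "total_unk": 0
}
"""
report = json.loads(report_text)

# Item-level rate: an item may contain several targets.
item_rate = report["passed_items"] / report["items"]

# Target-level rate: each target is counted individually.
target_rate = report["target_passes"] / report["targets"]

print(f"item pass rate:   {item_rate:.4f}")
print(f"target pass rate: {target_rate:.4f}")
```

The gap between the two rates is consistent with failures being spread across items: 31 failed targets are enough to fail 27 of the 60 items.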
The most visible weakness is fragmentation of common single-kanji targets after
strict benchmark exclusion. Examples include 私, 本, 水, 父, and 駅,
which often fall back to byte pieces. This is expected for a clean baseline
and should not be read as a claim of production-quality Japanese tokenization.
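One way to check which targets hit byte fallback is sketched here. The function byte_fallback_targets is a hypothetical helper that accepts any encoder callable, and toy_encode is a stand-in used only so the snippet runs without the model file; its output is not real output from tokenizer.model.

```python
import re
from typing import Callable, Iterable

# Byte-fallback pieces are written as <0xNN>.
BYTE_PIECE = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")


def byte_fallback_targets(
    encode: Callable[[str], list[str]], targets: Iterable[str]
) -> dict[str, list[str]]:
    """Map each target whose encoding contains byte pieces to its pieces."""
    flagged = {}
    for target in targets:
        pieces = encode(target)
        if any(BYTE_PIECE.match(p) for p in pieces):
            flagged[target] = pieces
    return flagged


def toy_encode(text: str) -> list[str]:
    """Stand-in encoder: pretends only 日本 is in the vocabulary."""
    if text == "日本":
        return ["▁日本"]
    return ["▁"] + [f"<0x{b:02X}>" for b in text.encode("utf-8")]


flagged = byte_fallback_targets(toy_encode, ["私", "日本", "駅"])
```

With the real model, the encoder argument could be lambda t: sp.encode(t, out_type=str), where sp is the processor loaded in the Usage section.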
Intended Use
- Japanese tokenizer experiments
- LLM corpus-packing experiments
- JPN-Bench regression tracking
- source-separation and anti-overfit baselines
Limitations
- Tokenizer-only artifact, not a full pretrained LLM.
- The paired prototype LLM run was intentionally not uploaded here.
- Strict public benchmark exclusion reduces coverage of high-frequency Japanese surfaces.
- JPN-Bench public literacy scores should be treated as dev/regression scores, not hidden benchmark scores.
Citation
@misc{kotodamalm_japanese_tokenizer_2026,
  title = {KotodamaLM Japanese Tokenizer},
  author = {MarcoDotIO},
  year = {2026},
  publisher = {Hugging Face},
  license = {CC-BY-4.0}
}