KotodamaLM Japanese Tokenizer

KotodamaLM is a Japanese-first SentencePiece tokenizer for LLM experiments. The name comes from kotodama, the Japanese idea that words carry spirit or power. Here the concept is deliberately practical: a compact Japanese tokenizer with explicit benchmark-source separation.

This tokenizer is grouped with JPN-Bench in the Hugging Face collection "KotodamaLM Japanese Language Infrastructure".

Files

tokenizer.model: SentencePiece unigram model.
tokenizer.vocab: SentencePiece vocabulary.
tokenizer_config.json: lightweight metadata for downstream tools.
special_tokens_map.json: special-token metadata.
sentencepiece_config.json: training/runtime notes for direct SentencePiece use.
evaluation/jpn_bench_literacy.json: JPN-Bench tokenizer-literacy report.
evaluation/jpn_bench_seed.json: seed-set report.
training/tokenizer_benchmark_exclusion.json: tokenizer-training exclusion audit from released benchmark surfaces.
training/corpus_benchmark_exclusion_audit.json: strict corpus-level benchmark exclusion audit.

Usage

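import sentencepiece as spm

# Load the unigram model shipped in this repository.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "日本語の文章をトークン化します。"
pieces = sp.encode(text, out_type=str)  # subword pieces as strings
ids = sp.encode(text, out_type=int)     # the same segmentation as vocabulary ids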

Special tokens:

<unk>
<s>
</s>
<JA>
<EN>
<TRANSLATE>

The model also uses byte fallback, so characters missing from the vocabulary are encoded as UTF-8 byte pieces instead of <unk>. Unknown Japanese surfaces therefore remain representable, at the cost of fragmentation.
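
To see how much of a tokenization relies on byte fallback, one option is to count pieces of the form <0xNN>, which is how SentencePiece renders byte pieces. The following is a minimal sketch, not part of the released files: the helper names byte_fallback_count and decode_byte_pieces are hypothetical, and the example piece list is written by hand rather than taken from tokenizer.model.

```python
import re

# SentencePiece renders byte-fallback pieces as "<0xNN>" (two uppercase hex digits).
BYTE_PIECE = re.compile(r"^<0x[0-9A-F]{2}>$")


def byte_fallback_count(pieces):
    """Return how many pieces in a tokenization are byte-fallback pieces."""
    return sum(1 for p in pieces if BYTE_PIECE.match(p))


def decode_byte_pieces(pieces):
    """Join a run of byte-fallback pieces back into text, assuming valid UTF-8."""
    raw = bytes(int(p[3:5], 16) for p in pieces)
    return raw.decode("utf-8")


# Hand-written illustration (not real model output): 駅 is E9 A7 85 in UTF-8.
example = ["▁", "<0xE9>", "<0xA7>", "<0x85>"]
fallback_pieces = byte_fallback_count(example)
recovered = decode_byte_pieces(example[1:])
```

With a loaded processor, the same count could be applied to sp.encode(text, out_type=str) to flag inputs that fragment heavily.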

Training Separation

This tokenizer was trained from the tokenizer-training source lane, not the JPN-Bench benchmark lane. Before training, the corpus was filtered against released JPN-Bench target surfaces:

benchmark exclusion terms: 265
input lines: 91,826
kept lines: 65,394
excluded lines: 26,432

This makes the tokenizer useful as a clean baseline trained on no public JPN-Bench material. The cost is coverage: the strict filter dropped 26,432 of 91,826 input lines (about 28.8%), and with them many common Japanese words and characters.
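
The exact filtering rule is not spelled out here. As a rough sketch, assuming a line is dropped whenever it contains any exclusion term as a substring, the split could look like the following. The function name split_by_exclusion and the toy terms and lines are hypothetical, not the real 265-term list or corpus.

```python
def split_by_exclusion(lines, exclusion_terms):
    """Partition lines into (kept, excluded) by substring match against any term."""
    kept, excluded = [], []
    for line in lines:
        if any(term in line for term in exclusion_terms):
            excluded.append(line)
        else:
            kept.append(line)
    return kept, excluded


# Toy data for illustration only (assumed, not from the training corpus).
terms = ["水", "駅"]
lines = ["今日は晴れです。", "水を飲みます。", "駅まで歩く。", "猫が寝ている。"]
kept, excluded = split_by_exclusion(lines, terms)

# Consistency check on the counts reported above.
input_lines, kept_lines, excluded_lines = 91_826, 65_394, 26_432
counts_consistent = kept_lines + excluded_lines == input_lines
excluded_share = excluded_lines / input_lines
```

Under this assumed rule, a single-kanji term such as 水 removes every line that mentions it, which would be consistent with the coverage loss described above.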

JPN-Bench Result

Current public JPN-Bench tokenizer-literacy result:

{
  "items": 60,
  "passed_items": 33,
  "pass_rate": 0.55,
  "targets": 209,
  "target_passes": 178,
  "target_pass_rate": 0.8517,
  "total_unk": 0
}
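
The two rates follow directly from the counts. A short check, using the JSON above inlined as a string (it is an assumption that evaluation/jpn_bench_literacy.json uses exactly these field names):

```python
import json

# The report shown above, inlined so the check runs without the file.
report = json.loads("""
{
  "items": 60,
  "passed_items": 33,
  "pass_rate": 0.55,
  "targets": 209,
  "target_passes": 178,
  "target_pass_rate": 0.8517,
  "total_unk": 0
}
""")

# Recompute both rates from the raw counts.
item_rate = report["passed_items"] / report["items"]
target_rate = report["target_passes"] / report["targets"]

# Counts of failures at each level.
failed_items = report["items"] - report["passed_items"]
failed_targets = report["targets"] - report["target_passes"]
```

Rounded to four decimal places, item_rate and target_rate match the reported pass_rate and target_pass_rate.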

The most visible weakness is fragmentation of common single-kanji targets after strict benchmark exclusion. Examples include 私, 本, 水, 父, and 駅, which often fall back to byte pieces. This is expected for a clean baseline and should not be read as a claim of production-quality Japanese tokenization.

Intended Use

Japanese tokenizer experiments
LLM corpus-packing experiments
JPN-Bench regression tracking
source-separation and anti-overfit baselines

Limitations

Tokenizer-only artifact, not a full pretrained LLM.
The paired prototype LLM run was intentionally not uploaded here.
Strict public benchmark exclusion reduces coverage of high-frequency Japanese surfaces.
JPN-Bench public literacy scores should be treated as dev/regression scores, not hidden benchmark scores.

Citation

@misc{kotodamalm_japanese_tokenizer_2026,
  title = {KotodamaLM Japanese Tokenizer},
  author = {MarcoDotIO},
  year = {2026},
  publisher = {Hugging Face},
  license = {CC-BY-4.0}
}

