KotodamaLM Japanese Tokenizer

KotodamaLM is a Japanese-first SentencePiece tokenizer for LLM experiments. The name comes from kotodama, the Japanese idea that words carry spirit or power. Here the concept is deliberately practical: a compact Japanese tokenizer with explicit benchmark-source separation.

This tokenizer is grouped with JPN-Bench in the Hugging Face collection "KotodamaLM Japanese Language Infrastructure".

Files

  • tokenizer.model: SentencePiece unigram model.
  • tokenizer.vocab: SentencePiece vocabulary.
  • tokenizer_config.json: lightweight metadata for downstream tools.
  • special_tokens_map.json: special-token metadata.
  • sentencepiece_config.json: training/runtime notes for direct SentencePiece use.
  • evaluation/jpn_bench_literacy.json: JPN-Bench tokenizer-literacy report.
  • evaluation/jpn_bench_seed.json: seed-set report.
  • training/tokenizer_benchmark_exclusion.json: audit of the tokenizer-training data against released benchmark surfaces.
  • training/corpus_benchmark_exclusion_audit.json: strict corpus-level benchmark exclusion audit.

Usage

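import sentencepiece as spm

# Load the SentencePiece unigram model shipped in this repository.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "日本語の文章をトークン化します。"
pieces = sp.encode(text, out_type=str)  # subword pieces as strings
ids = sp.encode(text, out_type=int)     # the same segmentation as vocabulary ids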

Special tokens:

  • <unk>
  • <s>
  • </s>
  • <JA>
  • <EN>
  • <TRANSLATE>

The model also uses byte fallback, so unknown Japanese surfaces should remain representable even when they are fragmented.
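To see how much of an encoding relies on byte fallback, one option is to count pieces that use the SentencePiece byte-piece spelling, such as <0xE6>. The following is a minimal sketch; the helper names are hypothetical, and the example piece list is written by hand for illustration rather than taken from tokenizer.model.

```python
import re

# SentencePiece spells byte-fallback pieces as "<0x00>" through "<0xFF>".
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def is_byte_piece(piece: str) -> bool:
    """Return True if the piece is a single byte-fallback piece."""
    return BYTE_PIECE.match(piece) is not None


def byte_fallback_ratio(pieces: list[str]) -> float:
    """Fraction of pieces that are byte-fallback pieces (0.0 for empty input)."""
    if not pieces:
        return 0.0
    return sum(is_byte_piece(p) for p in pieces) / len(pieces)


def decode_byte_run(pieces: list[str]) -> str:
    """Decode a run made up only of byte pieces back into text."""
    raw = bytes(int(BYTE_PIECE.match(p).group(1), 16) for p in pieces)
    return raw.decode("utf-8")


# Hand-written example, NOT real output of tokenizer.model.
example = ["▁", "<0xE6>", "<0x97>", "<0xA5>", "本"]
print(byte_fallback_ratio(example))   # 3 of the 5 pieces are byte pieces
print(decode_byte_run(example[1:4]))  # E6 97 A5 is the UTF-8 encoding of 日
```

With the processor from the Usage section, the same check would be byte_fallback_ratio(sp.encode(text, out_type=str)). A high ratio on ordinary text suggests the surface was fragmented rather than covered by a learned piece.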

Training Separation

This tokenizer was trained from the tokenizer-training source lane, not the JPN-Bench benchmark lane. Before training, the corpus was filtered against released JPN-Bench target surfaces:

  • benchmark exclusion terms: 265
  • input lines: 91,826
  • kept lines: 65,394
  • excluded lines: 26,432

The filter removed about 28.8% of the input lines. This makes the tokenizer useful as a clean baseline trained without public JPN-Bench material, though the strict filter also removes many common Japanese words and characters.
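The exact matching rule used for the exclusion is not described here. As a rough sketch, assuming a plain substring match of each exclusion term against each corpus line, the filtering step could look like the following. The function name, terms, and lines are hypothetical toy data, not the released exclusion list.

```python
def split_by_exclusion(lines, terms):
    """Partition lines into (kept, excluded) using plain substring matching.

    Assumption: a line is excluded if it contains any exclusion term.
    """
    kept, excluded = [], []
    for line in lines:
        if any(term in line for term in terms):
            excluded.append(line)
        else:
            kept.append(line)
    return kept, excluded


# Toy data for illustration only.
terms = ["水", "山"]
lines = ["今日は晴れです。", "山に登りました。", "水を飲みます。", "本を読みます。"]
kept, excluded = split_by_exclusion(lines, terms)
print(len(lines), len(kept), len(excluded))

# Sanity check on the reported counts: kept + excluded equals the input size.
assert 65_394 + 26_432 == 91_826
print(f"excluded share: {26_432 / 91_826:.1%}")
```

Under this assumption a single-character term removes every line that contains that character, which is consistent with the strict filter dropping a large share of ordinary text.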

JPN-Bench Result

Current public JPN-Bench tokenizer-literacy result:

{
  "items": 60,
  "passed_items": 33,
  "pass_rate": 0.55,
  "targets": 209,
  "target_passes": 178,
  "target_pass_rate": 0.8517,
  "total_unk": 0
}
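The two rates in this report are consistent with simple ratios of the counts, rounded to four decimal places. The snippet below recomputes them as a quick check; the rounding rule and the recompute_rates helper are assumptions, not a description of how JPN-Bench produces the report.

```python
# Values copied from the report above.
report = {
    "items": 60,
    "passed_items": 33,
    "pass_rate": 0.55,
    "targets": 209,
    "target_passes": 178,
    "target_pass_rate": 0.8517,
    "total_unk": 0,
}


def recompute_rates(r, ndigits=4):
    """Recompute item-level and target-level pass rates from raw counts."""
    item_rate = round(r["passed_items"] / r["items"], ndigits)
    target_rate = round(r["target_passes"] / r["targets"], ndigits)
    return item_rate, target_rate


item_rate, target_rate = recompute_rates(report)
assert item_rate == report["pass_rate"]
assert target_rate == report["target_pass_rate"]
print(item_rate, target_rate)
```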

The most visible weakness is fragmentation of common single-kanji targets after strict benchmark exclusion: these targets often fall back to byte pieces. This is expected for a clean baseline and should not be read as a claim of production-quality Japanese tokenization.

Intended Use

  • Japanese tokenizer experiments
  • LLM corpus-packing experiments
  • JPN-Bench regression tracking
  • source-separation and anti-overfit baselines
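For regression tracking, one simple approach is to compare a new literacy report against a stored baseline and flag any metric that got worse. This is a sketch only: it assumes both reports share the keys shown in the JPN-Bench Result section, and the candidate values below are made up for illustration.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {metric: (baseline_value, candidate_value)} for metrics that got worse.

    Pass rates regress when they drop by more than `tolerance`;
    total_unk regresses when it increases.
    """
    regressions = {}
    for key in ("pass_rate", "target_pass_rate"):
        if candidate[key] < baseline[key] - tolerance:
            regressions[key] = (baseline[key], candidate[key])
    if candidate["total_unk"] > baseline["total_unk"]:
        regressions["total_unk"] = (baseline["total_unk"], candidate["total_unk"])
    return regressions


# Baseline values come from the published report.
baseline = {"pass_rate": 0.55, "target_pass_rate": 0.8517, "total_unk": 0}
# Hypothetical candidate report, NOT a real measurement.
candidate = {"pass_rate": 0.50, "target_pass_rate": 0.86, "total_unk": 0}

print(find_regressions(baseline, candidate))
```

A small nonzero tolerance can help avoid flagging changes that amount to a single item or target flipping.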

Limitations

  • Tokenizer-only artifact, not a full pretrained LLM.
  • The paired prototype LLM run was intentionally not uploaded here.
  • Strict public benchmark exclusion reduces coverage of high-frequency Japanese surfaces.
  • JPN-Bench public literacy scores should be treated as dev/regression scores, not hidden benchmark scores.

Citation

@misc{kotodamalm_japanese_tokenizer_2026,
  title = {KotodamaLM Japanese Tokenizer},
  author = {MarcoDotIO},
  year = {2026},
  publisher = {Hugging Face},
  license = {CC-BY-4.0}
}
