Cijov-lang Tokenizer

A byte-level BPE tokenizer trained from scratch on a multilingual corpus covering English, French, Spanish, and Romanian, with additional coverage of Python code and mathematics.

Overview

Property            Value
Algorithm           Byte-level BPE
Vocab size          151,936
Languages           EN, FR, ES, RO
Additional domains  Python code, mathematics
Special tokens      25 (ChatML + tool-call + FIM)
Training data       840k documents (1.3 GB raw text)
License             Apache 2.0

Training Data Sources

The tokenizer was trained on a balanced multilingual corpus collected from publicly available datasets:

Source              Languages       Proportion
FineWeb-2           EN, FR, ES, RO  ~68% (web text)
Wikipedia           EN, FR, ES, RO  ~15% (encyclopedic)
CodeXGlue (Python)  Python          ~2.4% (code)
FineMath            EN              ~2.4% (mathematics)

Each language received equal document counts to ensure balanced merge learning across all four target languages.
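As a rough illustration of equal per-language document counts, the sketch below takes the same number of items from each language stream. The `take_balanced` helper and the `fake_stream` generator are hypothetical stand-ins, not part of the actual collection scripts, and the real pipeline may filter or deduplicate before counting.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def take_balanced(streams: Dict[str, Iterable[str]], per_language: int) -> Dict[str, List[str]]:
    """Take the same number of documents from every language stream."""
    return {lang: list(islice(iter(docs), per_language)) for lang, docs in streams.items()}


def fake_stream(lang: str) -> Iterator[str]:
    """Hypothetical endless document stream, standing in for a streamed dataset."""
    n = 0
    while True:
        yield f"{lang} document {n}"
        n += 1


streams = {lang: fake_stream(lang) for lang in ("en", "fr", "es", "ro")}
corpus = take_balanced(streams, per_language=5)
counts = {lang: len(docs) for lang, docs in corpus.items()}
print(counts)
```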

Architecture

  • Pre-tokenizer: Byte-level with GPT-2 regex splitting (whitespace-aware)
  • Model: BPE with merges learned entirely from the training corpus above
  • Decoder: Byte-level (lossless roundtrip for any UTF-8 input)
  • Post-processor: Byte-level with untrimmed offsets
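The lossless roundtrip comes from the byte-level layer: text is first turned into UTF-8 bytes, and each of the 256 byte values is mapped to a printable character before BPE runs. The sketch below reproduces the widely used GPT-2 style byte table in plain Python. It is an illustration of the idea only; the helper names are chosen here, and it leaves out the regex splitting and the learned merges.

```python
from typing import Dict


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2 style table: one printable character for each of the 256 byte values."""
    # Bytes that are already printable, non-space characters map to themselves.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    table = {b: chr(b) for b in keep}
    # Every other byte (control characters, space, ...) is shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text: str) -> str:
    """Encode text as UTF-8 and map every byte to its printable symbol."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols: str) -> str:
    """Invert the mapping and decode the bytes back to text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


sample = "mul\u021bumesc, \u00e7a va? \u4f60\u597d"
roundtrip = from_byte_symbols(to_byte_symbols(sample))
print(to_byte_symbols(" land"), roundtrip == sample)
```

Because every byte has a symbol, no input can fall outside the vocabulary; scripts the merges were not trained on simply cost more tokens.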

Special Tokens

Token             ID      Purpose
<|endoftext|>     151643  End of document / padding
<|im_start|>      151644  Chat turn start (ChatML)
<|im_end|>        151645  Chat turn end (ChatML)
<tool_call>       151657  Tool/function call start
</tool_call>      151658  Tool/function call end
<|fim_prefix|>    151659  Fill-in-the-middle prefix
<|fim_middle|>    151660  Fill-in-the-middle middle
<|fim_suffix|>    151661  Fill-in-the-middle suffix
<tool_response>   151665  Tool response start
</tool_response>  151666  Tool response end
<|cijov|>         151667  Model identity sentinel

The full list of 25 special tokens is available in special_tokens_map.json.
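One way to keep special tokens at fixed IDs, in the spirit of the reserved-token padding described under Training Procedure, is to fill every unused slot with a placeholder. The sketch below is an assumption about how such anchoring could work, not the actual training script: `anchor_special_tokens`, the `<|reserved_N|>` placeholder naming, and the 150,000-entry stand-in vocabulary are all hypothetical.

```python
from typing import Dict, List


def anchor_special_tokens(learned: List[str], specials: Dict[str, int], vocab_size: int) -> Dict[str, int]:
    """Build a token -> id map where specials keep their fixed IDs.

    Learned tokens take IDs from 0 upward; any slot left unused is filled
    with a hypothetical placeholder so the total size is exactly vocab_size.
    """
    if len(learned) > min(specials.values()):
        raise ValueError("learned vocabulary overlaps the special-token range")
    if max(specials.values()) >= vocab_size:
        raise ValueError("special-token ID is outside the vocabulary")
    id_to_token = dict(enumerate(learned))
    for token, token_id in specials.items():
        if token_id in id_to_token:
            raise ValueError(f"ID {token_id} is assigned twice")
        id_to_token[token_id] = token
    for token_id in range(vocab_size):
        # Placeholder name is an assumption made for this sketch.
        id_to_token.setdefault(token_id, f"<|reserved_{token_id}|>")
    return {token: token_id for token_id, token in id_to_token.items()}


# A subset of the documented special tokens and their IDs.
SPECIALS = {"<|endoftext|>": 151643, "<|im_start|>": 151644, "<|im_end|>": 151645, "<|cijov|>": 151667}
learned = [f"tok{i}" for i in range(150_000)]  # stand-in for bytes + learned merges
vocab = anchor_special_tokens(learned, SPECIALS, vocab_size=151_936)
print(len(vocab), vocab["<|im_start|>"])
```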

Chat Template

Built-in ChatML template:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
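For readers who want to see the layout without loading the tokenizer, the following sketch renders the same ChatML structure in plain Python. The `render_chatml` function is hypothetical, and its exact newline handling is an assumption based on the template shown above; the shipped chat_template.jinja remains the authoritative definition.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages as ChatML turns, one turn per role/content pair."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story."},
]
rendered = render_chatml(messages, add_generation_prompt=True)
print(rendered)
```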

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cijov/cijov-lang-tokenizer")

# Encode text
text = "Once upon a time in a faraway land"
ids = tokenizer.encode(text)
print(f"Tokens: {len(ids)}")

# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story."},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

Performance

Compression efficiency (characters per token) on held-out samples:

Language  Cijov-lang  Qwen3 baseline  Improvement
English   4.17        4.17            same
French    4.27        3.53            +21%
Spanish   4.31        3.21            +34%
Romanian  3.97        2.26            +76%

Higher is better (more characters encoded per token = more efficient).
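The metric and the improvement column can be reproduced with a few lines of Python. In this sketch, `chars_per_token` pools characters and tokens over all samples, which is an assumption about how the figures were aggregated; `encode` can be any callable that returns token IDs, such as `tokenizer.encode`. The improvement percentages are recomputed from the table values.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def chars_per_token(texts: Iterable[str], encode: Callable[[str], List]) -> float:
    """Total characters divided by total tokens over a set of samples."""
    total_chars = total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_chars / total_tokens


def improvement_pct(ours: float, baseline: float) -> int:
    """Relative gain in characters per token, rounded to a whole percent."""
    return round((ours / baseline - 1) * 100)


# (Cijov-lang, Qwen3 baseline) values copied from the table above.
TABLE: Dict[str, Tuple[float, float]] = {
    "English": (4.17, 4.17),
    "French": (4.27, 3.53),
    "Spanish": (4.31, 3.21),
    "Romanian": (3.97, 2.26),
}
improvements = {lang: improvement_pct(ours, base) for lang, (ours, base) in TABLE.items()}
print(improvements)

# Toy check of the metric with a whitespace "tokenizer" as a stand-in encoder.
toy_ratio = chars_per_token(["ab cd", "efgh"], lambda t: t.split())
```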

Files

├── tokenizer.json            # Full BPE vocab + merges
├── tokenizer_config.json     # HF tokenizer configuration
├── special_tokens_map.json   # Special token definitions
└── chat_template.jinja       # Standalone chat template

Training Procedure

  1. Corpus collection: Streamed ~200k documents per language from public HuggingFace datasets (web, Wikipedia, code, math).
  2. BPE training: Byte-level BPE with minimum frequency threshold of 2, learning merges until reaching 151,936 vocabulary entries.
  3. Special token anchoring: Reserved token padding ensures special tokens land at fixed IDs (151643–151667) regardless of learned vocab.
  4. Validation: Verified roundtrip integrity, compression ratios, and special token ID correctness.
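To make step 2 concrete, here is a toy byte-level BPE trainer in plain Python. It is a teaching sketch under simplifying assumptions, not the code used to train this tokenizer: it works on a small word-count dictionary, skips regex pre-tokenization, and recounts pairs from scratch on every iteration. The `train_bpe` name and its parameters are chosen for this example. It stops when the target vocabulary size is reached or when the most frequent pair falls below the minimum frequency.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[bytes, bytes]


def train_bpe(words: Dict[str, int], target_vocab: int, min_frequency: int = 2) -> List[Pair]:
    """Learn BPE merges over UTF-8 bytes; the base vocabulary is the 256 byte values."""
    corpus = {tuple(bytes([b]) for b in w.encode("utf-8")): c for w, c in words.items()}
    merges: List[Pair] = []
    vocab_size = 256
    while vocab_size < target_vocab:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair.
        new_corpus: Dict[tuple, int] = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = new_corpus.get(tuple(out), 0) + count
        corpus = new_corpus
        vocab_size += 1
    return merges


merges = train_bpe({"low": 5, "lower": 2, "lowest": 2, "new": 1}, target_vocab=258)
learned_tokens = {a + b for a, b in merges}
print(merges)
```

Production trainers reach the same result with incremental pair-count updates instead of a full recount, which matters at a 151,936-entry vocabulary.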

Intended Use

This tokenizer is designed for:

  • Multilingual text generation (EN/FR/ES/RO)
  • Code completion (Python)
  • Mathematical reasoning
  • Chat / instruction-following (ChatML format)
  • Fill-in-the-middle code completion (FIM tokens)
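For fill-in-the-middle use, the three FIM tokens delimit the code before and after the gap. The sketch below assumes the common prefix-suffix-middle ordering, in which the model generates the missing span after `<|fim_middle|>`. This card does not state the ordering, so treat both the layout and the hypothetical `build_fim_prompt` helper as assumptions to check against the model that uses this tokenizer.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assumed prefix-suffix-middle layout; the model completes after <|fim_middle|>."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


fim_prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
print(fim_prompt)
```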

Limitations

  • Optimised for Latin-script languages. CJK, Arabic, and Cyrillic text can still be encoded (the byte-level fallback guarantees no UNK tokens), but compression will be poor.
  • Trained on publicly available web data, so it inherits any biases present in the source corpora.

Citation

@misc{cijov-lang-tokenizer-2026,
  title  = {Cijov-lang Tokenizer: A Multilingual Byte-Level BPE Tokenizer},
  author = {Cijov},
  year   = {2026},
  url    = {https://huggingface.co/cijov/cijov-lang-tokenizer}
}

License

Apache 2.0
