Cijov-lang Tokenizer
A byte-level BPE tokenizer trained from scratch on a multilingual corpus covering English, French, Spanish, and Romanian, with additional coverage of Python code and mathematics.
Overview
| Property | Value |
|---|---|
| Algorithm | Byte-level BPE |
| Vocab size | 151,936 |
| Languages | EN, FR, ES, RO |
| Additional domains | Python code, mathematics |
| Special tokens | 25 (ChatML + tool-call + FIM) |
| Training data | |
| License | Apache 2.0 |
Training Data Sources
The tokenizer was trained on a balanced multilingual corpus collected from publicly available datasets:
| Source | Languages | Proportion |
|---|---|---|
| FineWeb-2 | EN, FR, ES, RO | ~68% (web text) |
| Wikipedia | EN, FR, ES, RO | ~15% (encyclopedic) |
| CodeXGlue (Python) | Python | ~2.4% (code) |
| FineMath | EN | ~2.4% (mathematics) |
Each language received equal document counts to ensure balanced merge learning across all four target languages.
Architecture
- Pre-tokenizer: Byte-level with GPT-2 regex splitting (whitespace-aware)
- Model: BPE with merges learned entirely from the training corpus above
- Decoder: Byte-level (lossless roundtrip for any UTF-8 input)
- Post-processor: Byte-level with untrimmed offsets
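The lossless roundtrip noted for the decoder follows from operating on raw UTF-8 bytes rather than on characters. As a minimal sketch, assuming this tokenizer uses the same byte-to-character table that GPT-2 introduced (the card does not state this explicitly), every byte value maps to one printable character and back. The helper names `to_byte_level` and `from_byte_level` are illustrative only.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style table: each of the 256 byte values gets a printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    # Remaining bytes (control characters, space, ...) are shifted above U+00FF.
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Map text to the byte-level alphabet that BPE merges operate on."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols: str) -> str:
    """Invert the mapping; this is why decoding never loses information."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")
```

Because the table covers all 256 byte values, any UTF-8 string (including scripts the merges were not optimised for) can be represented without an unknown token.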
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 151643 | End of document / padding |
| `<\|im_start\|>` | 151644 | Chat turn start (ChatML) |
| `<\|im_end\|>` | 151645 | Chat turn end (ChatML) |
| `<tool_call>` | 151657 | Tool/function call start |
| `</tool_call>` | 151658 | Tool/function call end |
| `<\|fim_prefix\|>` | 151659 | Fill-in-the-middle prefix |
| `<\|fim_middle\|>` | 151660 | Fill-in-the-middle middle |
| `<\|fim_suffix\|>` | 151661 | Fill-in-the-middle suffix |
| `<tool_response>` | 151665 | Tool response start |
| `</tool_response>` | 151666 | Tool response end |
| `<\|cijov\|>` | 151667 | Model identity sentinel |
The full list of 25 special tokens is available in special_tokens_map.json.
Chat Template
Built-in ChatML template:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
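To make the layout above concrete, here is a minimal sketch that renders the same ChatML structure in plain Python. The function name `render_chatml` is hypothetical, and the newline after each `<|im_end|>` is an assumption; the authoritative formatting is whatever chat_template.jinja defines.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render a list of {"role", "content"} messages in ChatML form."""
    parts = []
    for message in messages:
        # Each turn: start marker, role, newline, content, end marker.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

In practice `tokenizer.apply_chat_template` should be preferred, since it applies the template shipped with the tokenizer rather than a hand-written approximation.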
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cijov/cijov-lang-tokenizer")
# Encode text
text = "Once upon a time in a faraway land"
ids = tokenizer.encode(text)
print(f"Tokens: {len(ids)}")
# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story."},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
Performance
Compression efficiency (characters per token) on held-out samples:
| Language | Cijov-lang | Qwen3 baseline | Improvement |
|---|---|---|---|
| English | 4.17 | 4.17 | same |
| French | 4.27 | 3.53 | +21% |
| Spanish | 4.31 | 3.21 | +34% |
| Romanian | 3.97 | 2.26 | +76% |
Higher is better (more characters encoded per token = more efficient).
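The figures in the table can be reproduced in spirit with a small helper. This is a sketch under assumptions: the card does not say whether the ratio was computed over the whole held-out set or averaged per document, so the version below pools all characters and tokens. Both function names are illustrative.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(texts: Iterable[str], encode: Callable[[str], Sequence[int]]) -> float:
    """Pooled compression ratio: total characters divided by total tokens."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    return total_chars / total_tokens


def improvement_pct(candidate: float, baseline: float) -> int:
    """Relative gain of the candidate ratio over the baseline, in whole percent."""
    return round((candidate / baseline - 1.0) * 100)


# Checking the table rows: French, Spanish, Romanian.
print(improvement_pct(4.27, 3.53), improvement_pct(4.31, 3.21), improvement_pct(3.97, 2.26))
```

Passing `tokenizer.encode` as the `encode` argument gives the ratio for a real tokenizer; any callable that returns a sequence of token IDs works.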
Files
├── tokenizer.json # Full BPE vocab + merges
├── tokenizer_config.json # HF tokenizer configuration
├── special_tokens_map.json # Special token definitions
└── chat_template.jinja # Standalone chat template
Training Procedure
- Corpus collection: Streamed ~200k documents per language from public Hugging Face datasets (web, Wikipedia, code, math).
- BPE training: Byte-level BPE with minimum frequency threshold of 2, learning merges until reaching 151,936 vocabulary entries.
- Special token anchoring: Reserved token padding ensures special tokens land at fixed IDs (151643–151667) regardless of learned vocab.
- Validation: Verified roundtrip integrity, compression ratios, and special token ID correctness.
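The special token anchoring step can be illustrated with a toy example. This is a sketch of one possible approach, not the author's actual script: the function name `anchor_special_tokens` and the `<|reserved_N|>` placeholder naming are hypothetical, and small IDs stand in for the real range.

```python
def anchor_special_tokens(
    learned: list[str], specials: list[str], first_special_id: int
) -> dict[str, int]:
    """Pad a learned vocabulary so special tokens start at a fixed ID."""
    if len(learned) > first_special_id:
        raise ValueError("learned vocabulary overlaps the reserved special-token range")
    vocab = {token: idx for idx, token in enumerate(learned)}
    # Assumption: unused slots are filled with placeholder entries so the
    # next free ID is exactly first_special_id.
    for idx in range(len(learned), first_special_id):
        vocab[f"<|reserved_{idx}|>"] = idx
    # Special tokens then receive consecutive IDs from the anchor upward.
    for offset, token in enumerate(specials):
        vocab[token] = first_special_id + offset
    return vocab


toy = anchor_special_tokens(["a", "b", "ab"], ["<|endoftext|>", "<|im_start|>"], 5)
print(toy["<|endoftext|>"], toy["<|im_start|>"])
```

With the real range, 151643 through 151667 inclusive spans 25 IDs, which matches the stated count of special tokens.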
Intended Use
This tokenizer is designed for:
- Multilingual text generation (EN/FR/ES/RO)
- Code completion (Python)
- Mathematical reasoning
- Chat / instruction-following (ChatML format)
- Fill-in-the-middle code completion (FIM tokens)
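For the fill-in-the-middle use case, a prompt is typically assembled from the three FIM tokens. The sketch below assumes a prefix-suffix-middle ordering, which is a common convention but is not specified by this card; the ordering a given model expects depends on how that model was trained. The helper name `build_fim_prompt` is hypothetical.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle prompt (ordering is an assumption)."""
    # The model is expected to generate the missing middle after FIM_MIDDLE.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```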
Limitations
- Optimised for Latin-script languages. CJK / Arabic / Cyrillic coverage exists (byte-level guarantees no UNK) but compression will be poor.
- Trained on publicly available web data, so it inherits any biases present in the source corpora.
Citation
@misc{cijov-lang-tokenizer-2026,
title = {Cijov-lang Tokenizer: A Multilingual Byte-Level BPE Tokenizer},
author = {Cijov},
year = {2026},
url = {https://huggingface.co/cijov/cijov-lang-tokenizer}
}
License
Apache 2.0