Cijov-lang Tokenizer

A byte-level BPE tokenizer trained from scratch on a multilingual corpus covering English, French, Spanish, and Romanian, with additional coverage of Python code and mathematics.

Overview

| Property | Value |
|---|---|
| Algorithm | Byte-level BPE |
| Vocab size | 151,936 |
| Languages | EN, FR, ES, RO |
| Additional domains | Python code, mathematics |
| Special tokens | 25 (ChatML + tool-call + FIM) |
| Training data | ~840k documents (~1.3 GB raw text) |
| License | Apache 2.0 |

Training Data Sources

The tokenizer was trained on a balanced multilingual corpus collected from publicly available datasets:

| Source | Languages | Proportion |
|---|---|---|
| FineWeb-2 | EN, FR, ES, RO | ~68% (web text) |
| Wikipedia | EN, FR, ES, RO | ~15% (encyclopedic) |
| CodeXGlue (Python) | Python | ~2.4% (code) |
| FineMath | EN | ~2.4% (mathematics) |

Each language received equal document counts to ensure balanced merge learning across all four target languages.

Architecture

Pre-tokenizer: Byte-level with GPT-2 regex splitting (whitespace-aware)
Model: BPE with merges learned entirely from the training corpus above
Decoder: Byte-level (lossless roundtrip for any UTF-8 input)
Post-processor: Byte-level with untrimmed offsets
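
The lossless roundtrip claimed for the decoder follows from the byte-level alphabet: every UTF-8 byte is mapped to one printable character before BPE runs, and the mapping is reversible. The sketch below reproduces the well-known GPT-2 byte-to-character table using only the standard library. It illustrates the mapping step only; the regex splitting and the learned merges are not modelled, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build GPT-2's reversible byte -> printable-character table."""
    # Bytes that are already printable, non-space characters keep their code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # All other bytes (control characters, space, ...) are shifted above U+00FF.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Encode text as UTF-8 and map each byte to its stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols: str) -> str:
    """Invert to_byte_level: map characters back to bytes and decode UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


sample = "Hello, señor! 你好"
symbols = to_byte_level(sample)
print(symbols)
assert from_byte_level(symbols) == sample  # nothing is lost on the way back
```

Because all 256 byte values have a stand-in character, no input can fall outside the alphabet, which is why a byte-level tokenizer never needs an unknown token.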

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 151643 | End of document / padding |
| `<\|im_start\|>` | 151644 | Chat turn start (ChatML) |
| `<\|im_end\|>` | 151645 | Chat turn end (ChatML) |
| `<tool_call>` | 151657 | Tool/function call start |
| `</tool_call>` | 151658 | Tool/function call end |
| `<\|fim_prefix\|>` | 151659 | Fill-in-the-middle prefix |
| `<\|fim_middle\|>` | 151660 | Fill-in-the-middle middle |
| `<\|fim_suffix\|>` | 151661 | Fill-in-the-middle suffix |
| `<tool_response>` | 151665 | Tool response start |
| `</tool_response>` | 151666 | Tool response end |
| `<\|cijov\|>` | 151667 | Model identity sentinel |

The full list of 25 special tokens is available in special_tokens_map.json.
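
A downstream project that depends on these fixed IDs may want to check them at load time. The sketch below copies the IDs from the table above and compares them against any token-to-ID lookup. The function name `find_mismatches` is hypothetical, and the demonstration uses a stand-in lookup rather than the real tokenizer.

```python
from typing import Callable, Optional

# IDs copied from the special-token table in this document.
EXPECTED_IDS = {
    "<|endoftext|>": 151643,
    "<|im_start|>": 151644,
    "<|im_end|>": 151645,
    "<tool_call>": 151657,
    "</tool_call>": 151658,
    "<|fim_prefix|>": 151659,
    "<|fim_middle|>": 151660,
    "<|fim_suffix|>": 151661,
    "<tool_response>": 151665,
    "</tool_response>": 151666,
    "<|cijov|>": 151667,
}
VOCAB_SIZE = 151_936


def find_mismatches(
    lookup: Callable[[str], Optional[int]],
) -> dict[str, tuple[int, Optional[int]]]:
    """Return {token: (expected_id, actual_id)} for every token whose ID differs."""
    mismatches = {}
    for token, expected in EXPECTED_IDS.items():
        actual = lookup(token)
        if actual != expected:
            mismatches[token] = (expected, actual)
    return mismatches


# Every listed ID should sit inside the reserved range and below the vocab size.
assert all(151643 <= i <= 151667 for i in EXPECTED_IDS.values())
assert max(EXPECTED_IDS.values()) < VOCAB_SIZE

# Stand-in lookup built from the table itself; an empty dict means "all IDs match".
print(find_mismatches(EXPECTED_IDS.get))
```

With the tokenizer loaded through Transformers, passing `tokenizer.convert_tokens_to_ids` as `lookup` should perform the same check against the published files.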

Chat Template

Built-in ChatML template:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
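
For readers who want to see the layout without loading the Jinja template, here is a minimal pure-Python sketch of the same ChatML structure. The function name `render_chatml` is hypothetical, and the exact whitespace (one newline after each `<|im_end|>`) is an assumption; the shipped `chat_template.jinja` remains the reference.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render a list of {"role", "content"} messages in ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so a model would continue from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story."},
]
print(render_chatml(messages, add_generation_prompt=True))
```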

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cijov/cijov-lang-tokenizer")

# Encode text
text = "Once upon a time in a faraway land"
ids = tokenizer.encode(text)
print(f"Tokens: {len(ids)}")

# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a story."},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

Performance

Compression efficiency (characters per token) on held-out samples:

| Language | Cijov-lang | Qwen3 baseline | Improvement |
|---|---|---|---|
| English | 4.17 | 4.17 | same |
| French | 4.27 | 3.53 | +21% |
| Spanish | 4.31 | 3.21 | +34% |
| Romanian | 3.97 | 2.26 | +76% |

Higher is better (more characters encoded per token = more efficient).
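
The metric and the improvement column can be reproduced with a few lines of Python. This is a sketch under two assumptions: characters per token is aggregated as total characters over total tokens (the card does not say whether per-sample ratios were averaged instead), and improvement is the relative gain over the baseline. The function names are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Total characters divided by total tokens over a set of texts."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    return total_chars / total_tokens


def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage gain in characters per token over the baseline."""
    return (ours / baseline - 1.0) * 100.0


# (Cijov-lang, Qwen3 baseline) values copied from the table above.
TABLE = {
    "English": (4.17, 4.17),
    "French": (4.27, 3.53),
    "Spanish": (4.31, 3.21),
    "Romanian": (3.97, 2.26),
}
for language, (ours, baseline) in TABLE.items():
    print(f"{language}: {relative_improvement(ours, baseline):+.0f}%")

# Toy usage with whitespace splitting standing in for a real tokenizer.
print(chars_per_token(["a bc", "def"], str.split))
```

With a loaded tokenizer, `tokenizer.encode` could be passed as `encode` to measure real samples.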

Files

├── tokenizer.json            # Full BPE vocab + merges
├── tokenizer_config.json     # HF tokenizer configuration
├── special_tokens_map.json   # Special token definitions
└── chat_template.jinja       # Standalone chat template

Training Procedure

Corpus collection: Streamed ~200k documents per language from public HuggingFace datasets (web, Wikipedia, code, math).
BPE training: Byte-level BPE with minimum frequency threshold of 2, learning merges until reaching 151,936 vocabulary entries.
Special token anchoring: Reserved token padding ensures special tokens land at fixed IDs (151643–151667) regardless of learned vocab.
Validation: Verified roundtrip integrity, compression ratios, and special token ID correctness.
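
The special token anchoring step can be pictured as padding the learned vocabulary with reserved placeholders until the first special-token ID is reached. The card does not describe the actual implementation, so the sketch below is only one way to get that effect; the function name `anchor_special_tokens` and the `<|reserved_N|>` placeholder names are hypothetical.

```python
def anchor_special_tokens(
    learned: list[str],
    specials: list[str],
    first_special_id: int,
    vocab_size: int,
) -> dict[str, int]:
    """Lay out a vocab so that specials start exactly at first_special_id."""
    if len(learned) > first_special_id:
        raise ValueError("learned vocab overlaps the reserved special-token range")
    if first_special_id + len(specials) > vocab_size:
        raise ValueError("special tokens do not fit inside vocab_size")

    vocab = list(learned)
    # Fill the gap between the learned entries and the first special ID.
    vocab += [f"<|reserved_{i}|>" for i in range(len(vocab), first_special_id)]
    vocab += specials
    # Pad the tail so the final table has exactly vocab_size entries.
    vocab += [f"<|reserved_{i}|>" for i in range(len(vocab), vocab_size)]
    return {token: idx for idx, token in enumerate(vocab)}


# Toy scale: three learned entries, specials pinned at ID 5, eight slots in total.
toy = anchor_special_tokens(
    learned=["a", "b", "ab"],
    specials=["<|endoftext|>", "<|im_start|>"],
    first_special_id=5,
    vocab_size=8,
)
print(toy)

# At the scale stated in this card, the anchored range holds exactly 25 IDs.
assert 151_667 - 151_643 + 1 == 25
```

Whatever the learned merges turn out to be, the special tokens keep the same IDs as long as the learned part stays below the anchor.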

Intended Use

This tokenizer is designed for:

Multilingual text generation (EN/FR/ES/RO)
Code completion (Python)
Mathematical reasoning
Chat / instruction-following (ChatML format)
Fill-in-the-middle code completion (FIM tokens)

Limitations

Optimised for Latin-script languages. CJK, Arabic, and Cyrillic text can still be encoded (the byte-level alphabet guarantees no UNK tokens), but compression will be poor.
Trained on publicly available web data, so it inherits any biases present in the source corpora.
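
Part of the reason non-Latin scripts compress poorly is visible before any merges are applied: their characters take more UTF-8 bytes, so a byte-level tokenizer with few relevant merges starts from a longer sequence. The sketch below only measures that starting point; actual token counts depend on the learned merges, and the helper name `bytes_per_char` is hypothetical.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "education",
    "Russian": "образование",
    "Arabic": "تعليم",
    "Chinese": "教育",
}
for language, word in samples.items():
    print(f"{language}: {bytes_per_char(word):.1f} bytes per character")
```

In the worst case, where no merge applies, each of those bytes becomes its own token.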

Citation

@misc{cijov-lang-tokenizer-2026,
  title  = {Cijov-lang Tokenizer: A Multilingual Byte-Level BPE Tokenizer},
  author = {Cijov},
  year   = {2026},
  url    = {https://huggingface.co/cijov/cijov-lang-tokenizer}
}

License

Apache 2.0
