Cevher Tokenizer

Cevher Tokenizer is a compact Turkish-first byte-level BPE tokenizer for Cevher language models.

This release is optimized for a practical 60/40 mix: Turkish coverage comes first, while the tokenizer stays useful for English and for technical text such as code, math, logs, JSON, URLs, security advisories, and config files.

Model type: Byte-Level BPE
Vocabulary size: 64,000
Current checkpoint: Cevher 64k
Hard gates: exact UTF-8 roundtrip, zero UNK on eval fixtures

Install

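pip install transformers tokenizers

The batch and prompt examples below pass return_tensors="pt", which also requires PyTorch:

pip install torch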

Quick start

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

text = "İstanbul'un sokaklarında JSON logları ve Python kodu birlikte akıyor."
ids = tok.encode(text)

print(ids)
print(tok.decode(ids, skip_special_tokens=True))

Batch tokenization

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

texts = [
    "Merhaba dünya! Hello world!",
    "CVE-2026-1234 path=/api/v1/users?debug=true status=403",
    "def hesapla(x): return x**2 + 3",
]

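# padding=True pads every row to the longest text in the batch; this needs a pad token to be configured.
batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)       # (number of texts, longest sequence length)
print(batch["attention_mask"].shape)  # same shape; 0 marks padded positions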

Inspect tokens

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

text = "Türkiye'de e-ticaret, HTTP logları ve yapay zekâ modelleri."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)

for token_id, token in zip(ids, tokens):
    print(token_id, token)

Prompt-style text

Cevher Tokenizer is only a tokenizer. It does not define a chat template by itself. You can tokenize any plain prompt format used by your model.

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

prompt = "Kullanıcı: Bugün hava nasıl?\nAsistan:"
inputs = tok(prompt, return_tensors="pt")
print(inputs["input_ids"].shape)
print(tok.decode(inputs["input_ids"][0], skip_special_tokens=True))

Save and reload locally

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
tok.save_pretrained("./cevher-tokenizer-local")

reloaded = PreTrainedTokenizerFast.from_pretrained("./cevher-tokenizer-local")
text = "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor."
assert reloaded.decode(reloaded.encode(text), skip_special_tokens=True) == text

Example coverage

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

examples = [
    "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor.",
    "The model parses JSON, YAML, URLs, logs, and Python snippets.",
    r"\int_0^1 x^2 dx = \frac{1}{3}",
    "GET /api/v1/users?id=42 HTTP/1.1 200 OK",
    "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: cevher",
]

for text in examples:
    ids = tok.encode(text)
    print(len(ids), tok.decode(ids, skip_special_tokens=True) == text, text)

All examples above were smoke-tested with transformers.PreTrainedTokenizerFast.

Evaluation summary

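Metric: tokens per word (TPW); lower is better. The score equals the weighted TPW when the roundtrip rate is 1.0000 and adds a penalty otherwise, which is why XLM-RoBERTa-Base and mT5-Small rank last despite mid-table TPW values.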
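
As a rough illustration of the metric, the sketch below computes tokens per word for any encode function. It assumes a "word" is a whitespace-delimited chunk, which this card does not state explicitly, and it does not reproduce the domain weighting behind the weighted TPW column. The byte_encode helper is a hypothetical stand-in so the snippet runs offline.

```python
from typing import Callable, Iterable, List


def tokens_per_word(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Total token count divided by total whitespace-delimited word count."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        # Assumption: words are split on whitespace only.
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def byte_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


tpw = tokens_per_word(byte_encode, ["Merhaba dünya", "Hello world"])
print(f"TPW with the byte-level stand-in: {tpw:.2f}")
```

To measure the real tokenizer, pass something like `lambda t: tok.encode(t, add_special_tokens=False)` as the encode argument, so special tokens are not counted as content tokens.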

Cevher 64k is the public tokenizer in this repository. Cevher 49k is reported for transparency, but is currently intended for internal/Core model use only.

rank	tokenizer	weighted TPW ↓	score ↓	roundtrip
1	`Cevher 64k`	1.8266	1.8266	1.0000
2	`Cevher 49k` (internal use)	1.8877	1.8877	1.0000
3	`GPT2-Turkish-Cased`	2.1618	2.1618	1.0000
4	`Bonur-GPT2-Turkish`	2.2559	2.2559	1.0000
5	`Kumru-2B-Base`	2.3186	2.3186	1.0000
6	`Llama-3.2-1B`	2.4186	2.4186	1.0000
7	`Gemma-3-1B-IT`	2.5698	2.5698	1.0000
8	`Gemma-2-2B`	2.5821	2.5821	1.0000
9	`Qwen2.5-0.5B`	2.8433	2.8433	1.0000
10	`Qwen3-0.6B`	2.8433	2.8433	1.0000
11	`XLM-RoBERTa-Base`	2.3228	336.3228	0.6660
12	`mT5-Small`	2.4890	336.4890	0.6660
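
The score column above is consistent with adding 1000 × (1 − roundtrip) to the weighted TPW. That formula is inferred from the numbers in the table, not taken from a documented definition, so treat the following as a sketch; the function name and the penalty constant are assumptions.

```python
def penalized_score(weighted_tpw: float, roundtrip: float, penalty: float = 1000.0) -> float:
    """Weighted TPW plus a penalty proportional to the roundtrip failure rate.

    Assumption: penalty = 1000 per unit of failure rate, inferred from the table.
    """
    return weighted_tpw + penalty * (1.0 - roundtrip)


# (weighted TPW, roundtrip, reported score) copied from the table above.
rows = {
    "Cevher 64k": (1.8266, 1.0000, 1.8266),
    "XLM-RoBERTa-Base": (2.3228, 0.6660, 336.3228),
    "mT5-Small": (2.4890, 0.6660, 336.4890),
}

for name, (tpw, roundtrip, reported) in rows.items():
    computed = penalized_score(tpw, roundtrip)
    print(f"{name}: computed={computed:.4f} reported={reported:.4f}")
```

With a perfect roundtrip rate the penalty term vanishes, so the score and weighted TPW columns match for the first ten rows.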

Domain matrix

Values are tokens per word (TPW); lower is better. Columns are grouped by broad use case.

tokenizer	TR general	TR news/legal	TR noisy	TR morphology	TR apostrophe	TR case-i	EN general	mixed TR/EN	code	math	logs/JSON/URL	security	finance	medical	ecommerce	education	search	multilingual	emoji/symbols
`Cevher 64k`	1.235	1.303	1.349	1.938	1.361	1.628	1.278	1.470	2.349	2.945	9.205	1.866	1.240	1.448	1.441	1.242	1.074	2.000	2.394
`Cevher 49k` (internal use)	1.275	1.347	1.372	1.938	1.389	1.628	1.306	1.588	2.419	2.945	9.505	1.966	1.298	1.586	1.559	1.363	1.111	2.058	2.424
`GPT2-Turkish-Cased`	1.249	1.255	1.349	2.004	1.444	1.571	2.528	2.088	3.509	3.636	12.462	2.372	1.327	1.552	1.584	1.273	1.323	2.058	3.021
`Bonur-GPT2-Turkish`	1.277	1.355	1.395	1.975	1.384	1.686	2.750	2.177	3.834	3.607	13.254	2.511	1.259	1.724	1.583	1.303	1.487	2.206	3.262
`Kumru-2B-Base`	1.294	1.278	1.535	1.792	1.672	1.628	2.417	2.382	3.579	3.442	14.275	2.700	1.576	1.655	1.908	1.181	1.481	2.353	3.145
`Llama-3.2-1B`	1.980	2.320	1.884	3.204	2.136	2.143	1.194	1.588	2.093	2.612	7.905	2.400	2.421	2.483	2.383	2.212	1.593	2.147	2.394
`Gemma-3-1B-IT`	2.000	2.298	1.860	3.013	2.625	2.314	1.167	1.706	2.463	2.720	10.775	2.634	2.365	2.345	2.438	2.061	1.593	1.970	2.266
`Gemma-2-2B`	2.020	2.398	1.814	3.200	2.597	2.228	1.167	1.676	2.439	2.720	10.574	2.600	2.444	2.345	2.438	2.242	1.444	1.970	2.357
`Qwen2.5-0.5B`	2.431	2.778	2.116	3.900	2.753	2.485	1.194	1.853	2.230	2.664	9.573	2.834	2.839	2.897	2.791	2.364	1.926	2.294	2.691
`Qwen3-0.6B`	2.431	2.778	2.116	3.900	2.753	2.485	1.194	1.853	2.230	2.664	9.573	2.834	2.839	2.897	2.791	2.364	1.926	2.294	2.691
`XLM-RoBERTa-Base`	1.613	1.814	1.628	2.442	2.005	1.971	1.500	1.882	3.074	3.254	11.458	2.306	1.815	2.000	1.773	1.727	1.492	1.941	2.286
`mT5-Small`	1.908	2.184	1.930	2.913	2.602	2.257	1.695	2.000	2.879	3.019	8.860	2.434	2.013	2.138	2.028	2.030	1.825	2.235	2.348
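
One way to read the matrix is to pick the lowest value per column. The sketch below does this for a small subset copied from the table (three tokenizers, four columns), so its "best" is only the best among the rows included here, not among all twelve tokenizers.

```python
from typing import Dict

# Subset of the domain matrix above; values copied verbatim.
tpw_matrix: Dict[str, Dict[str, float]] = {
    "Cevher 64k": {"TR general": 1.235, "TR news/legal": 1.303, "EN general": 1.278, "code": 2.349},
    "GPT2-Turkish-Cased": {"TR general": 1.249, "TR news/legal": 1.255, "EN general": 2.528, "code": 3.509},
    "Llama-3.2-1B": {"TR general": 1.980, "TR news/legal": 2.320, "EN general": 1.194, "code": 2.093},
}


def best_per_domain(matrix: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each domain, the tokenizer with the lowest TPW in the given matrix."""
    domains = next(iter(matrix.values())).keys()
    return {d: min(matrix, key=lambda name: matrix[name][d]) for d in domains}


for domain, winner in best_per_domain(tpw_matrix).items():
    print(f"{domain}: {winner}")
```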

Tokenization examples against top competitors

The following examples compare token counts on complex mixed texts. Lower token count means the same text fits into fewer model tokens.

Competitors shown here are the top three non-Cevher tokenizers by overall evaluation score: GPT2-Turkish-Cased, Bonur-GPT2-Turkish, and Kumru-2B-Base.

example	Cevher 64k	Cevher 49k (internal use)	GPT2-Turkish-Cased	Bonur-GPT2-Turkish	Kumru-2B-Base
Turkish legal + finance	59	59	63	64	85
Mixed TR/EN technical support	67	69	84	89	97
Code + YAML config	76	83	103	111	90
Math + LaTeX physics	63	63	77	78	78
Logs + JSON + URL	75	78	99	104	112
Noisy social + ecommerce	46	48	52	47	57
English product + policy	38	39	68	70	63
English technical incident	36	37	64	63	62
English research/math	37	38	58	59	62
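
To put the counts in relative terms, the sketch below converts a pair of token counts into a percentage saving. The counts are copied from the Cevher 64k and GPT2-Turkish-Cased columns of the table; the function name is illustrative.

```python
def token_savings_percent(candidate_tokens: int, baseline_tokens: int) -> float:
    """Percentage of tokens saved by the candidate relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (1.0 - candidate_tokens / baseline_tokens)


# (Cevher 64k, GPT2-Turkish-Cased) token counts from the table above.
rows = {
    "Mixed TR/EN technical support": (67, 84),
    "Logs + JSON + URL": (75, 99),
    "English product + policy": (38, 68),
}

for name, (cevher, gpt2_turkish) in rows.items():
    saving = token_savings_percent(cevher, gpt2_turkish)
    print(f"{name}: {saving:.1f}% fewer tokens")
```

These are single-text comparisons, so the percentages describe these specific examples rather than average behavior.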

Example texts used for the table

Turkish legal + finance

İstanbul 12. Asliye Ticaret Mahkemesi'nin 2026/142 E. sayılı dosyasında, şirketin 31.03.2026 tarihli bilançosunda özkaynak/borç oranı %18,7 olarak raporlandı; USD/TRY 41,25 seviyesindeyken BIST-100 endeksi gün içinde %2,14 yükseldi.

Mixed TR/EN technical support

Kullanıcı login olamıyor: frontend `401 Unauthorized` alıyor, backend loglarında `jwt expired at 2026-06-24T10:15:33Z` görünüyor; refresh-token endpoint'i `/api/v2/auth/refresh?tenant=tr-prod` üzerinden 302 dönmüş.

Code + YAML config

```python
def normalize_user(row: dict) -> dict:
    return {"id": int(row["id"]), "şehir": row.get("city", "İstanbul")}
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cevher-api
```

Math + LaTeX physics

Enerji denklemi $E=mc^2$ ile başlar; ayrıca $\nabla\cdot\vec{E}=\rho/\epsilon_0$ ve $\int_0^1 x^2 dx = 1/3$ ifadeleri modelin LaTeX, Unicode ve Türkçe açıklama karışımını nasıl böldüğünü gösterir.

Logs + JSON + URL

2026-06-24T12:44:09.331Z level=ERROR service=payment trace_id=tr_7fa9 status=503 url=https://api.example.com/v1/orders?id=42&currency=TRY payload={"user":"ayşe","amount":1299.90,"retry":true}

Noisy social + ecommerce

abi ürün baya iyiymiş 😂 ama kargo 3 günde geldi... satıcıya sordum 'iade kodu nerede?' dedi ki Hepsiburada panelinden alıyosun; fiyat 1.299,90 TL -> kuponla 999 TL oldu #efsane

English product + policy

The subscription renewal failed because the billing address was updated after the invoice was issued; please verify the tax ID, payment method, and regional compliance settings before retrying.

English technical incident

The Kubernetes pod entered CrashLoopBackOff after the readiness probe timed out at /healthz; logs show connection refused on port 8080 and a missing DATABASE_URL secret.

English research/math

In the ablation study, perplexity decreased from 8.42 to 7.91 while the tokenizer reduced average sequence length by 6.3% on English scientific abstracts.

Interpretation

Cevher Tokenizer is strongest on the target distribution: Turkish-first text plus practical English/technical coverage. It has the lowest weighted TPW (1.8266) of the twelve tokenizers evaluated while using a compact 64k vocabulary.

Where Cevher is strongest:

Turkish general text
Turkish noisy/social text
Turkish apostrophes and proper-name suffixes
Mixed Turkish-English text
Turkish finance, medical, ecommerce, and search/query style text
Security/config style text

Where external tokenizers can still be better:

Llama-style tokenizers are stronger on code, math, and logs/URL-heavy text.
Gemma-style tokenizers are stronger on English-only and emoji/symbol-heavy slices.
Kumru is strong on some Turkish morphology and education fixtures.
GPT2-Turkish remains strong on some news/legal and Turkish case-i fixtures.

This is an intended tradeoff: Cevher is not a giant universal tokenizer. It is a small Turkish-first tokenizer with enough English and technical coverage to serve Cevher models efficiently.

Intended use

Use this tokenizer for Cevher model pretraining, fine-tuning, evaluation, and inference when the expected text distribution is Turkish-heavy but includes English and technical content.

Good fit:

Turkish LLMs
Turkish-English mixed applications
Turkish web, social, finance, medical, ecommerce, education, and search data
Code/log/config/security/math text appearing inside mostly Turkish or bilingual corpora

Less ideal as a standalone choice:

English-only LLMs
code-only models
math-only models
multilingual models where hundreds of languages have equal priority

Limitations

Tokenizer efficiency does not measure model quality. A lower TPW helps sequence efficiency but does not guarantee better downstream accuracy.

The evaluation set is intrinsic and tokenizer-focused. It measures segmentation efficiency, exact roundtrip behavior, UNK rate, and byte fallback rate. It does not measure reasoning, factuality, safety, or task performance.
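
The roundtrip and UNK checks named above can be sketched as follows. This is a minimal illustration, not the evaluation harness used for this card: check_gates and ByteTokenizer are hypothetical names, and ByteTokenizer is an offline stand-in that exposes the same encode, decode, and unk_token_id members as a transformers tokenizer.

```python
from typing import Dict, Iterable, List, Optional


class ByteTokenizer:
    """Hypothetical offline stand-in: one token id per UTF-8 byte, no UNK token."""

    unk_token_id: Optional[int] = None

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int], skip_special_tokens: bool = True) -> str:
        return bytes(ids).decode("utf-8")


def check_gates(tok, texts: Iterable[str]) -> Dict[str, float]:
    """Return the exact-roundtrip rate and the UNK rate over the given texts."""
    n_texts = 0
    n_exact = 0
    n_tokens = 0
    n_unk = 0
    for text in texts:
        ids = list(tok.encode(text))
        n_texts += 1
        n_tokens += len(ids)
        # Exact roundtrip: decoding must reproduce the input string unchanged.
        if tok.decode(ids, skip_special_tokens=True) == text:
            n_exact += 1
        # Count UNK ids only if the tokenizer defines an UNK token.
        if tok.unk_token_id is not None:
            n_unk += ids.count(tok.unk_token_id)
    return {
        "roundtrip_rate": n_exact / n_texts if n_texts else 1.0,
        "unk_rate": n_unk / n_tokens if n_tokens else 0.0,
    }


report = check_gates(ByteTokenizer(), ["İstanbul'un sokakları", "status=403 😂"])
print(report)
```

To run the same checks against the real tokenizer, pass the tok object loaded with PreTrainedTokenizerFast.from_pretrained in place of ByteTokenizer().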

License and provenance

The tokenizer artifact is released under Apache-2.0.

Training data was prepared for tokenizer learning, not for direct text generation. The corpus mix was designed around Turkish-first coverage with English and technical support. It includes Turkish web/news/wiki-style text, Turkish vertical-domain text, English educational/technical text, code/config/log/math/security components, and deterministic synthetic text for rare but important patterns such as URLs, JSON, Kubernetes/YAML, CVE/security strings, Turkish apostrophes, and Turkish casing.

Source preparation included normalization and configured PII/secret scrubbing for relevant components. The tokenizer stores learned tokenization statistics only; it is not a language model and cannot reproduce training documents by itself.
