Cevher Tokenizer
Cevher Tokenizer is a compact Turkish-first byte-level BPE tokenizer for Cevher language models.
This release is optimized for a practical 60/40 mix: strong Turkish coverage first, while staying useful for English and technical text such as code, math, logs, JSON, URLs, security advisories, and config files.
- Model type: Byte-Level BPE
- Vocabulary size: 64,000
- Current checkpoint: Cevher 64k
- Hard gates: exact UTF-8 roundtrip, zero UNK on eval fixtures
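The two hard gates can be checked on your own fixtures with a few lines of Python. This is a minimal sketch, not the project's evaluation harness: `check_hard_gates` is a hypothetical helper, and it assumes only that the tokenizer object exposes `encode`, `decode`, and `unk_token_id`, as `PreTrainedTokenizerFast` does.

```python
def check_hard_gates(tok, texts):
    """Return (roundtrip_rate, unk_rate) over a list of fixture strings.

    Hypothetical helper: `tok` is any object with `encode`, `decode`, and
    `unk_token_id`, such as a transformers PreTrainedTokenizerFast.
    """
    exact = 0
    unk = 0
    total_tokens = 0
    for text in texts:
        ids = tok.encode(text)
        # Gate 1: decoding must reproduce the input exactly.
        if tok.decode(ids, skip_special_tokens=True) == text:
            exact += 1
        # Gate 2: count unknown tokens (only if the tokenizer defines one).
        if tok.unk_token_id is not None:
            unk += sum(1 for i in ids if i == tok.unk_token_id)
        total_tokens += len(ids)
    roundtrip_rate = exact / len(texts) if texts else 1.0
    unk_rate = unk / total_tokens if total_tokens else 0.0
    return roundtrip_rate, unk_rate
```

A tokenizer that passes both gates on the given fixtures returns `(1.0, 0.0)`.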
Install
pip install transformers tokenizers
Quick start
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
text = "İstanbul'un sokaklarında JSON logları ve Python kodu birlikte akıyor."
ids = tok.encode(text)
print(ids)
print(tok.decode(ids, skip_special_tokens=True))
Batch tokenization
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
texts = [
    "Merhaba dünya! Hello world!",
    "CVE-2026-1234 path=/api/v1/users?debug=true status=403",
    "def hesapla(x): return x**2 + 3",
]
batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
Inspect tokens
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
text = "Türkiye'de e-ticaret, HTTP logları ve yapay zekâ modelleri."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)
for token_id, token in zip(ids, tokens):
    print(token_id, token)
Prompt-style text
Cevher Tokenizer is only a tokenizer. It does not define a chat template by itself. You can tokenize any plain prompt format used by your model.
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
prompt = "Kullanıcı: Bugün hava nasıl?\nAsistan:"
inputs = tok(prompt, return_tensors="pt")
print(inputs["input_ids"].shape)
print(tok.decode(inputs["input_ids"][0], skip_special_tokens=True))
Save and reload locally
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
tok.save_pretrained("./cevher-tokenizer-local")
reloaded = PreTrainedTokenizerFast.from_pretrained("./cevher-tokenizer-local")
text = "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor."
assert reloaded.decode(reloaded.encode(text), skip_special_tokens=True) == text
Example coverage
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
examples = [
    "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor.",
    "The model parses JSON, YAML, URLs, logs, and Python snippets.",
    r"\\int_0^1 x^2 dx = \\frac{1}{3}",
    "GET /api/v1/users?id=42 HTTP/1.1 200 OK",
    "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: cevher",
]
for text in examples:
    ids = tok.encode(text)
    # Print token count, whether the roundtrip is exact, and the text itself.
    print(len(ids), tok.decode(ids, skip_special_tokens=True) == text, text)
All examples above were smoke-tested with transformers.PreTrainedTokenizerFast.
Evaluation summary
Metric: tokens per word (TPW); lower is better. The score equals the weighted TPW plus a roundtrip penalty, which applies only to tokenizers that do not roundtrip exactly (roundtrip below 1.0000).
Cevher 64k is the public tokenizer in this repository. Cevher 49k is reported for transparency, but is currently intended for internal/Core model use only.
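To get a comparable figure on your own text, the sketch below computes an unweighted tokens-per-word ratio. It assumes a word is a whitespace-separated chunk, which may differ from the word definition and the per-domain weighting behind the reported numbers; `tokens_per_word` is a hypothetical helper.

```python
def tokens_per_word(encode, texts):
    """Total tokens divided by total whitespace-separated words.

    `encode` is any callable mapping a string to a list of token ids,
    for example `tok.encode` from the quick start.
    """
    n_tokens = sum(len(encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words
```

If your tokenizer adds special tokens on `encode`, pass a callable that disables them (for example `lambda s: tok.encode(s, add_special_tokens=False)`) so they do not inflate the ratio.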
| rank | tokenizer | weighted TPW ↓ | score ↓ | roundtrip | UNK |
|---|---|---|---|---|---|
| 1 | Cevher 64k | 1.8266 | 1.8266 | 1.0000 | 0.000000 |
| 2 | Cevher 49k (internal use) | 1.8877 | 1.8877 | 1.0000 | 0.000000 |
| 3 | GPT2-Turkish-Cased | 2.1618 | 2.1618 | 1.0000 | 0.000000 |
| 4 | Bonur-GPT2-Turkish | 2.2559 | 2.2559 | 1.0000 | 0.000000 |
| 5 | Kumru-2B-Base | 2.3186 | 2.3186 | 1.0000 | 0.000000 |
| 6 | Llama-3.2-1B | 2.4186 | 2.4186 | 1.0000 | 0.000000 |
| 7 | Gemma-3-1B-IT | 2.5698 | 2.5698 | 1.0000 | 0.000000 |
| 8 | Gemma-2-2B | 2.5821 | 2.5821 | 1.0000 | 0.000000 |
| 9 | Qwen2.5-0.5B | 2.8433 | 2.8433 | 1.0000 | 0.000000 |
| 10 | Qwen3-0.6B | 2.8433 | 2.8433 | 1.0000 | 0.000000 |
| 11 | XLM-RoBERTa-Base | 2.3228 | 336.3228 | 0.6660 | 0.000000 |
| 12 | mT5-Small | 2.4890 | 336.4890 | 0.6660 | 0.000000 |
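The penalty formula is not spelled out here, but the rows above are consistent with adding 1000 * (1 - roundtrip) to the weighted TPW. The check below is an inference from the published numbers, not a documented formula; `score_with_penalty` and the factor of 1000 are assumptions.

```python
def score_with_penalty(weighted_tpw, roundtrip, penalty_weight=1000.0):
    """Assumed scoring rule: weighted TPW plus a penalty for non-exact roundtrips."""
    return weighted_tpw + penalty_weight * (1.0 - roundtrip)

# Rows copied from the table: (weighted TPW, roundtrip, reported score).
rows = {
    "Cevher 64k": (1.8266, 1.0000, 1.8266),
    "XLM-RoBERTa-Base": (2.3228, 0.6660, 336.3228),
    "mT5-Small": (2.4890, 0.6660, 336.4890),
}
for name, (tpw, roundtrip, reported) in rows.items():
    predicted = score_with_penalty(tpw, roundtrip)
    # Compare the assumed formula against the reported score.
    print(name, round(predicted, 4), abs(predicted - reported) < 1e-6)
```

Under this reading, the two tokenizers with roundtrip 0.6660 rank last because of the penalty, even though their raw weighted TPW is mid-table.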
Domain matrix
Lower is better. Columns are grouped by broad use case.
| tokenizer | TR general | TR news/legal | TR noisy | TR morphology | TR apostrophe | TR case-i | EN general | mixed TR/EN | code | math | logs/JSON/URL | security | finance | medical | ecommerce | education | search | multilingual | emoji/symbols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cevher 64k | 1.235 | 1.303 | 1.349 | 1.938 | 1.361 | 1.628 | 1.278 | 1.470 | 2.349 | 2.945 | 9.205 | 1.866 | 1.240 | 1.448 | 1.441 | 1.242 | 1.074 | 2.000 | 2.394 |
| Cevher 49k (internal use) | 1.275 | 1.347 | 1.372 | 1.938 | 1.389 | 1.628 | 1.306 | 1.588 | 2.419 | 2.945 | 9.505 | 1.966 | 1.298 | 1.586 | 1.559 | 1.363 | 1.111 | 2.058 | 2.424 |
| GPT2-Turkish-Cased | 1.249 | 1.255 | 1.349 | 2.004 | 1.444 | 1.571 | 2.528 | 2.088 | 3.509 | 3.636 | 12.462 | 2.372 | 1.327 | 1.552 | 1.584 | 1.273 | 1.323 | 2.058 | 3.021 |
| Bonur-GPT2-Turkish | 1.277 | 1.355 | 1.395 | 1.975 | 1.384 | 1.686 | 2.750 | 2.177 | 3.834 | 3.607 | 13.254 | 2.511 | 1.259 | 1.724 | 1.583 | 1.303 | 1.487 | 2.206 | 3.262 |
| Kumru-2B-Base | 1.294 | 1.278 | 1.535 | 1.792 | 1.672 | 1.628 | 2.417 | 2.382 | 3.579 | 3.442 | 14.275 | 2.700 | 1.576 | 1.655 | 1.908 | 1.181 | 1.481 | 2.353 | 3.145 |
| Llama-3.2-1B | 1.980 | 2.320 | 1.884 | 3.204 | 2.136 | 2.143 | 1.194 | 1.588 | 2.093 | 2.612 | 7.905 | 2.400 | 2.421 | 2.483 | 2.383 | 2.212 | 1.593 | 2.147 | 2.394 |
| Gemma-3-1B-IT | 2.000 | 2.298 | 1.860 | 3.013 | 2.625 | 2.314 | 1.167 | 1.706 | 2.463 | 2.720 | 10.775 | 2.634 | 2.365 | 2.345 | 2.438 | 2.061 | 1.593 | 1.970 | 2.266 |
| Gemma-2-2B | 2.020 | 2.398 | 1.814 | 3.200 | 2.597 | 2.228 | 1.167 | 1.676 | 2.439 | 2.720 | 10.574 | 2.600 | 2.444 | 2.345 | 2.438 | 2.242 | 1.444 | 1.970 | 2.357 |
| Qwen2.5-0.5B | 2.431 | 2.778 | 2.116 | 3.900 | 2.753 | 2.485 | 1.194 | 1.853 | 2.230 | 2.664 | 9.573 | 2.834 | 2.839 | 2.897 | 2.791 | 2.364 | 1.926 | 2.294 | 2.691 |
| Qwen3-0.6B | 2.431 | 2.778 | 2.116 | 3.900 | 2.753 | 2.485 | 1.194 | 1.853 | 2.230 | 2.664 | 9.573 | 2.834 | 2.839 | 2.897 | 2.791 | 2.364 | 1.926 | 2.294 | 2.691 |
| XLM-RoBERTa-Base | 1.613 | 1.814 | 1.628 | 2.442 | 2.005 | 1.971 | 1.500 | 1.882 | 3.074 | 3.254 | 11.458 | 2.306 | 1.815 | 2.000 | 1.773 | 1.727 | 1.492 | 1.941 | 2.286 |
| mT5-Small | 1.908 | 2.184 | 1.930 | 2.913 | 2.602 | 2.257 | 1.695 | 2.000 | 2.879 | 3.019 | 8.860 | 2.434 | 2.013 | 2.138 | 2.028 | 2.030 | 1.825 | 2.235 | 2.348 |
Tokenization examples against top competitors
The following examples compare token counts on complex mixed texts. Lower token count means the same text fits into fewer model tokens.
Competitors shown here are the top three non-Cevher tokenizers by overall evaluation score: GPT2-Turkish-Cased, Bonur-GPT2-Turkish, and Kumru-2B-Base.
| example | Cevher 64k | Cevher 49k (internal use) | GPT2-Turkish-Cased | Bonur-GPT2-Turkish | Kumru-2B-Base |
|---|---|---|---|---|---|
| Turkish legal + finance | 59 | 59 | 63 | 64 | 85 |
| Mixed TR/EN technical support | 67 | 69 | 84 | 89 | 97 |
| Code + YAML config | 76 | 83 | 103 | 111 | 90 |
| Math + LaTeX physics | 63 | 63 | 77 | 78 | 78 |
| Logs + JSON + URL | 75 | 78 | 99 | 104 | 112 |
| Noisy social + ecommerce | 46 | 48 | 52 | 47 | 57 |
| English product + policy | 38 | 39 | 68 | 70 | 63 |
| English technical incident | 36 | 37 | 64 | 63 | 62 |
| English research/math | 37 | 38 | 58 | 59 | 62 |
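To turn the counts above into relative savings, the snippet below computes the percentage reduction of Cevher 64k against each competitor for two rows of the table. `reduction_pct` is a hypothetical helper; the counts are copied from the table.

```python
def reduction_pct(ours, theirs):
    """Percentage fewer tokens `ours` needs relative to `theirs`."""
    return 100.0 * (theirs - ours) / theirs

# Token counts copied from the table: Cevher 64k first, then the competitors.
counts = {
    "Logs + JSON + URL": (
        75,
        {"GPT2-Turkish-Cased": 99, "Bonur-GPT2-Turkish": 104, "Kumru-2B-Base": 112},
    ),
    "Noisy social + ecommerce": (
        46,
        {"GPT2-Turkish-Cased": 52, "Bonur-GPT2-Turkish": 47, "Kumru-2B-Base": 57},
    ),
}
for example, (cevher, others) in counts.items():
    for name, n in others.items():
        print(f"{example}: {reduction_pct(cevher, n):.1f}% fewer tokens than {name}")
```

The gap varies widely by row: it is large on the logs example and narrow on the noisy social example, where Bonur-GPT2-Turkish is within one token.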
Example texts used for the table
Turkish legal + finance
İstanbul 12. Asliye Ticaret Mahkemesi'nin 2026/142 E. sayılı dosyasında, şirketin 31.03.2026 tarihli bilançosunda özkaynak/borç oranı %18,7 olarak raporlandı; USD/TRY 41,25 seviyesindeyken BIST-100 endeksi gün içinde %2,14 yükseldi.
Mixed TR/EN technical support
Kullanıcı login olamıyor: frontend `401 Unauthorized` alıyor, backend loglarında `jwt expired at 2026-06-24T10:15:33Z` görünüyor; refresh-token endpoint'i `/api/v2/auth/refresh?tenant=tr-prod` üzerinden 302 dönmüş.
Code + YAML config
```python
def normalize_user(row: dict) -> dict:
    return {"id": int(row["id"]), "şehir": row.get("city", "İstanbul")}
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cevher-api
```
Math + LaTeX physics
Enerji denklemi $E=mc^2$ ile başlar; ayrıca $\nabla\cdot\vec{E}=\rho/\epsilon_0$ ve $\int_0^1 x^2 dx = 1/3$ ifadeleri modelin LaTeX, Unicode ve Türkçe açıklama karışımını nasıl böldüğünü gösterir.
Logs + JSON + URL
2026-06-24T12:44:09.331Z level=ERROR service=payment trace_id=tr_7fa9 status=503 url=https://api.example.com/v1/orders?id=42&currency=TRY payload={"user":"ayşe","amount":1299.90,"retry":true}
Noisy social + ecommerce
abi ürün baya iyiymiş 😂 ama kargo 3 günde geldi... satıcıya sordum 'iade kodu nerede?' dedi ki Hepsiburada panelinden alıyosun; fiyat 1.299,90 TL -> kuponla 999 TL oldu #efsane
English product + policy
The subscription renewal failed because the billing address was updated after the invoice was issued; please verify the tax ID, payment method, and regional compliance settings before retrying.
English technical incident
The Kubernetes pod entered CrashLoopBackOff after the readiness probe timed out at /healthz; logs show connection refused on port 8080 and a missing DATABASE_URL secret.
English research/math
In the ablation study, perplexity decreased from 8.42 to 7.91 while the tokenizer reduced average sequence length by 6.3% on English scientific abstracts.
Interpretation
Cevher Tokenizer is strongest on the target distribution: Turkish-first text plus practical English/technical coverage. It wins the overall weighted evaluation while using a compact 64k vocabulary.
Where Cevher is strongest:
- Turkish general text
- Turkish noisy/social text
- Turkish apostrophes and proper-name suffixes
- Mixed Turkish-English text
- Turkish finance, medical, ecommerce, and search/query style text
- Security/config style text
Where larger external tokenizers can still be better:
- Llama-style tokenizers are stronger on code, math, and logs/URL-heavy text.
- Gemma-style tokenizers are stronger on English-only and emoji/symbol-heavy slices.
- Kumru is strong on some Turkish morphology and education fixtures.
- GPT2-Turkish remains strong on some news/legal and Turkish case-i fixtures.
This is an intended tradeoff: Cevher is not a giant universal tokenizer. It is a small Turkish-first tokenizer with enough English and technical coverage to serve Cevher models efficiently.
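The strengths and weaknesses listed above can be read off the domain matrix by taking the lowest value in each column. A minimal sketch using four columns and three rows copied from the matrix (`best_per_domain` is a hypothetical helper):

```python
def best_per_domain(matrix):
    """Map each domain to the tokenizer with the lowest TPW (lower is better)."""
    domains = next(iter(matrix.values())).keys()
    return {d: min(matrix, key=lambda name: matrix[name][d]) for d in domains}

# Subset of the domain matrix; values copied from the table.
matrix = {
    "Cevher 64k": {"TR general": 1.235, "mixed TR/EN": 1.470, "code": 2.349, "EN general": 1.278},
    "Llama-3.2-1B": {"TR general": 1.980, "mixed TR/EN": 1.588, "code": 2.093, "EN general": 1.194},
    "Gemma-3-1B-IT": {"TR general": 2.000, "mixed TR/EN": 1.706, "code": 2.463, "EN general": 1.167},
}
print(best_per_domain(matrix))
```

On this subset, Cevher 64k leads on Turkish and mixed text, Llama-3.2-1B on code, and Gemma-3-1B-IT on English.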
Intended use
Use this tokenizer for Cevher model pretraining, fine-tuning, evaluation, and inference when the expected text distribution is Turkish-heavy but includes English and technical content.
Good fit:
- Turkish LLMs
- Turkish-English mixed applications
- Turkish web, social, finance, medical, ecommerce, education, and search data
- Code/log/config/security/math text appearing inside mostly Turkish or bilingual corpora
Less ideal as a standalone choice:
- English-only LLMs
- code-only models
- math-only models
- multilingual models where hundreds of languages have equal priority
Limitations
Tokenizer efficiency does not measure model quality. A lower TPW helps sequence efficiency but does not guarantee better downstream accuracy.
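As a rough illustration of what sequence efficiency means in practice, the sketch below estimates how many words fit in a fixed token budget at a given TPW. The 4096-token budget is a hypothetical figure and `words_per_budget` is a hypothetical helper; the TPW values are the weighted figures from the evaluation summary, and real texts vary by domain.

```python
def words_per_budget(token_budget, tpw):
    """Approximate number of whole words that fit in `token_budget` tokens."""
    return int(token_budget / tpw)

budget = 4096  # hypothetical context length, for illustration only
for name, tpw in [("Cevher 64k", 1.8266), ("Llama-3.2-1B", 2.4186)]:
    print(name, words_per_budget(budget, tpw))
```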
The evaluation set is intrinsic and tokenizer-focused. It measures segmentation efficiency, exact roundtrip behavior, UNK rate, and byte fallback rate. It does not measure reasoning, factuality, safety, or task performance.
License and provenance
The tokenizer artifact is released under Apache-2.0.
Training data was prepared for tokenizer learning, not for direct text generation. The corpus mix was designed around Turkish-first coverage with English and technical support. It includes Turkish web/news/wiki-style text, Turkish vertical-domain text, English educational/technical text, code/config/log/math/security components, and deterministic synthetic text for rare but important patterns such as URLs, JSON, Kubernetes/YAML, CVE/security strings, Turkish apostrophes, and Turkish casing.
Source preparation included normalization and configured PII/secret scrubbing for relevant components. The tokenizer stores learned tokenization statistics only; it is not a language model and cannot reproduce training documents by itself.