Cevher Tokenizer

Cevher Tokenizer is a compact Turkish-first byte-level BPE tokenizer for Cevher language models.

This release is optimized for a practical 60/40 mix: strong Turkish coverage first, while staying useful for English and technical text such as code, math, logs, JSON, URLs, security advisories, and config files.

  • Model type: Byte-Level BPE
  • Vocabulary size: 64,000
  • Current checkpoint: Cevher 64k
  • Hard gates: exact UTF-8 roundtrip, zero UNK on eval fixtures

Install

pip install transformers tokenizers

Quick start

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

text = "İstanbul'un sokaklarında JSON logları ve Python kodu birlikte akıyor."
ids = tok.encode(text)

print(ids)
print(tok.decode(ids, skip_special_tokens=True))

Batch tokenization

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

texts = [
    "Merhaba dünya! Hello world!",
    "CVE-2026-1234 path=/api/v1/users?debug=true status=403",
    "def hesapla(x): return x**2 + 3",
]

batch = tok(texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)

Inspect tokens

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

text = "Türkiye'de e-ticaret, HTTP logları ve yapay zekâ modelleri."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)

for token_id, token in zip(ids, tokens):
    print(token_id, token)

Prompt-style text

Cevher Tokenizer is only a tokenizer. It does not define a chat template by itself. You can tokenize any plain prompt format used by your model.

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

prompt = "Kullanıcı: Bugün hava nasıl?\nAsistan:"
inputs = tok(prompt, return_tensors="pt")
print(inputs["input_ids"].shape)
print(tok.decode(inputs["input_ids"][0], skip_special_tokens=True))

Save and reload locally

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")
tok.save_pretrained("./cevher-tokenizer-local")

reloaded = PreTrainedTokenizerFast.from_pretrained("./cevher-tokenizer-local")
text = "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor."
assert reloaded.decode(reloaded.encode(text), skip_special_tokens=True) == text

Example coverage

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("batuhanozkose/cevher-tokenizer")

examples = [
    "Ankara'daki öğrenciler YKS ve KPSS için çalışıyor.",
    "The model parses JSON, YAML, URLs, logs, and Python snippets.",
    r"\\int_0^1 x^2 dx = \\frac{1}{3}",
    "GET /api/v1/users?id=42 HTTP/1.1 200 OK",
    "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: cevher",
]

for text in examples:
    ids = tok.encode(text)
    print(len(ids), tok.decode(ids, skip_special_tokens=True) == text, text)

All examples above were smoke-tested with transformers.PreTrainedTokenizerFast.

Evaluation summary

Metric: tokens per word (TPW), lower is better. Score includes roundtrip penalty where applicable.

Cevher 64k is the public tokenizer in this repository. Cevher 49k is reported for transparency, but is currently intended for internal/Core model use only.

rank tokenizer weighted TPW ↓ score ↓ roundtrip UNK
1 Cevher 64k 1.8266 1.8266 1.0000 0.000000
2 Cevher 49k (internal use) 1.8877 1.8877 1.0000 0.000000
3 GPT2-Turkish-Cased 2.1618 2.1618 1.0000 0.000000
4 Bonur-GPT2-Turkish 2.2559 2.2559 1.0000 0.000000
5 Kumru-2B-Base 2.3186 2.3186 1.0000 0.000000
6 Llama-3.2-1B 2.4186 2.4186 1.0000 0.000000
7 Gemma-3-1B-IT 2.5698 2.5698 1.0000 0.000000
8 Gemma-2-2B 2.5821 2.5821 1.0000 0.000000
9 Qwen2.5-0.5B 2.8433 2.8433 1.0000 0.000000
10 Qwen3-0.6B 2.8433 2.8433 1.0000 0.000000
11 XLM-RoBERTa-Base 2.3228 336.3228 0.6660 0.000000
12 mT5-Small 2.4890 336.4890 0.6660 0.000000

Domain matrix

Lower is better. Columns are grouped by broad use case.

tokenizer TR general TR news/legal TR noisy TR morphology TR apostrophe TR case-i EN general mixed TR/EN code math logs/JSON/URL security finance medical ecommerce education search multilingual emoji/symbols
Cevher 64k 1.235 1.303 1.349 1.938 1.361 1.628 1.278 1.470 2.349 2.945 9.205 1.866 1.240 1.448 1.441 1.242 1.074 2.000 2.394
Cevher 49k (internal use) 1.275 1.347 1.372 1.938 1.389 1.628 1.306 1.588 2.419 2.945 9.505 1.966 1.298 1.586 1.559 1.363 1.111 2.058 2.424
GPT2-Turkish-Cased 1.249 1.255 1.349 2.004 1.444 1.571 2.528 2.088 3.509 3.636 12.462 2.372 1.327 1.552 1.584 1.273 1.323 2.058 3.021
Bonur-GPT2-Turkish 1.277 1.355 1.395 1.975 1.384 1.686 2.750 2.177 3.834 3.607 13.254 2.511 1.259 1.724 1.583 1.303 1.487 2.206 3.262
Kumru-2B-Base 1.294 1.278 1.535 1.792 1.672 1.628 2.417 2.382 3.579 3.442 14.275 2.700 1.576 1.655 1.908 1.181 1.481 2.353 3.145
Llama-3.2-1B 1.980 2.320 1.884 3.204 2.136 2.143 1.194 1.588 2.093 2.612 7.905 2.400 2.421 2.483 2.383 2.212 1.593 2.147 2.394
Gemma-3-1B-IT 2.000 2.298 1.860 3.013 2.625 2.314 1.167 1.706 2.463 2.720 10.775 2.634 2.365 2.345 2.438 2.061 1.593 1.970 2.266
Gemma-2-2B 2.020 2.398 1.814 3.200 2.597 2.228 1.167 1.676 2.439 2.720 10.574 2.600 2.444 2.345 2.438 2.242 1.444 1.970 2.357
Qwen2.5-0.5B 2.431 2.778 2.116 3.900 2.753 2.485 1.194 1.853 2.230 2.664 9.573 2.834 2.839 2.897 2.791 2.364 1.926 2.294 2.691
Qwen3-0.6B 2.431 2.778 2.116 3.900 2.753 2.485 1.194 1.853 2.230 2.664 9.573 2.834 2.839 2.897 2.791 2.364 1.926 2.294 2.691
XLM-RoBERTa-Base 1.613 1.814 1.628 2.442 2.005 1.971 1.500 1.882 3.074 3.254 11.458 2.306 1.815 2.000 1.773 1.727 1.492 1.941 2.286
mT5-Small 1.908 2.184 1.930 2.913 2.602 2.257 1.695 2.000 2.879 3.019 8.860 2.434 2.013 2.138 2.028 2.030 1.825 2.235 2.348

Tokenization examples against top competitors

The following examples compare token counts on complex mixed texts. Lower token count means the same text fits into fewer model tokens.

Competitors shown here are the top three non-Cevher tokenizers by overall evaluation score: GPT2-Turkish-Cased, Bonur-GPT2-Turkish, and Kumru-2B-Base.

example Cevher 64k Cevher 49k (internal use) GPT2-Turkish-Cased Bonur-GPT2-Turkish Kumru-2B-Base
Turkish legal + finance 59 59 63 64 85
Mixed TR/EN technical support 67 69 84 89 97
Code + YAML config 76 83 103 111 90
Math + LaTeX physics 63 63 77 78 78
Logs + JSON + URL 75 78 99 104 112
Noisy social + ecommerce 46 48 52 47 57
English product + policy 38 39 68 70 63
English technical incident 36 37 64 63 62
English research/math 37 38 58 59 62
Example texts used for the table

Turkish legal + finance

İstanbul 12. Asliye Ticaret Mahkemesi'nin 2026/142 E. sayılı dosyasında, şirketin 31.03.2026 tarihli bilançosunda özkaynak/borç oranı %18,7 olarak raporlandı; USD/TRY 41,25 seviyesindeyken BIST-100 endeksi gün içinde %2,14 yükseldi.

Mixed TR/EN technical support

Kullanıcı login olamıyor: frontend `401 Unauthorized` alıyor, backend loglarında `jwt expired at 2026-06-24T10:15:33Z` görünüyor; refresh-token endpoint'i `/api/v2/auth/refresh?tenant=tr-prod` üzerinden 302 dönmüş.

Code + YAML config

```python
def normalize_user(row: dict) -> dict:
    return {"id": int(row["id"]), "şehir": row.get("city", "İstanbul")}
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cevher-api
```

Math + LaTeX physics

Enerji denklemi $E=mc^2$ ile başlar; ayrıca $\nabla\cdot\vec{E}=\rho/\epsilon_0$ ve $\int_0^1 x^2 dx = 1/3$ ifadeleri modelin LaTeX, Unicode ve Türkçe açıklama karışımını nasıl böldüğünü gösterir.

Logs + JSON + URL

2026-06-24T12:44:09.331Z level=ERROR service=payment trace_id=tr_7fa9 status=503 url=https://api.example.com/v1/orders?id=42&currency=TRY payload={"user":"ayşe","amount":1299.90,"retry":true}

Noisy social + ecommerce

abi ürün baya iyiymiş 😂 ama kargo 3 günde geldi... satıcıya sordum 'iade kodu nerede?' dedi ki Hepsiburada panelinden alıyosun; fiyat 1.299,90 TL -> kuponla 999 TL oldu #efsane

English product + policy

The subscription renewal failed because the billing address was updated after the invoice was issued; please verify the tax ID, payment method, and regional compliance settings before retrying.

English technical incident

The Kubernetes pod entered CrashLoopBackOff after the readiness probe timed out at /healthz; logs show connection refused on port 8080 and a missing DATABASE_URL secret.

English research/math

In the ablation study, perplexity decreased from 8.42 to 7.91 while the tokenizer reduced average sequence length by 6.3% on English scientific abstracts.

Interpretation

Cevher Tokenizer is strongest on the target distribution: Turkish-first text plus practical English/technical coverage. It wins the overall weighted evaluation while using a compact 64k vocabulary.

Where Cevher is strongest:

  • Turkish general text
  • Turkish noisy/social text
  • Turkish apostrophes and proper-name suffixes
  • Mixed Turkish-English text
  • Turkish finance, medical, ecommerce, and search/query style text
  • Security/config style text

Where larger external tokenizers can still be better:

  • Llama-style tokenizers are stronger on code, math, and logs/URL-heavy text.
  • Gemma-style tokenizers are stronger on English-only and emoji/symbol-heavy slices.
  • Kumru is strong on some Turkish morphology and education fixtures.
  • GPT2-Turkish remains strong on some news/legal and Turkish case-i fixtures.

This is an intended tradeoff: Cevher is not a giant universal tokenizer. It is a small Turkish-first tokenizer with enough English and technical coverage to serve Cevher models efficiently.

Intended use

Use this tokenizer for Cevher model pretraining, fine-tuning, evaluation, and inference when the expected text distribution is Turkish-heavy but includes English and technical content.

Good fit:

  • Turkish LLMs
  • Turkish-English mixed applications
  • Turkish web, social, finance, medical, ecommerce, education, and search data
  • Code/log/config/security/math text appearing inside mostly Turkish or bilingual corpora

Less ideal as a standalone choice:

  • English-only LLMs
  • code-only models
  • math-only models
  • multilingual models where hundreds of languages have equal priority

Limitations

Tokenizer efficiency does not measure model quality. A lower TPW helps sequence efficiency but does not guarantee better downstream accuracy.

The evaluation set is intrinsic and tokenizer-focused. It measures segmentation efficiency, exact roundtrip behavior, UNK rate, and byte fallback rate. It does not measure reasoning, factuality, safety, or task performance.

License and provenance

The tokenizer artifact is released under Apache-2.0.

Training data was prepared for tokenizer learning, not for direct text generation. The corpus mix was designed around Turkish-first coverage with English and technical support. It includes Turkish web/news/wiki-style text, Turkish vertical-domain text, English educational/technical text, code/config/log/math/security components, and deterministic synthetic text for rare but important patterns such as URLs, JSON, Kubernetes/YAML, CVE/security strings, Turkish apostrophes, and Turkish casing.

Source preparation included normalization and configured PII/secret scrubbing for relevant components. The tokenizer stores learned tokenization statistics only; it is not a language model and cannot reproduce training documents by itself.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support