KOZTAM — Koklear Okuma Zorluğu Tahminleme Modeli
KOZTAM (Cochlear Reading-Difficulty Forecasting Model) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8) with a frozen BERTurk encoder and a lightweight CORAL ordinal head, reaching 0.33 ordinal MAE, 0.72 exact-grade accuracy, and 0.96 within-one-grade accuracy on held-out test data.
Data
KOZTAM is trained on the behAIvNET/KOZTAM dataset: 1,588 synthetic Turkish texts spanning grades 1–8 in two categories (informative / narrative), generated under strict readability targets aligned with MEB (the Turkish Ministry of National Education). The count of 1,588 is after deduplication: one exact-duplicate text, present under both a grade-3 and a grade-5 folder, was removed because it carried conflicting ordinal labels.
The texts are partitioned into 1,110 training / 239 validation / 239 test by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split. Because the BERTurk backbone is frozen, each text's [CLS] embedding (768-d) is pre-computed once and cached, and only the head is trained on those cached vectors. The mean [CLS] norm is ≈ 25.2 in all three splits, a coarse check that shows no sign of distributional shift between them.
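The two-stage stratified partitioning described above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' splitting script: the function name `stratified_split`, the 70/15/15 fractions (roughly what the 1,110 / 239 / 239 counts imply), the seed, and the 100-texts-per-stratum toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(strata, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists, stratified on `strata`.

    Stage 1 carves the test share out of every stratum; stage 2 splits
    what is left of that stratum into validation and training.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, key in enumerate(strata):
        by_stratum[key].append(idx)

    _, val_frac, test_frac = fractions
    train, val, test = [], [], []
    for key in sorted(by_stratum):
        idxs = by_stratum[key]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)              # stage 1
        rest = idxs[n_test:]
        # Stage 2: validation share relative to the remaining texts.
        n_val = round(len(rest) * val_frac / (1.0 - test_frac))
        test += idxs[:n_test]
        val += rest[:n_val]
        train += rest[n_val:]
    return train, val, test


# Hypothetical labels: 8 grades x 2 categories, 100 texts per stratum.
strata = [(grade, cat)
          for grade in range(1, 9)
          for cat in ("informative", "narrative")
          for _ in range(100)]
train, val, test = stratified_split(strata)
print(len(train), len(val), len(test))
```

Because every stratum is split with the same fractions, each split inherits both the grade distribution and the category balance of the full set. With real stratum sizes the per-stratum rounding will not necessarily reproduce the exact 1,110 / 239 / 239 counts.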
A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but drops at grade 8 (shorter, in both categories, than grades 5–7). Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model must rely on lexical and syntactic density rather than on length.
Model
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LEN = 512  # longer texts are truncated to this many tokens
K = 8          # number of grades

# Frozen BERTurk backbone, used only as a feature extractor.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
for p in bert.parameters():
    p.requires_grad_(False)


@torch.no_grad()
def encode_cls(texts):
    """Return the 768-d [CLS] embedding of each text."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
    return bert(**enc).last_hidden_state[:, 0, :]


class CoralHead(nn.Module):
    """CORAL head: one shared weight vector plus K-1 rank-specific biases."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(num_classes - 1))

    def forward(self, x):
        # (B, 1) + (K-1,) broadcasts to (B, K-1) rank logits.
        return self.fc(x) + self.bias


class KOZTAMNet(nn.Module):
    """Projection MLP with an ordinal (grade) head and a binary category head."""

    def __init__(self, in_dim=768, p=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(p),
            nn.Linear(256, 64))
        self.ordinal = CoralHead(64, K)
        self.category = nn.Linear(64, 1)

    def forward(self, x):
        z = self.proj(x)
        return z, self.ordinal(z), self.category(z).squeeze(-1)


# Load the trained head and the seven calibrated thresholds from the checkpoint.
ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
head = KOZTAMNet().to(DEVICE)
head.load_state_dict(ckpt["state_dict"])
head.eval()
thresholds = torch.tensor(ckpt["coral_thresholds"], device=DEVICE)

CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}


@torch.no_grad()
def predict(texts):
    """Grade one text (str) or a list of texts."""
    single = isinstance(texts, str)
    cls = encode_cls([texts] if single else list(texts))
    z = head.proj(cls)
    # Continuous difficulty score: shared CORAL weight only, without the biases.
    score = head.ordinal.fc(z).squeeze(-1)
    # Grade = 1 + number of calibrated thresholds the score exceeds.
    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1
    cat = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
           for g, s, c in zip(grade, score, cat)]
    return out[0] if single else out


print(predict("Sample text."))
```
Performance
All results below are computed on the held-out test set (239 texts), which the model never saw during training.
Headline metrics (calibrated).
| Metric | Value |
|---|---|
| Ordinal MAE | 0.326 |
| Exact-grade accuracy | 0.715 |
| ±1-grade accuracy | 0.958 |
| Spearman ρ (predicted vs. true grade) | 0.965 |
| Text-category accuracy | 1.000 |
Threshold calibration. KOZTAM has two parts: the trained network, which outputs a continuous difficulty score, and seven calibrated thresholds that cut that score into discrete grades (stored in the checkpoint and applied at inference). The raw network already ranks texts near-perfectly (continuous-score Spearman ρ = 0.966), but the default CORAL decision rule, which cuts each of the seven rank probabilities at 0.5, misplaced the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, no re-training) produced the final metrics:
| Metric (test) | Raw (0.5-threshold) | Calibrated |
|---|---|---|
| Ordinal MAE | 0.837 | 0.326 |
| Exact accuracy | 0.423 | 0.715 |
| ±1 accuracy | 0.782 | 0.958 |
| Spearman ρ | 0.924 | 0.965 |
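On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The calibrated test MAE (0.326) stays close to that validation figure, which suggests the thresholds generalize rather than overfit the validation split.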
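A minimal sketch of threshold calibration by coordinate descent on MAE follows. It is one plausible reading of the procedure named above, not the released calibration code: the candidate grid (midpoints between adjacent scores), the number of sweeps, the function names, and the synthetic scores are all assumptions.

```python
import numpy as np


def decode(scores, thresholds):
    """Grade = 1 + number of thresholds the score exceeds."""
    return (scores[:, None] > thresholds[None, :]).sum(axis=1) + 1


def mae(scores, grades, thresholds):
    return float(np.abs(decode(scores, thresholds) - grades).mean())


def calibrate(scores, grades, init_thresholds, n_sweeps=5):
    """Move one threshold at a time to the cut that lowers MAE the most."""
    th = np.asarray(init_thresholds, dtype=float).copy()
    s = np.unique(scores)                      # sorted unique scores
    candidates = (s[:-1] + s[1:]) / 2.0        # cuts between adjacent scores
    best = mae(scores, grades, th)
    for _ in range(n_sweeps):
        improved = False
        for k in range(len(th)):
            # Keep thresholds ordered: stay strictly between the neighbours.
            lo = th[k - 1] if k > 0 else -np.inf
            hi = th[k + 1] if k + 1 < len(th) else np.inf
            for c in candidates[(candidates > lo) & (candidates < hi)]:
                trial = th.copy()
                trial[k] = c
                m = mae(scores, grades, trial)
                if m < best:
                    best, th, improved = m, trial, True
        if not improved:
            break
    return th, best


# Synthetic stand-in for validation scores (not the model's real outputs).
rng = np.random.default_rng(0)
grades = np.repeat(np.arange(1, 9), 30)
scores = 1.5 * (grades - 4.5) + rng.normal(0.0, 0.6, size=grades.size)

init = np.linspace(-1.0, 1.0, 7)               # deliberately badly placed cuts
raw_mae = mae(scores, grades, init)
th, cal_mae = calibrate(scores, grades, init)
print(f"raw MAE {raw_mae:.3f} -> calibrated MAE {cal_mae:.3f}")
```

Only the seven cut points move; the scores, and therefore the ranking, are untouched. That is why calibration can change MAE and accuracy substantially while leaving the continuous-score ρ unchanged.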
Continuous difficulty score by grade (test). The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head, operating on frozen BERTurk embeddings, learned the reading-difficulty ordering:
| Grade | Mean score ± SD |
|---|---|
| 1 | −6.91 ± 1.18 |
| 2 | −2.24 ± 1.52 |
| 3 | −0.99 ± 0.48 |
| 4 | −0.33 ± 0.45 |
| 5 | 0.15 ± 0.29 |
| 6 | 0.73 ± 0.50 |
| 7 | 1.43 ± 0.63 |
| 8 | 4.07 ± 1.49 |
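The monotonicity claim can be checked directly from the table. The short sketch below copies the per-grade means from the table and prints the gap between each pair of adjacent grades; the variable names are illustrative.

```python
import numpy as np

# Per-grade mean scores copied from the table above (grades 1..8).
mean_score = np.array([-6.91, -2.24, -0.99, -0.33, 0.15, 0.73, 1.43, 4.07])

gaps = np.diff(mean_score)            # grade g+1 mean minus grade g mean
print(bool(np.all(gaps > 0)))         # True: no inversion anywhere
for g, gap in enumerate(gaps, start=1):
    print(f"grade {g} -> {g + 1}: {gap:.2f}")
```

The gaps at the two ends of the scale (4.67 and 2.64) are several times larger than the mid-scale gaps (0.48 to 1.25), so the extreme grades are far easier to separate with a threshold than the middle ones.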
Per-grade MAE (calibrated, test).
| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |
The model is sharpest at the extremes (grades 1 and 8, MAE ≤ 0.100) and at grade 2 (0.133); its largest residual errors sit at the 3–4 and 6–7 boundaries (grade-4 MAE 0.690, grade-7 MAE 0.567), where texts are genuinely close in readability. On the raw (uncalibrated) predictions, MAE was balanced across the two categories (informative 0.832 / narrative 0.842), and truncated texts (>512 tokens, n = 12) were only marginally harder than untruncated ones (MAE 1.000 vs. 0.828). Neither category nor truncation therefore appears to be a source of systematic error, although n = 12 is too small for a firm conclusion about truncation.
Confusion matrix (calibrated, test). Rows = true grade, columns = predicted grade.
| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
| 5 | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
| 6 | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
| 7 | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |
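The headline and per-grade metrics can be recomputed from this confusion matrix alone, which is a useful consistency check. The sketch below copies the matrix from the table; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the table: rows = true grade, columns = predicted grade.
cm = np.array([
    [29,  1,  0,  0,  0,  0,  0,  0],
    [ 3, 26,  1,  0,  0,  0,  0,  0],
    [ 0,  9, 20,  1,  0,  0,  0,  0],
    [ 0,  1, 11, 12,  3,  2,  0,  0],
    [ 0,  0,  1,  5, 19,  5,  0,  0],
    [ 0,  0,  0,  2,  4, 22,  1,  1],
    [ 0,  0,  0,  0,  2,  8, 15,  5],
    [ 0,  0,  0,  0,  0,  1,  1, 28],
])

grades = np.arange(1, 9)
dist = np.abs(grades[:, None] - grades[None, :])    # |true - predicted|
n = cm.sum()

mae = (cm * dist).sum() / n
exact = np.trace(cm) / n
within_one = cm[dist <= 1].sum() / n
per_grade_mae = (cm * dist).sum(axis=1) / cm.sum(axis=1)

print(n)                                  # 239
print(f"{mae:.3f} {exact:.3f} {within_one:.3f}")    # 0.326 0.715 0.958
print(per_grade_mae.round(3))
```

The recomputed values match the headline table and the per-grade MAE row. The grade-4 row sums to 29 rather than 30, which accounts for the test total of 239.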
Ablation (test). Each loss component was removed in turn, with everything else — seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration — held identical.
| Variant | Continuous-score ρ | MAE | Exact acc | ±1 acc |
|---|---|---|---|---|
| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
| − category | 0.964 | 0.410 | 0.636 | 0.954 |
| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |
Removing the ordinal-aware triplet loss raises MAE by 0.075 while the continuous-score ranking ρ is essentially unchanged (0.966 vs. 0.963). This suggests that the triplet term improves the even spacing of the embedding axis (its calibratability) rather than the ranking itself.
Explainability (XAI)
Reading-difficulty attributions are produced with Integrated Gradients (IG) on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back to whole words. Before attribution, the end-to-end score is verified against the cached score of the same text; the two match to within 1.1 × 10⁻⁵, confirming that IG explains the exact deployed model. IG is run with n_steps = 50 and yields a convergence delta of 0.056, which is small relative to the score magnitude, so the completeness axiom holds to a good approximation.
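To make the completeness axiom and the convergence delta concrete, here is a self-contained toy sketch of IG. It does not use the KOZTAM model or any attribution library: the scoring function `f`, its weights, the input, and the midpoint Riemann rule are all assumptions chosen so the example runs with NumPy only.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad


# Toy differentiable score f(x) = tanh(w . x) with its analytic gradient.
w = np.array([0.8, -0.5, 0.3])


def f(x):
    return np.tanh(w @ x)


def grad_f(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w


x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros_like(x)

attr = integrated_gradients(grad_f, x, baseline, n_steps=50)

# Completeness: attributions should sum to f(x) - f(baseline).
# The convergence delta is the size of the mismatch.
delta = attr.sum() - (f(x) - f(baseline))
print(attr)
print(f"sum = {attr.sum():.4f}, f(x) - f(baseline) = {f(x) - f(baseline):.4f}")
print(f"convergence delta = {abs(delta):.2e}")
```

A delta near zero means the attributions account for essentially the whole score difference. Increasing `n_steps` tightens the approximation at the cost of more gradient evaluations.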
Positive attribution means a word pushes the difficulty score up (harder); negative means it pushes the score down (easier). Because IG distributes each text's total score relative to an empty-content baseline, the absolute magnitude of the attributions scales with how far the text's score sits from that baseline, so magnitudes are comparable within a text but not across texts. Three representative test cases are shown below.
Case 1 — clear-easy (true grade 1, predicted 1).
| Increases difficulty | value | Decreases difficulty | value |
|---|---|---|---|
| elmanın | +0.54 | neşeyle | −0.27 |
| o | +0.33 | güzelce | −0.26 |
| kuralımızdır | +0.27 | denizi | −0.22 |
| Dünyadaki | +0.24 | yapıp | −0.20 |
| yansıtır | +0.23 | üç | −0.16 |
Case 2 — clear-hard (true grade 7, predicted 7).
| Increases difficulty | value | Decreases difficulty | value |
|---|---|---|---|
| ilerlerler | +0.26 | kıyafetlerin | −0.06 |
| abartıya | +0.13 | hızlı | −0.05 |
| ufkunu | +0.07 | Kendi | −0.05 |
| koruyanlar | +0.06 | Kendi | −0.04 |
| sadeliği | +0.06 | | |
Case 3 — boundary error (true grade 6, predicted 7).
| Increases difficulty | value | Decreases difficulty | value |
|---|---|---|---|
| alışkanlıklarımızda | +0.69 | temizliğimizi | −0.66 |
| yaşamımızdaki | +0.60 | ve | −0.31 |
| cildimizin | +0.51 | seçmesi | −0.28 |
| kolaylaştırır | +0.23 | temelidir | −0.28 |
The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (alışkanlıklarımızda, yaşamımızdaki, cildimizin). This is consistent with the model reading morphological and lexical density as a difficulty signal, although the similarly heavy temizliğimizi carries the largest negative attribution, so word length alone does not fix the sign. The single-grade error falls on exactly the kind of neighbouring boundary where the difficulty distinction is genuinely soft.
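The word-level values in the tables above depend on merging token attributions back to whole words. A minimal sketch of that step is given below, assuming WordPiece-style `##` continuation markers and assuming the merge is a sum (which keeps the total attribution unchanged). The function name, the token split, and the attribution values are hypothetical and are not taken from the model.

```python
def merge_wordpieces(tokens, attributions):
    """Sum the attributions of '##' continuation pieces into their word."""
    words, scores = [], []
    for tok, attr in zip(tokens, attributions):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]      # glue the piece onto the current word
            scores[-1] += attr        # and add its attribution
        else:
            words.append(tok)
            scores.append(attr)
    return list(zip(words, scores))


# Hypothetical tokenization and invented attribution values.
tokens = ["kitap", "##lar", "##ımız", "güzel"]
attrs = [0.20, 0.10, 0.05, -0.10]

for word, score in merge_wordpieces(tokens, attrs):
    print(f"{word}: {score:+.2f}")
```

In a real pipeline, special tokens such as [CLS] and [SEP] would be dropped before merging. Summing rather than averaging means long, heavily suffixed words accumulate attribution from every piece, which is worth keeping in mind when reading the tables.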