---
language:
- tr
license: mit
library_name: pytorch
pipeline_tag: text-classification
base_model: dbmdz/bert-base-turkish-cased
datasets:
- behAIvNET/KOZTAM
tags:
- turkish
- education
- special-education
- K-8
- reading
- reading-difficulty
- reading-texts
- synthetic-dataset
- hearing-loss
- cochlear
- bert
- berturk
- coral
- ordinal-regression
- xai
- integrated-gradients
metrics:
- mae
- accuracy
---

# KOZTAM: Koklear Okuma Zorluğu Tahminleme Modeli

**KOZTAM** (*Cochlear Reading-Difficulty Forecasting Model*) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8) with a frozen BERTurk encoder and a lightweight CORAL ordinal head, reaching **0.33 ordinal MAE**, **0.72 exact-grade accuracy**, and **0.96 within-one-grade accuracy** on held-out test data.

## Data

KOZTAM is trained on the [`behAIvNET/KOZTAM`](https://huggingface.co/datasets/behAIvNET/KOZTAM) dataset: **1,588 synthetic Turkish texts** spanning grades 1–8 in two categories (informative / narrative), generated under strict MEB-aligned readability targets. One exact-duplicate text (present under both a grade-3 and a grade-5 folder) was removed to eliminate a conflicting ordinal label, giving the final 1,588.

The texts are partitioned into **1,110 training / 239 validation / 239 test** by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split.

Because the BERTurk backbone is frozen, each text's `[CLS]` embedding (768-d) is pre-computed once and cached, and only the head is trained on those cached vectors. The mean `[CLS]` norm is ≈ 25.2 in all three splits, which is consistent with no gross distributional shift between them.

A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but **drops at grade 8** (shorter, in both categories, than grades 5–7).
Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model must rely on lexical and syntactic density rather than on length.

## Model

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LEN = 512
K = 8  # number of grades

# Frozen BERTurk encoder: used only to produce [CLS] embeddings.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
for p in bert.parameters():
    p.requires_grad_(False)


@torch.no_grad()
def encode_cls(texts):
    """Return the 768-d [CLS] embedding of each text (truncated to MAX_LEN tokens)."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
    return bert(**enc).last_hidden_state[:, 0, :]


class CoralHead(nn.Module):
    """CORAL head: one shared weight vector plus K-1 rank-specific biases."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(num_classes - 1))

    def forward(self, x):
        return self.fc(x) + self.bias


class KOZTAMNet(nn.Module):
    """Projection (768 -> 256 -> 64) feeding an ordinal head and a category head."""

    def __init__(self, in_dim=768, p=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(),
            nn.Dropout(p), nn.Linear(256, 64))
        self.ordinal = CoralHead(64, K)
        self.category = nn.Linear(64, 1)

    def forward(self, x):
        z = self.proj(x)
        return z, self.ordinal(z), self.category(z).squeeze(-1)


# The checkpoint stores the head weights and the seven calibrated thresholds.
ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
head = KOZTAMNet().to(DEVICE)
head.load_state_dict(ckpt["state_dict"])
head.eval()
thresholds = torch.tensor(ckpt["coral_thresholds"], device=DEVICE)

CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}


@torch.no_grad()
def predict(texts):
    """Predict grade, continuous score, and category for one text or a list of texts."""
    single = isinstance(texts, str)
    cls = encode_cls([texts] if single else list(texts))
    z = head.proj(cls)
    score = head.ordinal.fc(z).squeeze(-1)  # continuous difficulty score
    # Grade = 1 + number of calibrated thresholds the score exceeds.
    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1
    cat = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
           for g, s, c in zip(grade, score, cat)]
    return out[0] if single else out


print(predict("Sample text."))
```

## Performance

All results are on the held-out **test set (239 texts)**, which the model never saw during training.

**Headline metrics (calibrated).**

| Metric | Value |
|---|---|
| Ordinal MAE | **0.326** |
| Exact-grade accuracy | **0.715** |
| ±1-grade accuracy | **0.958** |
| Spearman ρ (predicted vs. true grade) | **0.965** |
| Text-category accuracy | **1.000** |

**Threshold calibration.** KOZTAM has two parts: the trained network, which outputs a continuous *difficulty score*, and seven **calibrated thresholds** that cut that score into discrete grades (stored in the checkpoint, applied at inference). The raw network already ranks texts near-perfectly (continuous-score Spearman ρ = **0.966**), but the default 0.5 thresholds misplaced the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, **no re-training**) produced the final metrics:

| Metric (test) | Raw (0.5 threshold) | Calibrated |
|---|---|---|
| Ordinal MAE | 0.837 | 0.326 |
| Exact accuracy | 0.423 | 0.715 |
| ±1 accuracy | 0.782 | 0.958 |
| Spearman ρ | 0.924 | 0.965 |

On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The test MAE of 0.326 stays close to that figure, which suggests the calibration generalizes rather than overfitting the validation split.

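The threshold re-placement described above can be sketched as a small coordinate-descent loop. This is a minimal illustration, not the authors' script: the function names, the choice of candidate cut points (midpoints between sorted validation scores), the sweep order, and the toy three-grade data are all assumptions made for the example.

```python
def grades_from_scores(scores, thresholds):
    """Grade = 1 + number of thresholds the score exceeds."""
    return [1 + sum(s > t for t in thresholds) for s in scores]


def mae(scores, grades, thresholds):
    """Mean absolute error between predicted and true grades."""
    preds = grades_from_scores(scores, thresholds)
    return sum(abs(p - g) for p, g in zip(preds, grades)) / len(grades)


def calibrate(scores, grades, init, n_sweeps=10):
    """Coordinate descent: move one threshold at a time to the candidate
    that lowers MAE, keeping the thresholds in sorted order."""
    th = sorted(init)
    xs = sorted(set(scores))
    # Assumed candidate set: midpoints between consecutive distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = mae(scores, grades, th)
    for _ in range(n_sweeps):
        improved = False
        for k in range(len(th)):
            lo = th[k - 1] if k > 0 else float("-inf")
            hi = th[k + 1] if k + 1 < len(th) else float("inf")
            for c in candidates:
                if not (lo <= c <= hi):
                    continue  # would break the ordering of the thresholds
                trial = th[:k] + [c] + th[k + 1:]
                m = mae(scores, grades, trial)
                if m < best:
                    best, th, improved = m, trial, True
        if not improved:
            break  # no single-threshold move helps any more
    return th, best


# Toy validation data with three grades (made up for illustration).
val_scores = [-3.0, -2.5, -2.0, -0.4, -0.1, 0.2, 2.0, 2.4, 3.1]
val_grades = [1, 1, 1, 2, 2, 2, 3, 3, 3]
init = [-0.2, 0.1]  # deliberately badly placed starting thresholds

calibrated, best_mae = calibrate(val_scores, val_grades, init)
print("initial MAE:", mae(val_scores, val_grades, init))
print("calibrated thresholds:", calibrated, "MAE:", best_mae)
```

Because only the cut points move, the network weights and the ranking of texts by score are untouched, which is why calibration can change MAE and accuracy sharply while leaving the continuous-score ρ as it was.
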
**Continuous difficulty score by grade (test).** The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head (on top of the frozen encoder) learned the reading-difficulty ordering:

| Grade | Mean score ± SD |
|---|---|
| 1 | −6.91 ± 1.18 |
| 2 | −2.24 ± 1.52 |
| 3 | −0.99 ± 0.48 |
| 4 | −0.33 ± 0.45 |
| 5 | 0.15 ± 0.29 |
| 6 | 0.73 ± 0.50 |
| 7 | 1.43 ± 0.63 |
| 8 | 4.07 ± 1.49 |

**Per-grade MAE (calibrated, test).**

| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |

The model is sharpest at the ends of the scale (grades 1, 2, and 8). Its largest residual errors sit at the **3–4 and 6–7 boundaries** (11 grade-4 texts predicted as grade 3, and 8 grade-7 texts predicted as grade 6), where texts are genuinely close in readability. On the raw predictions, MAE was balanced across the two categories (informative 0.832 / narrative 0.842), and truncated texts (>512 tokens, n = 12) were only marginally harder (MAE 1.000 vs. 0.828), so neither category nor truncation appears to be a source of systematic error.

**Confusion matrix (calibrated, test).** Rows = true grade, columns = predicted grade.

| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| **1** | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| **2** | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
| **3** | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
| **4** | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
| **5** | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
| **6** | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
| **7** | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
| **8** | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |

![Training dynamics: loss and validation MAE/accuracy over epochs](training_curves.png)

![Confusion matrix: test set](confusion_matrix.png)

**Ablation (test).** Each loss component was removed in turn, with everything else (seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration) held identical.

| Variant | Ranking ρ (continuous score) | MAE | Exact acc | ±1 acc |
|---|---|---|---|---|
| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
| − category | 0.964 | 0.410 | 0.636 | 0.954 |
| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |

Removing the ordinal-aware triplet loss raises MAE by about 0.075 (0.326 → 0.402) while the ranking ρ is essentially unchanged. This suggests the triplet term improves the *even spacing* of the embedding axis (its calibratability) rather than the ranking itself.

## Explainability (XAI)

Reading-difficulty attributions are produced with **Integrated Gradients (IG)** on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back to whole words. Before attribution, the end-to-end score is verified against the cached score of the same text. The two match to within **1.1 × 10⁻⁵**, confirming that IG explains the exact deployed model. IG is run with `n_steps = 50`, and the convergence delta is 0.056 (negligible relative to the score magnitude), so the completeness axiom holds to a good approximation.

Positive attribution means a word pushes the difficulty score **up** (harder); negative means it pushes **down** (easier). Three representative test cases are shown below. Because IG distributes each text's total score relative to an empty-content baseline, the absolute magnitude of the attributions scales with how far a text's score sits from that baseline, so magnitudes are comparable *within* a text, not across texts.

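The completeness check behind the convergence delta can be illustrated on a toy problem. The sketch below is not the model card's pipeline: it assumes a hypothetical three-feature `tanh` scorer with a hand-written gradient and a midpoint Riemann sum over 50 steps, whereas the real setup differentiates the frozen BERTurk plus head. The names `W`, `score`, `grad`, and `integrated_gradients` are invented for the example.

```python
import math

W = [0.8, -0.5, 1.2]  # hypothetical weights of a toy scorer


def score(x):
    """Toy differentiable 'difficulty score': tanh of a weighted sum."""
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def grad(x):
    """Analytic gradient of score() with respect to each input feature."""
    s = score(x)
    return [(1.0 - s * s) * w for w in W]


def integrated_gradients(x, baseline, n_steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from the baseline to the input."""
    total = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g
    # Attribution_i = (x_i - baseline_i) * average gradient along the path.
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]  # stands in for the empty-content baseline

attr = integrated_gradients(x, baseline)
# Completeness: attributions should sum to score(x) - score(baseline).
delta = sum(attr) - (score(x) - score(baseline))
print("attributions:", [round(a, 4) for a in attr])
print("convergence delta:", delta)
```

The sign convention matches the text: a positive attribution pushes the score up and a negative one pushes it down. Since the attributions sum to the gap between the input's score and the baseline's score, their scale depends on that gap, which is why magnitudes are comparable within one text only.
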
**Case 1: clear-easy (true grade 1, predicted 1).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| elmanın | +0.54 | | neşeyle | −0.27 |
| o | +0.33 | | güzelce | −0.26 |
| kuralımızdır | +0.27 | | denizi | −0.22 |
| Dünyadaki | +0.24 | | yapıp | −0.20 |
| yansıtır | +0.23 | | üç | −0.16 |

![XAI word heatmap: clear-easy case](xai_easy.png)

**Case 2: clear-hard (true grade 7, predicted 7).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| ilerlerler | +0.26 | | kıyafetlerin | −0.06 |
| abartıya | +0.13 | | hızlı | −0.05 |
| ufkunu | +0.07 | | Kendi | −0.05 |
| koruyanlar | +0.06 | | Kendi | −0.04 |
| sadeliği | +0.06 | | | |

![XAI word heatmap: clear-hard case](xai_hard.png)

**Case 3: boundary error (true grade 6, predicted 7).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| alışkanlıklarımızda | +0.69 | | temizliğimizi | −0.66 |
| yaşamımızdaki | +0.60 | | ve | −0.31 |
| cildimizin | +0.51 | | seçmesi | −0.28 |
| kolaylaştırır | +0.23 | | temelidir | −0.28 |

![XAI word heatmap: boundary-error case](xai_boundary.png)

The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (*alışkanlıklarımızda*, *yaşamımızdaki*, *cildimizin*). This suggests that the model reads morphological and lexical density as a difficulty signal, and that its single-grade error falls on the kind of neighbouring boundary where the difficulty distinction is genuinely soft.
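
The word-level values in the tables above come from merging token attributions back into whole words. A minimal sketch of one way to do that follows, under stated assumptions: WordPiece continuation pieces carry the `##` prefix (as BERT-style tokenizers produce), piece attributions are combined by summing (the card does not say whether it sums or averages), and the token split and numbers are made up for illustration. `merge_wordpieces` is a hypothetical helper, not part of the released code.

```python
def merge_wordpieces(tokens, attributions,
                     special=("[CLS]", "[SEP]", "[PAD]")):
    """Merge WordPiece tokens into whole words, summing their attributions.

    Assumption: continuation pieces start with '##'; special tokens are dropped.
    """
    words, values = [], []
    for tok, val in zip(tokens, attributions):
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the piece onto the current word
            values[-1] += val      # and add its attribution
        else:
            words.append(tok)
            values.append(val)
    return list(zip(words, values))


# Hypothetical split and made-up attribution values.
tokens = ["[CLS]", "alışkanlık", "##larımız", "##da", "ve", "[SEP]"]
attributions = [0.0, 0.25, 0.5, 0.125, -0.25, 0.0]

merged = merge_wordpieces(tokens, attributions)
# Rank words from most difficulty-increasing to most difficulty-decreasing.
for word, value in sorted(merged, key=lambda wv: wv[1], reverse=True):
    print(f"{word:>22s} {value:+.3f}")
```

Summing preserves the completeness property at word level, since the word attributions still add up to the same total as the token attributions. It also means long, heavily suffixed words collect attribution from several pieces, which is worth keeping in mind when reading the tables.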