---
language:
- tr
license: mit
library_name: pytorch
pipeline_tag: text-classification
base_model: dbmdz/bert-base-turkish-cased
datasets:
- behAIvNET/KOZTAM
tags:
- turkish
- education
- special-education
- K-8
- reading
- reading-difficulty
- reading-texts
- synthetic-dataset
- hearing-loss
- cochlear
- bert
- berturk
- coral
- ordinal-regression
- xai
- integrated-gradients
metrics:
- mae
- accuracy
---

# KOZTAM: Koklear Okuma Zorluğu Tahminleme Modeli

**KOZTAM** (*Cochlear Reading-Difficulty Forecasting Model*) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8). It combines a frozen BERTurk encoder with a lightweight CORAL ordinal head and reaches **0.33 ordinal MAE**, **0.72 exact-grade accuracy**, and **0.96 within-one-grade accuracy** on held-out test data.

## Data

KOZTAM is trained on the [`behAIvNET/KOZTAM`](https://huggingface.co/datasets/behAIvNET/KOZTAM) dataset: **1,588 synthetic Turkish texts** spanning grades 1–8 in two categories (informative / narrative), generated under strict readability targets aligned with MEB (the Turkish Ministry of National Education). One exact-duplicate text (present under both a grade-3 and a grade-5 folder) was removed to eliminate a conflicting ordinal label, giving the final 1,588.

The texts are partitioned into **1,110 training / 239 validation / 239 test** by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split. Because the BERTurk backbone is frozen, each text's `[CLS]` embedding (768-d) is pre-computed once and cached, and only the head is trained on those cached vectors. The mean `[CLS]` norm is ≈ 25.2 in all three splits, which is consistent with no gross distributional shift between them.

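A two-stage stratified split of this kind can be sketched with the standard library alone. This is a minimal illustration, not the authors' script: the helper names `stratified_holdout` and `stratum_of`, the 15% holdout fractions, the seed, and the toy corpus of 100 items per stratum are all assumptions chosen so the arithmetic is easy to follow.

```python
import random
from collections import defaultdict

def stratified_holdout(items, key, frac, rng):
    """Split `items` into (rest, held), holding out about `frac` of every stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    rest, held = [], []
    for stratum in sorted(strata):            # fixed order for reproducibility
        members = strata[stratum][:]
        rng.shuffle(members)
        n_held = round(len(members) * frac)
        held.extend(members[:n_held])
        rest.extend(members[n_held:])
    return rest, held

def stratum_of(text):
    """One of the 16 grade x category strata."""
    return (text["grade"], text["category"])

# Toy corpus (hypothetical): 8 grades x 2 categories x 100 texts each.
corpus = [{"id": f"{g}-{c}-{i}", "grade": g, "category": c}
          for g in range(1, 9)
          for c in ("informative", "narrative")
          for i in range(100)]

rng = random.Random(42)
# Stage 1: hold out the test split from the full corpus.
trainval, test = stratified_holdout(corpus, stratum_of, 0.15, rng)
# Stage 2: hold out validation from the remainder (15% of the original total).
train, val = stratified_holdout(trainval, stratum_of, 0.15 / 0.85, rng)

print(len(train), len(val), len(test))
```

Because every stratum is split separately, each of the three parts keeps the same grade and category proportions as the full corpus, up to rounding within a stratum.
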
A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but **drops at grade 8** (grade-8 texts are shorter, in both categories, than those of grades 5–7). Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model cannot lean on length alone and has to rely on lexical and syntactic density.

## Model

The snippet below loads the frozen encoder, the trained head, and the calibrated thresholds from the checkpoint, then runs inference.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LEN = 512  # BERT's maximum input length; longer texts are truncated
K = 8          # number of grades

# Frozen BERTurk encoder: only used to produce [CLS] embeddings.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
for p in bert.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def encode_cls(texts):
    """Return the 768-d [CLS] embedding of each text."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
    return bert(**enc).last_hidden_state[:, 0, :]

class CoralHead(nn.Module):
    """CORAL head: one shared weight vector plus K-1 rank-specific biases."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(num_classes - 1))
    def forward(self, x):
        return self.fc(x) + self.bias

class KOZTAMNet(nn.Module):
    """Projection MLP feeding an ordinal (grade) head and a binary category head."""
    def __init__(self, in_dim=768, p=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(p),
            nn.Linear(256, 64))
        self.ordinal = CoralHead(64, K)
        self.category = nn.Linear(64, 1)
    def forward(self, x):
        z = self.proj(x)
        return z, self.ordinal(z), self.category(z).squeeze(-1)

# The checkpoint stores the head weights and the seven calibrated thresholds.
ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
head = KOZTAMNet().to(DEVICE)
head.load_state_dict(ckpt["state_dict"])
head.eval()
thresholds = torch.tensor(ckpt["coral_thresholds"], device=DEVICE)

CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}

@torch.no_grad()
def predict(texts):
    """Grade one text (str) or several texts (iterable of str)."""
    single = isinstance(texts, str)
    cls = encode_cls([texts] if single else list(texts))
    z = head.proj(cls)
    # Continuous difficulty score: the shared CORAL projection without the biases.
    score = head.ordinal.fc(z).squeeze(-1)
    # Grade = 1 + number of calibrated thresholds the score exceeds.
    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1
    cat = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
           for g, s, c in zip(grade, score, cat)]
    return out[0] if single else out

print(predict("Sample text."))
```

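In `CoralHead`, every rank logit is the same scalar score plus a rank-specific bias, so the default rule "count the logits whose sigmoid exceeds 0.5" is the same as comparing the score with thresholds placed at the negated biases. The pure-Python sketch below illustrates that equivalence and the "1 + number of thresholds below the score" rule used in `predict`. The bias values are made up for illustration and are not the trained parameters; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grade_from_logits(score, biases):
    """Default CORAL decoding: 1 + number of rank logits with sigmoid > 0.5."""
    return 1 + sum(sigmoid(score + b) > 0.5 for b in biases)

def grade_from_thresholds(score, thresholds):
    """Threshold decoding as in predict(): 1 + number of thresholds below the score."""
    return 1 + sum(score > t for t in thresholds)

# Hypothetical biases for K = 8 grades (K - 1 = 7 values), not the trained ones.
biases = [3.0, 1.5, 0.6, 0.0, -0.5, -1.1, -2.5]
# sigmoid(s + b) > 0.5  <=>  s + b > 0  <=>  s > -b
default_thresholds = [-b for b in biases]

example_scores = (-6.9, -1.0, 0.15, 1.4, 4.1)
for s in example_scores:
    assert grade_from_logits(s, biases) == grade_from_thresholds(s, default_thresholds)
    print(s, grade_from_thresholds(s, default_thresholds))
```

Seen this way, replacing the default cut points with calibrated ones changes only the list of thresholds; the score itself, and therefore the ranking of texts, is untouched.
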
## Performance

All results below are computed on the held-out **test set (239 texts)**, which the model never saw during training.

**Headline metrics (calibrated).**

| Metric | Value |
|---|---|
| Ordinal MAE | **0.326** |
| Exact-grade accuracy | **0.715** |
| ±1-grade accuracy | **0.958** |
| Spearman ρ (predicted vs. true grade) | **0.965** |
| Text-category accuracy | **1.000** |

**Threshold calibration.** KOZTAM has two parts: the trained network, which outputs a continuous *difficulty score*, and seven **calibrated thresholds** that cut that score into discrete grades (stored in the checkpoint and applied at inference). The raw network already ranks texts near-perfectly (continuous-score Spearman ρ = **0.966**), but the default 0.5 cut on each rank probability misplaced the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, **no re-training**) produced the final metrics:

| Metric (test) | Raw (0.5-threshold) | Calibrated |
|---|---|---|
| Ordinal MAE | 0.837 | 0.326 |
| Exact accuracy | 0.423 | 0.715 |
| ±1 accuracy | 0.782 | 0.958 |
| Spearman ρ | 0.924 | 0.965 |

On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The held-out test MAE of 0.326 is close to that in-sample figure, which indicates that the calibration generalizes rather than overfitting the validation split.

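The calibration step can be sketched as a coordinate descent over the seven cut points. This is an illustration under assumptions, not the authors' code: the candidate grid (midpoints between sorted scores), the sweep order, the stopping rule, the deliberately poor starting thresholds, and the synthetic scores are all hypothetical.

```python
import random

def decode(score, thresholds):
    """Grade = 1 + number of thresholds the score exceeds."""
    return 1 + sum(score > t for t in thresholds)

def mae(scores, grades, thresholds):
    return sum(abs(decode(s, thresholds) - g)
               for s, g in zip(scores, grades)) / len(scores)

def calibrate(scores, grades, init, max_sweeps=20):
    """Coordinate descent: move one threshold at a time to the candidate
    position that lowers MAE, keeping the thresholds strictly ordered."""
    ordered = sorted(scores)
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    th = sorted(init)
    best = mae(scores, grades, th)
    for _ in range(max_sweeps):
        improved = False
        for k in range(len(th)):
            lo = th[k - 1] if k > 0 else float("-inf")
            hi = th[k + 1] if k + 1 < len(th) else float("inf")
            for c in candidates:
                if not lo < c < hi:
                    continue
                trial = th[:k] + [c] + th[k + 1:]
                m = mae(scores, grades, trial)
                if m < best:              # accept strict improvements only
                    best, th, improved = m, trial, True
        if not improved:                  # a full sweep changed nothing
            break
    return th, best

# Synthetic "validation" data: 20 texts per grade, scores centred on (grade - 4.5).
rng = random.Random(0)
grades = [g for g in range(1, 9) for _ in range(20)]
scores = [(g - 4.5) + rng.gauss(0, 0.3) for g in grades]

init = [-1.0, -0.7, -0.4, 0.0, 0.4, 0.7, 1.0]   # cut points bunched in the middle
before = mae(scores, grades, init)
thresholds, after = calibrate(scores, grades, init)
print(f"MAE before: {before:.3f}, after: {after:.3f}")
```

Only the thresholds move; the scores are fixed inputs, which is why no re-training is involved. In the setup described above, the thresholds are fitted on the validation split and then applied unchanged to the test split.
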
**Continuous difficulty score by grade (test).** The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head, working on the frozen encoder's embeddings, learned the reading-difficulty ordering:

| Grade | Mean score ± SD |
|---|---|
| 1 | −6.91 ± 1.18 |
| 2 | −2.24 ± 1.52 |
| 3 | −0.99 ± 0.48 |
| 4 | −0.33 ± 0.45 |
| 5 | 0.15 ± 0.29 |
| 6 | 0.73 ± 0.50 |
| 7 | 1.43 ± 0.63 |
| 8 | 4.07 ± 1.49 |

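As a quick check of the table, the snippet below takes the mean scores copied from it, confirms the strict ordering, and lists the gap between neighbouring grades. The variable names are illustrative.

```python
# Mean difficulty score per grade, copied from the table above (test split).
mean_score = {1: -6.91, 2: -2.24, 3: -0.99, 4: -0.33,
              5: 0.15, 6: 0.73, 7: 1.43, 8: 4.07}

grade_order = sorted(mean_score)
gaps = {(a, b): round(mean_score[b] - mean_score[a], 2)
        for a, b in zip(grade_order, grade_order[1:])}

is_monotone = all(gap > 0 for gap in gaps.values())
narrowest = min(gaps, key=gaps.get)

print(is_monotone)   # True: no inversion between neighbouring grades
print(gaps)
print(narrowest)     # pair of neighbouring grades with the smallest gap
```

The middle of the scale (grades 3 to 7) is packed into a much narrower score range than the two ends, which is where adjacent-grade confusions are most likely.
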
**Per-grade MAE (calibrated, test).**

| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |

The model is sharpest at the extremes (grades 1 and 8) and at distinct mid-levels; its largest residual errors sit at the **3–4 and 6–7 boundaries** (grades 4 and 7 have the highest MAE), where texts are genuinely close in readability. On the raw predictions, MAE was balanced across the two categories (informative 0.832 / narrative 0.842), and truncated texts (> 512 tokens, n = 12) were only marginally harder than untruncated ones (MAE 1.000 vs. 0.828), so neither category nor truncation appears to be a source of systematic error.

**Confusion matrix (calibrated, test).** Rows = true grade, columns = predicted grade.

| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| **1** | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| **2** | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
| **3** | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
| **4** | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
| **5** | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
| **6** | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
| **7** | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
| **8** | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |

|  |
|
|
|  |
|
|
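The headline and per-grade figures can be recomputed directly from the confusion matrix above. The short sketch below does so with plain Python; the function name `ordinal_metrics` is illustrative.

```python
# Confusion matrix copied from the table above
# (rows = true grade 1..8, columns = predicted grade 1..8).
CM = [
    [29, 1, 0, 0, 0, 0, 0, 0],
    [3, 26, 1, 0, 0, 0, 0, 0],
    [0, 9, 20, 1, 0, 0, 0, 0],
    [0, 1, 11, 12, 3, 2, 0, 0],
    [0, 0, 1, 5, 19, 5, 0, 0],
    [0, 0, 0, 2, 4, 22, 1, 1],
    [0, 0, 0, 0, 2, 8, 15, 5],
    [0, 0, 0, 0, 0, 1, 1, 28],
]

def ordinal_metrics(cm):
    """Return (n, MAE, exact accuracy, within-one accuracy, per-grade MAE)."""
    n = sum(map(sum, cm))
    cells = [(i, j, c) for i, row in enumerate(cm) for j, c in enumerate(row)]
    mae = sum(abs(i - j) * c for i, j, c in cells) / n
    exact = sum(c for i, j, c in cells if i == j) / n
    within_one = sum(c for i, j, c in cells if abs(i - j) <= 1) / n
    per_grade = [sum(abs(i - j) * c for j, c in enumerate(row)) / sum(row)
                 for i, row in enumerate(cm)]
    return n, mae, exact, within_one, per_grade

n, mae, exact, within_one, per_grade = ordinal_metrics(CM)
print(n, round(mae, 3), round(exact, 3), round(within_one, 3))
# -> 239 0.326 0.715 0.958
print([round(m, 3) for m in per_grade])
# -> [0.033, 0.133, 0.333, 0.69, 0.4, 0.367, 0.567, 0.1]
```

The recomputed values agree with the headline metrics and with the per-grade MAE table.
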
**Ablation (test).** Each loss component was removed in turn, with everything else (seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration) held identical. The ρ column is the Spearman correlation of the continuous score, before thresholding.

| Variant | Continuous-score ρ | MAE | Exact acc | ±1 acc |
|---|---|---|---|---|
| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
| − category | 0.964 | 0.410 | 0.636 | 0.954 |
| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |

Removing the ordinal-aware triplet loss raises MAE by about 0.075 while the ranking ρ of the continuous score is essentially unchanged. This suggests that the triplet term improves the *even spacing* of the embedding axis (its calibratability) rather than the ranking itself. Removing the category loss has a similar effect (MAE 0.410).

## Explainability (XAI)

Reading-difficulty attributions are produced with **Integrated Gradients (IG)** on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back to whole words. Before attribution, the end-to-end score is verified against the cached score of the same text: they match to within **1.1 × 10⁻⁵**, confirming that IG explains the exact deployed model. IG is run with `n_steps = 50`; the convergence delta is 0.056, negligible relative to the score magnitude, so the completeness axiom holds to a good approximation.

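Merging sub-word attributions back to words can be sketched as follows. The card does not say how the pieces are combined; summing them is an assumption here (it keeps the total attribution unchanged), and the helper name, the example split of the word, and the numbers are hypothetical. The `##` continuation prefix is the WordPiece convention used by BERT tokenizers.

```python
def merge_wordpieces(tokens, attributions, special=("[CLS]", "[SEP]", "[PAD]")):
    """Merge WordPiece tokens into words, summing their attributions (assumed rule)."""
    words, scores = [], []
    for tok, attr in zip(tokens, attributions):
        if tok in special:
            continue                      # special tokens are not words
        if tok.startswith("##") and words:
            words[-1] += tok[2:]          # continuation piece: glue to previous word
            scores[-1] += attr
        else:
            words.append(tok)             # start of a new word
            scores.append(attr)
    return list(zip(words, scores))

# Hypothetical tokenization and attribution values, for illustration only.
tokens = ["[CLS]", "alışkanlık", "##larımız", "##da", "ve", "[SEP]"]
attrs = [0.0, 0.25, 0.25, 0.125, -0.25, 0.0]
merged = merge_wordpieces(tokens, attrs)
print(merged)
# -> [('alışkanlıklarımızda', 0.625), ('ve', -0.25)]
```

With a summing rule, a long word split into many pieces accumulates the contributions of all its pieces, which is one reason morphologically heavy words can end up with large word-level attributions.
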
Positive attribution means a word pushes the difficulty score **up** (harder); negative means it pushes it **down** (easier). Three representative test cases are shown below. Because IG distributes the difference between each text's score and the score of an empty-content baseline, the absolute magnitude of the attributions scales with how far a text's score sits from that baseline, so magnitudes are comparable *within* a text, not across texts.

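The completeness property behind this can be illustrated on a toy function. The sketch below is not the model's attribution code: it applies a midpoint Riemann-sum approximation of IG to a hypothetical two-input function with a hand-written gradient, and reports the convergence delta, that is, the gap between the summed attributions and `f(x) - f(baseline)`.

```python
import math

def f(x):
    """Toy differentiable score (hypothetical stand-in for the model)."""
    return math.tanh(x[0]) + x[0] * x[1]

def grad_f(x):
    """Analytic gradient of f."""
    return [1.0 - math.tanh(x[0]) ** 2 + x[1], x[0]]

def integrated_gradients(x, baseline, n_steps=50):
    """Midpoint Riemann approximation of IG along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            total[i] += g
    # Attribution_i = (x_i - baseline_i) * average gradient_i along the path.
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]

x, baseline = [1.5, -0.5], [0.0, 0.0]
attr = integrated_gradients(x, baseline)
delta = sum(attr) - (f(x) - f(baseline))
print(attr, delta)
```

The attributions sum to the score difference up to a small numerical error, so their overall size is set by how far the input's score lies from the baseline's. The delta reported for the real model at `n_steps = 50` plays the same role as `delta` here.
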
**Case 1: clear-easy (true grade 1, predicted 1).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| elmanın | +0.54 | | neşeyle | −0.27 |
| o | +0.33 | | güzelce | −0.26 |
| kuralımızdır | +0.27 | | denizi | −0.22 |
| Dünyadaki | +0.24 | | yapıp | −0.20 |
| yansıtır | +0.23 | | üç | −0.16 |

![IG, easy case](Figures/IG_Easy.png)

**Case 2: clear-hard (true grade 7, predicted 7).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| ilerlerler | +0.26 | | kıyafetlerin | −0.06 |
| abartıya | +0.13 | | hızlı | −0.05 |
| ufkunu | +0.07 | | Kendi | −0.05 |
| koruyanlar | +0.06 | | Kendi | −0.04 |
| sadeliği | +0.06 | | | |

![IG, hard case](Figures/IG_Hard.png)

**Case 3: boundary error (true grade 6, predicted 7).**

| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| alışkanlıklarımızda | +0.69 | | temizliğimizi | −0.66 |
| yaşamımızdaki | +0.60 | | ve | −0.31 |
| cildimizin | +0.51 | | seçmesi | −0.28 |
| kolaylaştırır | +0.23 | | temelidir | −0.28 |

![IG, boundary case](Figures/IG_Boundary.png)

The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (*alışkanlıklarımızda*, *yaşamımızdaki*, *cildimizin*). This suggests that the model reads morphological and lexical density as a difficulty signal, and that its single-grade error falls exactly on the kind of neighbouring boundary where the difficulty distinction is genuinely soft.