---
language:
- tr
license: mit
library_name: pytorch
pipeline_tag: text-classification
base_model: dbmdz/bert-base-turkish-cased
datasets:
- behAIvNET/KOZTAM
tags:
- turkish
- education
- special-education
- K-8
- reading
- reading-difficulty
- reading-texts
- synthetic-dataset
- hearing-loss
- cochlear
- bert
- berturk
- coral
- ordinal-regression
- xai
- integrated-gradients
metrics:
- mae
- accuracy
---
# KOZTAM — Koklear Okuma Zorluğu Tahminleme Modeli
**KOZTAM** (*Cochlear Reading-Difficulty Forecasting Model*) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8) with a frozen BERTurk encoder and a lightweight CORAL ordinal head, reaching **0.33 ordinal MAE**, **0.72 exact-grade accuracy**, and **0.96 within-one-grade accuracy** on held-out test data.
## Data
KOZTAM is trained on the [`behAIvNET/KOZTAM`](https://huggingface.co/datasets/behAIvNET/KOZTAM) dataset: **1,588 synthetic Turkish texts** spanning grades 1–8 in two categories (informative / narrative), generated under strict readability targets aligned with MEB (the Turkish Ministry of National Education). One exact-duplicate text (present under both a grade-3 and a grade-5 folder) was removed to eliminate a conflicting ordinal label, leaving the final 1,588.
The texts are partitioned into **1,110 training / 239 validation / 239 test** by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split. Because the BERTurk backbone is frozen, each text's `[CLS]` embedding (768-d) is pre-computed once and cached, and only the head is trained on those cached vectors. The mean `[CLS]` norm is ≈ 25.2 in all three splits, which is consistent with there being no distributional shift between them.
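The two-stage stratified split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the 70/15/15 proportions are inferred from the reported split sizes, and the function name, seed, and toy corpus are hypothetical.

```python
import random
from collections import defaultdict


def two_stage_stratified_split(items, key, frac_train=0.70, seed=0):
    """Split items into train/val/test while stratifying on key(item).

    Stage 1 carves the training share out of each stratum;
    stage 2 halves the remainder into validation and test.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    train, val, test = [], [], []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        n_train = round(len(members) * frac_train)  # stage 1
        rest = members[n_train:]
        n_val = len(rest) // 2                      # stage 2
        train += members[:n_train]
        val += rest[:n_val]
        test += rest[n_val:]
    return train, val, test


# Toy corpus: 8 grades x 2 categories, 20 texts per stratum (made-up sizes).
corpus = [{"id": f"{g}-{c}-{i}", "grade": g, "category": c}
          for g in range(1, 9)
          for c in ("informative", "narrative")
          for i in range(20)]

train, val, test = two_stage_stratified_split(
    corpus, key=lambda t: (t["grade"], t["category"]))
print(len(train), len(val), len(test))
```

Because the split is done inside each of the 16 strata, every grade × category combination appears in all three splits in the same proportion.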
A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but **drops at grade 8** (shorter, in both categories, than grades 5–7). Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model must rely on lexical and syntactic density rather than on length.
## Model
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LEN = 512
K = 8  # number of grades

# Frozen BERTurk encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
for p in bert.parameters():
    p.requires_grad_(False)


@torch.no_grad()
def encode_cls(texts):
    """Return the 768-d [CLS] embedding of each text."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
    return bert(**enc).last_hidden_state[:, 0, :]


class CoralHead(nn.Module):
    """CORAL head: one shared weight vector plus K - 1 rank biases."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(num_classes - 1))

    def forward(self, x):
        return self.fc(x) + self.bias


class KOZTAMNet(nn.Module):
    """Projection MLP followed by an ordinal (grade) and a category head."""

    def __init__(self, in_dim=768, p=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(p),
            nn.Linear(256, 64))
        self.ordinal = CoralHead(64, K)
        self.category = nn.Linear(64, 1)

    def forward(self, x):
        z = self.proj(x)
        return z, self.ordinal(z), self.category(z).squeeze(-1)


# Load the trained head and the calibrated thresholds from the checkpoint
ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
head = KOZTAMNet().to(DEVICE)
head.load_state_dict(ckpt["state_dict"])
head.eval()
thresholds = torch.tensor(ckpt["coral_thresholds"], device=DEVICE)

CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}


@torch.no_grad()
def predict(texts):
    """Grade a single text (str) or an iterable of texts."""
    single = isinstance(texts, str)
    cls = encode_cls([texts] if single else list(texts))
    z = head.proj(cls)
    # Continuous difficulty score (shared CORAL weight, no bias)
    score = head.ordinal.fc(z).squeeze(-1)
    # Grade = 1 + number of calibrated thresholds the score exceeds
    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1
    cat = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
           for g, s, c in zip(grade, score, cat)]
    return out[0] if single else out


print(predict("Sample text."))
```
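The decoding step in `predict` counts how many thresholds the continuous score exceeds. Assuming the uncalibrated model uses standard CORAL decoding (each rank probability `sigmoid(score + bias_k)` compared against 0.5), that rule is the same as thresholding the score at `-bias_k`, so calibration only moves the cut points. The sketch below checks this equivalence in plain Python; the bias values are hypothetical, not the trained ones.

```python
import math


def grade_from_rank_probs(score, biases):
    """Standard CORAL decoding: count rank probabilities above 0.5."""
    probs = [1.0 / (1.0 + math.exp(-(score + b))) for b in biases]
    return 1 + sum(p > 0.5 for p in probs)


def grade_from_thresholds(score, thresholds):
    """Threshold form, as in predict(): count cut points below the score."""
    return 1 + sum(score > t for t in thresholds)


# Hypothetical biases for K = 8 grades (K - 1 = 7 values).
biases = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]
default_thresholds = [-b for b in biases]

decoded = {}
for s in (-4.2, -0.3, 0.4, 2.5, 5.0):
    assert grade_from_rank_probs(s, biases) == grade_from_thresholds(s, default_thresholds)
    decoded[s] = grade_from_thresholds(s, default_thresholds)
print(decoded)
```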
## Performance
Evaluated on the held-out **test set (239 texts)** the model never saw during training.
**Headline metrics (calibrated).**
| Metric | Value |
|---|---|
| Ordinal MAE | **0.326** |
| Exact-grade accuracy | **0.715** |
| ±1-grade accuracy | **0.958** |
| Spearman ρ (predicted vs. true grade) | **0.965** |
| Text-category accuracy | **1.000** |
**Threshold calibration.** KOZTAM has two parts: the trained network, which outputs a continuous *difficulty score*, and seven **calibrated thresholds** that cut that score into discrete grades (stored in the checkpoint, applied at inference). The raw network already ranks texts near-perfectly — continuous-score Spearman ρ = **0.966** — but the default 0.5-thresholds mis-placed the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, **no re-training**) produced the final metrics:
| Metric (test) | Raw (0.5-threshold) | Calibrated |
|---|---|---|
| Ordinal MAE | 0.837 | 0.326 |
| Exact accuracy | 0.423 | 0.715 |
| ±1 accuracy | 0.782 | 0.958 |
| Spearman ρ | 0.924 | 0.965 |
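On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The test MAE of 0.326 is close to that figure, which indicates that the calibrated thresholds generalize rather than overfit the validation split.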
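A coordinate-descent threshold search of the kind described above might look like the following. It is a sketch under assumptions: the candidate grid (midpoints between sorted scores), the sweep limit, and the toy data are illustrative choices, and the model card does not state how the authors implemented the search.

```python
def decode(score, thresholds):
    """Grade = 1 + number of thresholds the score exceeds."""
    return 1 + sum(score > t for t in thresholds)


def mae(scores, grades, thresholds):
    return sum(abs(decode(s, thresholds) - g)
               for s, g in zip(scores, grades)) / len(scores)


def calibrate(scores, grades, thresholds, sweeps=5):
    """Move one threshold at a time to the candidate that lowers MAE."""
    thresholds = list(thresholds)
    s = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(s, s[1:])]
    best = mae(scores, grades, thresholds)
    for _ in range(sweeps):
        improved = False
        for k in range(len(thresholds)):
            lo = thresholds[k - 1] if k > 0 else float("-inf")
            hi = thresholds[k + 1] if k + 1 < len(thresholds) else float("inf")
            for c in candidates:
                if not lo < c < hi:  # keep the thresholds strictly ordered
                    continue
                trial = thresholds[:k] + [c] + thresholds[k + 1:]
                m = mae(scores, grades, trial)
                if m < best:
                    best, thresholds, improved = m, trial, True
        if not improved:
            break
    return thresholds, best


# Toy validation data with 4 grades (made-up scores).
val_scores = [-3.0, -2.5, -1.0, -0.8, 0.2, 0.4, 1.5, 2.0]
val_grades = [1, 1, 2, 2, 3, 3, 4, 4]
start = [-0.5, 0.0, 0.5]

mae_before = mae(val_scores, val_grades, start)
fitted, mae_after = calibrate(val_scores, val_grades, start)
print(mae_before, fitted, mae_after)
```

Only the cut points change; the scores themselves, and therefore the network weights, stay fixed, which is why no re-training is involved.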
**Continuous difficulty score by grade (test).** The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head, on top of the frozen encoder, learned the reading-difficulty ordering:
| Grade | Mean score ± SD |
|---|---|
| 1 | −6.91 ± 1.18 |
| 2 | −2.24 ± 1.52 |
| 3 | −0.99 ± 0.48 |
| 4 | −0.33 ± 0.45 |
| 5 | 0.15 ± 0.29 |
| 6 | 0.73 ± 0.50 |
| 7 | 1.43 ± 0.63 |
| 8 | 4.07 ± 1.49 |
**Per-grade MAE (calibrated, test).**
| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |
The model is most accurate at the two ends of the scale (grades 1, 2, and 8, each with MAE ≤ 0.133). Its largest residual errors sit at the **3–4 and 6–7 boundaries**, where texts are genuinely close in readability: 11 grade-4 texts are predicted as grade 3, and 8 grade-7 texts as grade 6. On the raw predictions the MAE was balanced across the two categories (informative 0.832 / narrative 0.842), and truncated texts (>512 tokens, n = 12) were only marginally harder (MAE 1.000 vs. 0.828), so neither category nor truncation appears to be a source of systematic error.
**Confusion matrix (calibrated, test).** Rows = true grade, columns = predicted grade.
| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| **1** | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| **2** | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
| **3** | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
| **4** | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
| **5** | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
| **6** | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
| **7** | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
| **8** | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |
![Training dynamics — loss and validation MAE/accuracy over epochs](training_curves.png)
![Confusion matrix — test set](confusion_matrix.png)
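The headline and per-grade figures can be recomputed directly from the confusion matrix. The short check below uses only the counts from the table above (rows are true grades, columns are predicted grades).

```python
confusion = [
    [29, 1, 0, 0, 0, 0, 0, 0],
    [3, 26, 1, 0, 0, 0, 0, 0],
    [0, 9, 20, 1, 0, 0, 0, 0],
    [0, 1, 11, 12, 3, 2, 0, 0],
    [0, 0, 1, 5, 19, 5, 0, 0],
    [0, 0, 0, 2, 4, 22, 1, 1],
    [0, 0, 0, 0, 2, 8, 15, 5],
    [0, 0, 0, 0, 0, 1, 1, 28],
]

# Flatten to (true index, predicted index, count) triples.
cells = [(t, p, n) for t, row in enumerate(confusion) for p, n in enumerate(row)]
total = sum(n for _, _, n in cells)

mae = sum(abs(t - p) * n for t, p, n in cells) / total
exact = sum(n for t, p, n in cells if t == p) / total
within_one = sum(n for t, p, n in cells if abs(t - p) <= 1) / total
per_grade_mae = [sum(abs(t - p) * n for p, n in enumerate(row)) / sum(row)
                 for t, row in enumerate(confusion)]

print(total, round(mae, 3), round(exact, 3), round(within_one, 3))
# 239 0.326 0.715 0.958
print([round(m, 3) for m in per_grade_mae])
# [0.033, 0.133, 0.333, 0.69, 0.4, 0.367, 0.567, 0.1]
```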
**Ablation (test).** Each loss component was removed in turn, with everything else — seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration — held identical.
| Variant | Continuous-score ρ | MAE | Exact acc | ±1 acc |
|---|---|---|---|---|
| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
| − category | 0.964 | 0.410 | 0.636 | 0.954 |
| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |
Removing the ordinal-aware triplet loss raises MAE from 0.326 to 0.402 while the ranking ρ of the continuous score is essentially unchanged (0.966 vs. 0.963), showing that the triplet term improves the *even spacing* of the embedding axis (its calibratability) rather than the ranking itself.
## Explainability (XAI)
Reading-difficulty attributions are produced with **Integrated Gradients (IG)** on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back to whole words. Before attribution, the end-to-end score is verified against the cached score of the same text: they match to within **1.1 × 10⁻⁵**, confirming that IG explains the exact deployed model. IG is run with `n_steps = 50`, and the convergence delta is 0.056, which is small relative to the spread of scores across grades, so the completeness axiom holds to a good approximation.
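To make the convergence delta concrete, here is a self-contained toy version of IG in plain Python. The scalar function `f`, its weights, and the inputs are hypothetical stand-ins for the real difficulty score, and the card does not say which IG implementation the authors used. The delta is the gap between the sum of attributions and `f(x) - f(baseline)`, which completeness says should be zero.

```python
import math


def f(x, w):
    """Toy scalar model standing in for the difficulty score."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def grad_f(x, w):
    """Analytic gradient of f with respect to x."""
    g = 1.0 - f(x, w) ** 2
    return [g * wi for wi in w]


def integrated_gradients(x, baseline, w, n_steps=50):
    """Midpoint Riemann approximation of the IG path integral."""
    avg_grad = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, gi in enumerate(grad_f(point, w)):
            avg_grad[i] += gi / n_steps
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


w = [0.8, -0.5, 0.3]          # hypothetical weights
x = [1.0, 2.0, -1.0]          # hypothetical input
baseline = [0.0, 0.0, 0.0]    # all-zero baseline

attr = integrated_gradients(x, baseline, w)
delta = sum(attr) - (f(x, w) - f(baseline, w))
print(attr, delta)
```

A larger `n_steps` shrinks the delta, at the cost of more gradient evaluations.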
Positive attribution means a word pushes the difficulty score **up** (harder); negative means it pushes **down** (easier). Three representative test cases are shown below. Because IG distributes each text's total score relative to an empty-content baseline, the absolute magnitude of the attributions scales with how far a text's score sits from that baseline — so magnitudes are comparable *within* a text, not across texts.
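Merging token attributions back to whole words can be done by folding WordPiece continuation pieces (prefixed with `##`) into the preceding token. The sketch below sums the pieces, which is one common choice; whether the authors summed or averaged is not stated, and the tokenisation and values shown are made up for illustration.

```python
def merge_wordpieces(tokens, attributions):
    """Sum sub-token attributions into whole words ("##" marks a continuation)."""
    words, values = [], []
    for tok, val in zip(tokens, attributions):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            values[-1] += val
        else:
            words.append(tok)
            values.append(val)
    return list(zip(words, values))


# Hypothetical tokenisation and attribution values.
tokens = ["okul", "##lar", "##ımız", "çok", "güzel"]
attrs = [0.5, 0.25, 0.125, -0.25, -0.5]

merged = merge_wordpieces(tokens, attrs)
print(merged)
```

Summing keeps the total attribution of the text unchanged, so the completeness property of IG carries over from tokens to words.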
**Case 1 — clear-easy (true grade 1, predicted 1).**
| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| elmanın | +0.54 | | neşeyle | −0.27 |
| o | +0.33 | | güzelce | −0.26 |
| kuralımızdır | +0.27 | | denizi | −0.22 |
| Dünyadaki | +0.24 | | yapıp | −0.20 |
| yansıtır | +0.23 | | üç | −0.16 |
![XAI word heatmap — clear-easy case](xai_easy.png)
**Case 2 — clear-hard (true grade 7, predicted 7).**
| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| ilerlerler | +0.26 | | kıyafetlerin | −0.06 |
| abartıya | +0.13 | | hızlı | −0.05 |
| ufkunu | +0.07 | | Kendi | −0.05 |
| koruyanlar | +0.06 | | Kendi | −0.04 |
| sadeliği | +0.06 | | | |
![XAI word heatmap — clear-hard case](xai_hard.png)
**Case 3 — boundary error (true grade 6, predicted 7).**
| Increases difficulty | value | | Decreases difficulty | value |
|---|---|---|---|---|
| alışkanlıklarımızda | +0.69 | | temizliğimizi | −0.66 |
| yaşamımızdaki | +0.60 | | ve | −0.31 |
| cildimizin | +0.51 | | seçmesi | −0.28 |
| kolaylaştırır | +0.23 | | temelidir | −0.28 |
![XAI word heatmap — boundary-error case](xai_boundary.png)
The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (*alışkanlıklarımızda*, *yaşamımızdaki*, *cildimizin*). This suggests that the model reads morphological and lexical density as a difficulty signal, although the strongest negative attribution (*temizliğimizi*) is also a heavily suffixed word, so word length alone does not fix the sign. The single-grade error falls exactly on the kind of neighbouring boundary where the difficulty distinction is genuinely soft.