language:
  - tr
license: mit
library_name: pytorch
pipeline_tag: text-classification
base_model: dbmdz/bert-base-turkish-cased
datasets:
  - behAIvNET/KOZTAM
tags:
  - turkish
  - education
  - special-education
  - K-8
  - reading
  - reading-difficulty
  - reading-texts
  - synthetic-dataset
  - hearing-loss
  - cochlear
  - bert
  - berturk
  - coral
  - ordinal-regression
  - xai
  - integrated-gradients
metrics:
  - mae
  - accuracy

KOZTAM — Koklear Okuma Zorluğu Tahminleme Modeli

KOZTAM (Cochlear Reading-Difficulty Forecasting Model) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8) with a frozen BERTurk encoder and a lightweight CORAL ordinal head, reaching 0.33 ordinal MAE, 0.72 exact-grade accuracy, and 0.96 within-one-grade accuracy on held-out test data.

Data

KOZTAM is trained on the behAIvNET/KOZTAM dataset: 1,588 synthetic Turkish texts spanning grades 1–8 in two categories (informative and narrative), generated under strict readability targets aligned with the MEB (Turkish Ministry of National Education). One exact-duplicate text, present under both a grade-3 and a grade-5 folder, was removed to eliminate a conflicting ordinal label, giving the final 1,588.

The texts are partitioned into 1,110 training / 239 validation / 239 test texts by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split. Because the BERTurk backbone is frozen, each text's 768-d [CLS] embedding is computed once and cached, and only the head is trained on those cached vectors. The mean [CLS] norm is ≈ 25.2 in all three splits, which is consistent with the absence of a gross distributional shift between them.
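
The two-stage stratified split described above can be sketched in plain Python. This is a minimal illustration, not the authors' script: the helper `stratified_split`, the toy corpus, the seed, and the 30% / 50% hold-out fractions (chosen only to approximate the reported 1,110 / 239 / 239 proportions) are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(items, key, frac, seed=0):
    """Split items into (kept, held_out), holding out ~frac of every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    kept, held_out = [], []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        n_out = round(len(members) * frac)
        held_out.extend(members[:n_out])
        kept.extend(members[n_out:])
    return kept, held_out

def stratum(text):
    """One of the 16 grade x category strata."""
    return (text["grade"], text["category"])

# Hypothetical toy corpus: 8 grades x 2 categories, 20 texts per stratum.
corpus = [{"id": f"{g}-{c}-{i}", "grade": g, "category": c}
          for g in range(1, 9)
          for c in ("informative", "narrative")
          for i in range(20)]

# Stage 1: hold out 30% of each stratum. Stage 2: halve the held-out part.
train, rest = stratified_split(corpus, stratum, frac=0.30, seed=42)
val, test = stratified_split(rest, stratum, frac=0.50, seed=42)
print(len(train), len(val), len(test))
```

Because each stage samples within a stratum, every split keeps all 16 grade × category combinations in the same proportions as the full corpus.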

A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but drops at grade 8 (shorter, in both categories, than grades 5–7). Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model must rely on lexical and syntactic density rather than on length.

Model

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

DEVICE     = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "dbmdz/bert-base-turkish-cased"
MAX_LEN    = 512                   # BERT's maximum input length; longer texts are truncated
K          = 8                     # number of ordinal classes (grades 1-8)

# Frozen BERTurk encoder: eval mode, gradients disabled.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
for p in bert.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def encode_cls(texts):
    """Return the 768-d [CLS] embedding for each text in a list of strings."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
    return bert(**enc).last_hidden_state[:, 0, :]

class CoralHead(nn.Module):
    """CORAL ordinal head: one shared score plus K-1 per-threshold biases."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc   = nn.Linear(in_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(num_classes - 1))
    def forward(self, x):
        # (batch, 1) + (K-1,) broadcasts to (batch, K-1) logits for P(grade > k)
        return self.fc(x) + self.bias

class KOZTAMNet(nn.Module):
    def __init__(self, in_dim=768, p=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(p),
            nn.Linear(256, 64))
        self.ordinal  = CoralHead(64, K)    # grade head (ordinal logits)
        self.category = nn.Linear(64, 1)    # informative vs. narrative logit
    def forward(self, x):
        z = self.proj(x)
        # Returns the 64-d embedding, (batch, K-1) ordinal logits, (batch,) category logit.
        return z, self.ordinal(z), self.category(z).squeeze(-1)

ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
head = KOZTAMNet().to(DEVICE)
head.load_state_dict(ckpt["state_dict"])
head.eval()
# Seven calibrated cut points on the continuous score (stored in the checkpoint).
# as_tensor accepts a list or a tensor and pins the dtype to match the score.
thresholds = torch.as_tensor(ckpt["coral_thresholds"], dtype=torch.float32, device=DEVICE)

CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}

@torch.no_grad()
def predict(texts):
    """Grade one text (str) or a sequence of texts; returns a dict or a list of dicts."""
    single = isinstance(texts, str)
    cls   = encode_cls([texts] if single else list(texts))
    z     = head.proj(cls)
    score = head.ordinal.fc(z).squeeze(-1)                        # continuous difficulty score
    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1     # 1 + number of cut points exceeded
    cat   = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
           for g, s, c in zip(grade, score, cat)]
    return out[0] if single else out

print(predict("Sample text."))
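
To make the decoding step in `predict` concrete, the sketch below contrasts the default CORAL rule (threshold each of the K−1 binary probabilities at 0.5) with decoding against cut points on the score, as `predict` does. It is a pure-Python illustration: the bias values and the function names are hypothetical and are not taken from the checkpoint.

```python
import math

def coral_probs(score, biases):
    """P(grade > k) for k = 1..K-1 under a CORAL head: sigmoid(score + b_k)."""
    return [1.0 / (1.0 + math.exp(-(score + b))) for b in biases]

def grade_default(score, biases):
    """Default CORAL decoding: 1 + number of binary tasks with probability > 0.5."""
    return 1 + sum(p > 0.5 for p in coral_probs(score, biases))

def grade_from_cuts(score, thresholds):
    """Score-space decoding, as in predict(): 1 + number of cut points exceeded."""
    return 1 + sum(score > t for t in thresholds)

# Hypothetical biases for K = 8 grades (seven binary tasks).
biases = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]

# sigmoid(score + b) > 0.5 exactly when score > -b, so the default rule is
# the same as placing the cut points at -b_k.
default_cuts = [-b for b in biases]

for s in (-4.0, -0.5, 0.5, 3.5):
    print(s, grade_default(s, biases), grade_from_cuts(s, default_cuts))
```

Seen this way, calibration only moves the cut points away from −b_k; the score itself, and therefore the ranking of texts, is unchanged.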

Performance

All results below are evaluated on the held-out test set (239 texts), which the model never saw during training.

Headline metrics (calibrated).

| Metric | Value |
| --- | --- |
| Ordinal MAE | 0.326 |
| Exact-grade accuracy | 0.715 |
| ±1-grade accuracy | 0.958 |
| Spearman ρ (predicted vs. true grade) | 0.965 |
| Text-category accuracy | 1.000 |

Threshold calibration. KOZTAM has two parts: the trained network, which outputs a continuous difficulty score, and seven calibrated thresholds that cut that score into discrete grades (stored in the checkpoint and applied at inference). The raw network already ranks texts near-perfectly (continuous-score Spearman ρ = 0.966), but the default CORAL decision rule, which thresholds each of the seven binary probabilities at 0.5, misplaced the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, with no re-training) produced the final metrics:

| Metric (test) | Raw (0.5-threshold) | Calibrated |
| --- | --- | --- |
| Ordinal MAE | 0.837 | 0.326 |
| Exact accuracy | 0.423 | 0.715 |
| ±1 accuracy | 0.782 | 0.958 |
| Spearman ρ | 0.924 | 0.965 |

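On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The test MAE of 0.326 is close to that in-sample value, which indicates that the calibration generalizes rather than overfitting the validation split.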
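
The calibration step can be sketched as a coordinate descent over the cut points. The source states only "coordinate descent on MAE" on the validation split, so everything else here is an assumption: the candidate grid (midpoints between consecutive sorted scores), the sweep limit, the function names, and the toy three-grade data.

```python
def grades_from(scores, thresholds):
    """Grade = 1 + number of cut points the score exceeds."""
    return [1 + sum(s > t for t in thresholds) for s in scores]

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def calibrate(scores, true, thresholds, sweeps=10):
    """Move one threshold at a time, keeping any move that lowers MAE."""
    th = sorted(thresholds)
    uniq = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    best = mae(grades_from(scores, th), true)
    for _ in range(sweeps):
        improved = False
        for k in range(len(th)):
            # Keep thresholds ordered: stay between the two neighbours.
            lo = th[k - 1] if k > 0 else float("-inf")
            hi = th[k + 1] if k + 1 < len(th) else float("inf")
            for c in candidates:
                if not lo <= c <= hi:
                    continue
                trial = th[:k] + [c] + th[k + 1:]
                m = mae(grades_from(scores, trial), true)
                if m < best:
                    best, th, improved = m, trial, True
        if not improved:
            break
    return th, best

# Toy validation data (hypothetical): scores are well ordered, but the
# initial cut points are bunched together, as with a default rule.
val_scores = [-3.0, -2.5, -1.0, -0.5, 0.2, 0.4, 2.0, 3.0]
val_grades = [1, 1, 2, 2, 2, 2, 3, 3]
initial = [0.0, 0.1]

raw_mae = mae(grades_from(val_scores, initial), val_grades)
cuts, cal_mae = calibrate(val_scores, val_grades, initial)
print(raw_mae, cal_mae, cuts)
```

No network weights are touched: only the cut points move, which is why a well-ranked but badly thresholded score can improve so much.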

Continuous difficulty score by grade (test). The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head, operating on the frozen encoder's embeddings, learned the reading-difficulty ordering:

| Grade | Mean score ± SD |
| --- | --- |
| 1 | −6.91 ± 1.18 |
| 2 | −2.24 ± 1.52 |
| 3 | −0.99 ± 0.48 |
| 4 | −0.33 ± 0.45 |
| 5 | 0.15 ± 0.29 |
| 6 | 0.73 ± 0.50 |
| 7 | 1.43 ± 0.63 |
| 8 | 4.07 ± 1.49 |

Per-grade MAE (calibrated, test).

| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |

The model is sharpest at the extremes (MAE 0.033 at grade 1 and 0.100 at grade 8). Its largest residual errors are at grade 4 (0.690) and grade 7 (0.567), which the confusion matrix traces mostly to the neighbouring grades 3 and 6, where texts are genuinely close in readability. On the raw (uncalibrated) predictions, MAE was balanced across the two categories (informative 0.832, narrative 0.842), and truncated texts (>512 tokens, n = 12) were only marginally harder than untruncated ones (MAE 1.000 vs. 0.828). Neither category nor truncation therefore appears to be a source of systematic error, although the truncated sample is small.
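
As a quick consistency check on the truncation figures, the two subgroup MAEs should recombine to the overall raw MAE when weighted by group size. The only assumption is that the untruncated group holds the remaining 239 − 12 = 227 test texts.

```python
n_total, n_truncated = 239, 12
n_full = n_total - n_truncated          # assumed: every other test text
mae_truncated, mae_full = 1.000, 0.828  # raw MAE per subgroup, from the text

recombined = (n_truncated * mae_truncated + n_full * mae_full) / n_total
print(round(recombined, 3))  # 0.837, the raw ordinal MAE reported above
```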

Confusion matrix (calibrated, test). Rows = true grade, columns = predicted grade.

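| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
| 5 | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
| 6 | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
| 7 | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |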
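
The headline and per-grade metrics can be recomputed directly from this confusion matrix. The sketch below uses only the counts shown above; the variable names are illustrative.

```python
# Rows = true grade 1..8, columns = predicted grade 1..8 (calibrated, test).
CONFUSION = [
    [29, 1, 0, 0, 0, 0, 0, 0],
    [3, 26, 1, 0, 0, 0, 0, 0],
    [0, 9, 20, 1, 0, 0, 0, 0],
    [0, 1, 11, 12, 3, 2, 0, 0],
    [0, 0, 1, 5, 19, 5, 0, 0],
    [0, 0, 0, 2, 4, 22, 1, 1],
    [0, 0, 0, 0, 2, 8, 15, 5],
    [0, 0, 0, 0, 0, 1, 1, 28],
]

n = sum(map(sum, CONFUSION))
cells = [(i, j, c) for i, row in enumerate(CONFUSION) for j, c in enumerate(row)]

mae = sum(abs(i - j) * c for i, j, c in cells) / n
exact = sum(c for i, j, c in cells if i == j) / n
within_one = sum(c for i, j, c in cells if abs(i - j) <= 1) / n
per_grade_mae = [sum(abs(i - j) * c for j, c in enumerate(row)) / sum(row)
                 for i, row in enumerate(CONFUSION)]

print(n)                                         # 239
print(f"{mae:.3f} {exact:.3f} {within_one:.3f}")  # 0.326 0.715 0.958
print(" ".join(f"{v:.3f}" for v in per_grade_mae))
# 0.033 0.133 0.333 0.690 0.400 0.367 0.567 0.100
```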

Figure: training dynamics (loss and validation MAE/accuracy over epochs).

Figure: confusion matrix on the test set.

Ablation (test). Each loss component was removed in turn, with everything else — seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration — held identical.

| Variant | Pure ρ | MAE | Exact acc | ±1 acc |
| --- | --- | --- | --- | --- |
| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
| − category | 0.964 | 0.410 | 0.636 | 0.954 |
| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |

Removing the ordinal-aware triplet loss raises MAE by about 0.075 (from 0.326 to 0.402), while the pure ranking ρ is essentially unchanged (0.966 vs. 0.963). This suggests that the triplet term improves the even spacing of scores along the embedding axis, and hence how well the score can be calibrated, rather than the ranking itself.

Explainability (XAI)

Reading-difficulty attributions are produced with Integrated Gradients (IG) on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back into whole words. Before attribution, the end-to-end score is verified against the cached score of the same text; the two match to within 1.1 × 10⁻⁵, confirming that IG explains the exact deployed model. IG is run with n_steps = 50. The convergence delta, i.e. the gap between the summed attributions and the difference between the score of the input and that of the baseline, is 0.056, which is small relative to the score magnitude, so the completeness axiom holds to a good approximation.
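
The completeness check can be illustrated on a toy function. The sketch below is not the end-to-end KOZTAM pipeline: the scoring function, its weights, the zero baseline, and the midpoint Riemann rule are all assumptions made for illustration, and the quadrature rule used in the original run may differ.

```python
import math

W = [0.8, -0.5, 0.3]  # hypothetical weights of a toy scoring function

def score(x):
    """Toy differentiable 'difficulty score': tanh(w . x)."""
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))

def grad(x):
    """Analytic gradient of score: (1 - tanh^2(w . x)) * w."""
    s = score(x)
    return [(1.0 - s * s) * w for w in W]

def integrated_gradients(x, baseline, n_steps=50):
    """Midpoint Riemann approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g
    # Attribution_i = (x_i - baseline_i) * average gradient along the path.
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to score(x) - score(baseline).
delta = sum(attributions) - (score(x) - score(baseline))
print(attributions, delta)
```

The `delta` computed here plays the role of the convergence delta: it measures how far the numerical approximation is from satisfying completeness exactly.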

Positive attribution means a word pushes the difficulty score up (harder); negative means it pushes down (easier). Three representative test cases are shown below. Because IG distributes each text's total score relative to an empty-content baseline, the absolute magnitude of the attributions scales with how far a text's score sits from that baseline — so magnitudes are comparable within a text, not across texts.
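
The word-level tables in the cases below depend on two steps: merging sub-token attributions into whole words, and splitting words by the sign of their attribution. A minimal sketch follows, assuming WordPiece-style `##` continuation tokens and summation as the merge rule; the tokenization and attribution values are made up and do not come from the model.

```python
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens, attributions):
    """Sum sub-token attributions into whole-word attributions."""
    words, scores = [], []
    for tok, attr in zip(tokens, attributions):
        if tok in SPECIAL:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the continuation onto the word
            scores[-1] += attr     # assumed merge rule: sum of the pieces
        else:
            words.append(tok)
            scores.append(attr)
    return list(zip(words, scores))

def top_words(word_scores, k=5):
    """Positive attribution raises the score (harder); negative lowers it."""
    up = sorted((ws for ws in word_scores if ws[1] > 0), key=lambda ws: -ws[1])[:k]
    down = sorted((ws for ws in word_scores if ws[1] < 0), key=lambda ws: ws[1])[:k]
    return up, down

# Hypothetical tokenization and attribution values, for illustration only.
tokens = ["[CLS]", "alışkanlık", "##larımız", "##da", "ve", "temiz", "##lik", "[SEP]"]
attrs = [0.0, 0.25, 0.5, 0.125, -0.5, -0.25, 0.125, 0.0]

merged = merge_wordpieces(tokens, attrs)
up, down = top_words(merged)
print(up)
print(down)
```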

Case 1 — clear-easy (true grade 1, predicted 1).

| Increases difficulty value | Decreases difficulty value |
| --- | --- |
| elmanın +0.54 | neşeyle −0.27 |
| o +0.33 | güzelce −0.26 |
| kuralımızdır +0.27 | denizi −0.22 |
| Dünyadaki +0.24 | yapıp −0.20 |
| yansıtır +0.23 | üç −0.16 |

Figure: XAI word heatmap, clear-easy case.

Case 2 — clear-hard (true grade 7, predicted 7).

| Increases difficulty value | Decreases difficulty value |
| --- | --- |
| ilerlerler +0.26 | kıyafetlerin −0.06 |
| abartıya +0.13 | hızlı −0.05 |
| ufkunu +0.07 | Kendi −0.05 |
| koruyanlar +0.06 | Kendi −0.04 |
| sadeliği +0.06 | |

Figure: XAI word heatmap, clear-hard case.

Case 3 — boundary error (true grade 6, predicted 7).

| Increases difficulty value | Decreases difficulty value |
| --- | --- |
| alışkanlıklarımızda +0.69 | temizliğimizi −0.66 |
| yaşamımızdaki +0.60 | ve −0.31 |
| cildimizin +0.51 | seçmesi −0.28 |
| kolaylaştırır +0.23 | temelidir −0.28 |

Figure: XAI word heatmap, boundary-error case.

The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (alışkanlıklarımızda, yaşamımızdaki, cildimizin). This suggests that the model reads morphological and lexical density as a difficulty signal, although word length alone is not decisive, since the similarly long temizliğimizi carries the largest negative attribution (−0.66). The single-grade error also falls on the kind of neighbouring boundary where the difficulty distinction is genuinely soft.