	---
	language:
	- tr
	license: mit
	library_name: pytorch
	pipeline_tag: text-classification
	base_model: dbmdz/bert-base-turkish-cased
	datasets:
	- behAIvNET/KOZTAM
	tags:
	- turkish
	- education
	- special-education
	- K-8
	- reading
	- reading-difficulty
	- reading-texts
	- synthetic-dataset
	- hearing-loss
	- cochlear
	- bert
	- berturk
	- coral
	- ordinal-regression
	- xai
	- integrated-gradients
	metrics:
	- mae
	- accuracy
	---

	# KOZTAM — Koklear Okuma Zorluğu Tahminleme Modeli

	KOZTAM (Cochlear Reading-Difficulty Forecasting Model) grades a Turkish text on an eight-level reading-difficulty scale (grades 1–8) with a frozen BERTurk encoder and a lightweight CORAL ordinal head, reaching 0.33 ordinal MAE, 0.72 exact-grade accuracy, and 0.96 within-one-grade accuracy on held-out test data.

	## Data

	KOZTAM is trained on the [`behAIvNET/KOZTAM`](https://huggingface.co/datasets/behAIvNET/KOZTAM) dataset — 1,588 synthetic Turkish texts spanning grades 1–8 in two categories (informative / narrative), generated under strict MEB-aligned readability targets. One exact-duplicate text (present under both a grade-3 and a grade-5 folder) was removed to eliminate a conflicting ordinal label, giving the final 1,588.

	The texts are partitioned into 1,110 training / 239 validation / 239 test by two-stage stratified sampling over the 16 grade × category strata, so both the ordinal grade distribution and the ≈ 50/50 category balance are preserved in every split. Because the BERTurk backbone is frozen, each text's `[CLS]` embedding (768-d) is pre-computed once and cached, and only the head is trained on those cached vectors; the mean `[CLS]` norm is ≈ 25.2 and identical across the three splits, indicating no distributional shift between them.
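The two-stage stratified split described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the `stratified_split` helper, the toy corpus of 100 texts per stratum, the 30% / 50% fractions, and the seeds are all assumptions chosen to give a roughly 70/15/15 partition.

```python
import random
from collections import defaultdict

def stratified_split(items, key, held_frac, seed):
    """Split items into (rest, held_out), drawing held_frac from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    rest, held_out = [], []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        n_out = round(len(members) * held_frac)
        held_out.extend(members[:n_out])
        rest.extend(members[n_out:])
    return rest, held_out

def stratum(text):
    """One of the 16 grade x category strata."""
    return (text["grade"], text["category"])

# Hypothetical toy corpus: 8 grades x 2 categories, 100 texts per stratum.
corpus = [{"id": f"{g}-{c}-{i}", "grade": g, "category": c}
          for g in range(1, 9)
          for c in ("informative", "narrative")
          for i in range(100)]

# Stage 1: carve off 30% as a temporary pool.
train, pool = stratified_split(corpus, stratum, held_frac=0.30, seed=0)
# Stage 2: halve the pool into validation and test.
val, test = stratified_split(pool, stratum, held_frac=0.50, seed=1)

print(len(train), len(val), len(test))  # 1120 240 240
```

Because every stratum is sampled separately in both stages, each split keeps the same grade distribution and the same category balance as the full corpus.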

	A structural property worth noting for modelling: realized text length rises from grade 1 through grade 7 but drops at grade 8 (shorter, in both categories, than grades 5–7). Length is therefore not a monotone proxy for reading grade at the top of the scale, so the model must rely on lexical and syntactic density rather than on length.

	## Model

	```python
	import torch
	import torch.nn as nn
	from transformers import AutoTokenizer, AutoModel

	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
	MODEL_NAME = "dbmdz/bert-base-turkish-cased"
	MAX_LEN = 512
	K = 8  # number of grades

	tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
	bert = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE).eval()
	for p in bert.parameters():  # the BERTurk backbone stays frozen
	    p.requires_grad_(False)

	@torch.no_grad()
	def encode_cls(texts):
	    """Return the 768-d [CLS] embedding of each text."""
	    enc = tokenizer(texts, padding=True, truncation=True,
	                    max_length=MAX_LEN, return_tensors="pt").to(DEVICE)
	    return bert(**enc).last_hidden_state[:, 0, :]

	class CoralHead(nn.Module):
	    """CORAL head: one shared weight vector plus K-1 rank-specific biases."""
	    def __init__(self, in_dim, num_classes):
	        super().__init__()
	        self.fc = nn.Linear(in_dim, 1, bias=False)
	        self.bias = nn.Parameter(torch.zeros(num_classes - 1))

	    def forward(self, x):
	        return self.fc(x) + self.bias

	class KOZTAMNet(nn.Module):
	    def __init__(self, in_dim=768, p=0.2):
	        super().__init__()
	        self.proj = nn.Sequential(
	            nn.Linear(in_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(p),
	            nn.Linear(256, 64))
	        self.ordinal = CoralHead(64, K)
	        self.category = nn.Linear(64, 1)

	    def forward(self, x):
	        z = self.proj(x)
	        return z, self.ordinal(z), self.category(z).squeeze(-1)

	# The checkpoint holds the head weights and the seven calibrated thresholds.
	ckpt = torch.load("KOZTAM.pt", map_location=DEVICE)
	head = KOZTAMNet().to(DEVICE)
	head.load_state_dict(ckpt["state_dict"])
	head.eval()
	thresholds = torch.tensor(ckpt["coral_thresholds"], device=DEVICE)

	CATEGORY = {0: "bilgilendirici (informative)", 1: "öyküleyici (narrative)"}

	@torch.no_grad()
	def predict(texts):
	    """Grade one text (str) or a list of texts; returns a dict or a list of dicts."""
	    single = isinstance(texts, str)
	    cls = encode_cls([texts] if single else list(texts))
	    z = head.proj(cls)
	    score = head.ordinal.fc(z).squeeze(-1)  # continuous difficulty score
	    # Grade = 1 + number of calibrated thresholds below the score.
	    grade = (score[:, None] > thresholds[None, :]).sum(1) + 1
	    cat = (torch.sigmoid(head.category(z).squeeze(-1)) > 0.5).long()
	    out = [{"grade": int(g), "score": round(float(s), 4), "category": CATEGORY[int(c)]}
	           for g, s, c in zip(grade, score, cat)]
	    return out[0] if single else out

	print(predict("Sample text."))
	```
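`CoralHead.forward` returns `fc(x) + bias`, one logit per grade boundary, while `predict` compares the bias-free score against stored thresholds. The sketch below shows why the two views are interchangeable. It assumes the usual CORAL decoding (a sigmoid per logit, then counting probabilities above 0.5), which the card does not spell out, and the bias values are hypothetical, not the trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_default(score, biases):
    """Assumed standard CORAL decoding: count rank probabilities above 0.5."""
    return 1 + sum(sigmoid(score + b) > 0.5 for b in biases)

def decode_thresholds(score, thresholds):
    """Threshold decoding as in predict(): count cut points below the score."""
    return 1 + sum(score > t for t in thresholds)

# Hypothetical bias values for illustration only.
biases = [3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]
# sigmoid(s + b) > 0.5  <=>  s + b > 0  <=>  s > -b
implied_thresholds = [-b for b in biases]

results = {}
for s in (-4.2, -0.5, 0.3, 2.7):
    grade = decode_default(s, biases)
    assert grade == decode_thresholds(s, implied_thresholds)
    results[s] = grade
print(results)
```

Under this reading, the default 0.5 decoding is just threshold decoding with cut points fixed at the negated biases, so replacing those cut points changes the grade assignment without changing the network or its ranking of texts.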

	## Performance

	All metrics below are computed on the held-out test set (239 texts), which the model never saw during training.

	Headline metrics (calibrated).

	| Metric | Value |
	|---|---|
	| Ordinal MAE | 0.326 |
	| Exact-grade accuracy | 0.715 |
	| ±1-grade accuracy | 0.958 |
	| Spearman ρ (predicted vs. true grade) | 0.965 |
	| Text-category accuracy | 1.000 |

	Threshold calibration. KOZTAM has two parts: the trained network, which outputs a continuous difficulty score, and seven calibrated thresholds that cut that score into discrete grades (stored in the checkpoint, applied at inference). The raw network already ranks texts near-perfectly — continuous-score Spearman ρ = 0.966 — but the default 0.5-thresholds mis-placed the cut points and collapsed predictions toward the extreme grades. Re-placing the thresholds on the validation split (coordinate descent on MAE, no re-training) produced the final metrics:

	| Metric (test) | Raw (0.5-threshold) | Calibrated |
	|---|---|---|
	| Ordinal MAE | 0.837 | 0.326 |
	| Exact accuracy | 0.423 | 0.715 |
	| ±1 accuracy | 0.782 | 0.958 |
	| Spearman ρ | 0.924 | 0.965 |

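	On the validation split, where the thresholds were fitted, the same procedure moved MAE from 0.837 to 0.285. The test MAE of 0.326 is close to that figure, which indicates that the calibration generalizes rather than overfitting the validation split.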
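A coordinate-descent calibration of this kind can be sketched as follows. The card states only that thresholds were re-placed by coordinate descent on validation MAE; the candidate grid (midpoints between neighbouring scores), the sweep limit, and the toy four-grade data below are assumptions made for illustration.

```python
def decode(score, thresholds):
    """Grade = 1 + number of thresholds below the score."""
    return 1 + sum(score > t for t in thresholds)

def mae(scores, grades, thresholds):
    errors = (abs(decode(s, thresholds) - g) for s, g in zip(scores, grades))
    return sum(errors) / len(grades)

def calibrate(scores, grades, thresholds, sweeps=10):
    """Move one threshold at a time to the candidate that lowers MAE."""
    thresholds = list(thresholds)
    pts = sorted(set(scores))
    # Candidate cut points: midpoints between neighbouring observed scores.
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = mae(scores, grades, thresholds)
    for _ in range(sweeps):
        improved = False
        for k in range(len(thresholds)):
            for c in candidates:
                trial = thresholds[:k] + [c] + thresholds[k + 1:]
                trial_mae = mae(scores, grades, trial)
                if trial_mae < best:  # accept strict improvements only
                    thresholds, best, improved = trial, trial_mae, True
        if not improved:
            break
    return sorted(thresholds), best

# Hypothetical validation scores for a toy 4-grade problem.
val_scores = [-3.0, -2.5, -1.0, -0.8, 0.2, 0.4, 2.0, 2.6]
val_grades = [1, 1, 2, 2, 3, 3, 4, 4]
# Badly placed starting thresholds that push everything to grade 1 or 4.
start = [-0.1, 0.0, 0.1]

before = mae(val_scores, val_grades, start)
calibrated, after = calibrate(val_scores, val_grades, start)
print(before, after, calibrated)
```

The network is never touched: only the cut points move, so the ranking of texts by score stays identical while the grade assignment improves.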

	Continuous difficulty score by grade (test). The mean score increases monotonically across all eight grades with no inversion, which is direct evidence that the trained head, operating on the frozen BERTurk embeddings, learned the reading-difficulty ordering:

	| Grade | Mean score ± SD |
	|---|---|
	| 1 | −6.91 ± 1.18 |
	| 2 | −2.24 ± 1.52 |
	| 3 | −0.99 ± 0.48 |
	| 4 | −0.33 ± 0.45 |
	| 5 | 0.15 ± 0.29 |
	| 6 | 0.73 ± 0.50 |
	| 7 | 1.43 ± 0.63 |
	| 8 | 4.07 ± 1.49 |

	Per-grade MAE (calibrated, test).

	| Grade | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
	|---|---|---|---|---|---|---|---|---|
	| MAE | 0.033 | 0.133 | 0.333 | 0.690 | 0.400 | 0.367 | 0.567 | 0.100 |

	The model is sharpest at grades 1, 2, and 8 (MAE ≤ 0.133). Its largest residual errors sit at grades 4 and 7 (MAE 0.690 and 0.567), driven by confusions across the 3–4 and 6–7 boundaries, where texts are genuinely close in readability. On the raw (uncalibrated) predictions, MAE was balanced across the two categories (informative 0.832 / narrative 0.842), and truncated texts (>512 tokens, n = 12) were only marginally harder than untruncated ones (MAE 1.000 vs. 0.828), so neither category nor truncation is a source of systematic error.

	Confusion matrix (calibrated, test). Rows = true grade, columns = predicted grade.

	| true \ pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
	|---|---|---|---|---|---|---|---|---|
	| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
	| 2 | 3 | 26 | 1 | 0 | 0 | 0 | 0 | 0 |
	| 3 | 0 | 9 | 20 | 1 | 0 | 0 | 0 | 0 |
	| 4 | 0 | 1 | 11 | 12 | 3 | 2 | 0 | 0 |
	| 5 | 0 | 0 | 1 | 5 | 19 | 5 | 0 | 0 |
	| 6 | 0 | 0 | 0 | 2 | 4 | 22 | 1 | 1 |
	| 7 | 0 | 0 | 0 | 0 | 2 | 8 | 15 | 5 |
	| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 28 |
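The headline and per-grade metrics can be recomputed directly from this confusion matrix. The short script below copies the counts from the table (rows = true grade, columns = predicted grade) and uses only the standard library.

```python
# Confusion matrix from the table above: rows = true grade, columns = predicted.
confusion = [
    [29, 1, 0, 0, 0, 0, 0, 0],
    [3, 26, 1, 0, 0, 0, 0, 0],
    [0, 9, 20, 1, 0, 0, 0, 0],
    [0, 1, 11, 12, 3, 2, 0, 0],
    [0, 0, 1, 5, 19, 5, 0, 0],
    [0, 0, 0, 2, 4, 22, 1, 1],
    [0, 0, 0, 0, 2, 8, 15, 5],
    [0, 0, 0, 0, 0, 1, 1, 28],
]

cells = [(i, j, n) for i, row in enumerate(confusion) for j, n in enumerate(row)]
total = sum(n for _, _, n in cells)

ordinal_mae = sum(n * abs(i - j) for i, j, n in cells) / total
exact_acc = sum(n for i, j, n in cells if i == j) / total
within_one = sum(n for i, j, n in cells if abs(i - j) <= 1) / total
per_grade_mae = [
    round(sum(n * abs(i - j) for j, n in enumerate(row)) / sum(row), 3)
    for i, row in enumerate(confusion)
]

print(total, round(ordinal_mae, 3), round(exact_acc, 3), round(within_one, 3))
# 239 0.326 0.715 0.958
print(per_grade_mae)
# [0.033, 0.133, 0.333, 0.69, 0.4, 0.367, 0.567, 0.1]
```

The recomputed values agree with the headline table and the per-grade MAE row, and the row sums show that grade 4 has 29 test texts while every other grade has 30.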

	![Training dynamics — loss and validation MAE/accuracy over epochs](training_curves.png)

	![Confusion matrix — test set](confusion_matrix.png)

	Ablation (test). Each loss component was removed in turn, with everything else — seed, initialization, batch order, scheduler, early stopping, and the same post-hoc threshold calibration — held identical.

	| Variant | Pure ρ (continuous score) | MAE | Exact acc | ±1 acc |
	|---|---|---|---|---|
	| Full (ordinal + triplet + category) | 0.966 | 0.326 | 0.715 | 0.958 |
	| − triplet | 0.963 | 0.402 | 0.636 | 0.962 |
	| − category | 0.964 | 0.410 | 0.636 | 0.954 |
	| Ordinal only | 0.962 | 0.410 | 0.628 | 0.962 |

	Removing the ordinal-aware triplet loss raises MAE by 0.076 (0.326 to 0.402) while the pure ranking ρ is essentially unchanged (0.966 vs. 0.963). This suggests the triplet term improves how evenly the grades are spaced along the score axis, and therefore how well thresholds can be calibrated on it, rather than the ranking itself.

	## Explainability (XAI)

	Reading-difficulty attributions are produced with Integrated Gradients (IG) on an end-to-end version of the model (frozen BERTurk → continuous difficulty score), with token attributions merged back to whole words. Before attribution, the end-to-end score is verified against the cached score of the same text; the two match to within 1.1 × 10⁻⁵, confirming that IG explains the exact deployed model. IG is run with `n_steps = 50`, and the convergence delta is 0.056, which is small relative to the score magnitude, so the completeness axiom holds to a good approximation.
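The completeness check can be illustrated on a toy function. The sketch below is a hand-rolled midpoint-rule approximation of IG, not the attribution library used for the card. The scorer `f`, its analytic gradient, the input, and the all-zeros baseline are hypothetical stand-ins for the real difficulty score and its embedding inputs.

```python
import math

def f(x):
    """Toy differentiable scorer standing in for the difficulty score."""
    return math.tanh(0.8 * x[0] - 0.5 * x[1]) + 0.3 * x[2] ** 2

def grad_f(x):
    """Analytic gradient of f."""
    u = 0.8 * x[0] - 0.5 * x[1]
    sech2 = 1.0 - math.tanh(u) ** 2
    return [0.8 * sech2, -0.5 * sech2, 0.6 * x[2]]

def integrated_gradients(x, baseline, n_steps=50):
    """Midpoint Riemann approximation of IG along the straight path baseline -> x."""
    attr = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            attr[i] += g * (x[i] - baseline[i]) / n_steps
    return attr

x = [1.5, -0.7, 2.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
delta = abs(sum(attr) - (f(x) - f(baseline)))
print([round(a, 4) for a in attr], delta)
```

The convergence delta is the gap between the summed attributions and the score difference. It shrinks as `n_steps` grows, so a small delta means the attributions account for almost all of the score.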

	Positive attribution means a word pushes the difficulty score up (harder); negative means it pushes down (easier). Three representative test cases are shown below. Because IG distributes each text's total score relative to an empty-content baseline, the absolute magnitude of the attributions scales with how far a text's score sits from that baseline — so magnitudes are comparable within a text, not across texts.
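Merging token attributions back to whole words, and then listing the strongest positive and negative words, might look like the sketch below. It assumes BERT-style WordPiece tokens with a `##` continuation prefix and assumes that a word's attribution is the sum of its pieces (the card does not say whether pieces are summed or averaged). The tokenization and the attribution values are invented for illustration.

```python
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens, attributions):
    """Sum the attributions of '##' continuation pieces into their word."""
    words, scores = [], []
    for tok, a in zip(tokens, attributions):
        if tok in SPECIAL:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            scores[-1] += a
        else:
            words.append(tok)
            scores.append(a)
    return list(zip(words, scores))

def top_words(word_attr, k=5):
    """Return the k strongest difficulty-raising and difficulty-lowering words."""
    pos = sorted((w for w in word_attr if w[1] > 0), key=lambda w: -w[1])[:k]
    neg = sorted((w for w in word_attr if w[1] < 0), key=lambda w: w[1])[:k]
    return pos, neg

# Hypothetical tokenization and per-piece attributions (not model output).
tokens = ["[CLS]", "alışkanlık", "##larımız", "##da", "ve",
          "temizli", "##ğimizi", "[SEP]"]
attributions = [0.0, 0.5, 0.25, 0.125, -0.25, -0.5, -0.125, 0.0]

merged = merge_wordpieces(tokens, attributions)
increases, decreases = top_words(merged)
print(increases)
print(decreases)
```

Summing keeps the per-text total unchanged, so the merged word scores still add up to the same score difference as the token-level attributions.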

	Case 1 — clear-easy (true grade 1, predicted 1).

	| Increases difficulty | value | | Decreases difficulty | value |
	|---|---|---|---|---|
	| elmanın | +0.54 | | neşeyle | −0.27 |
	| o | +0.33 | | güzelce | −0.26 |
	| kuralımızdır | +0.27 | | denizi | −0.22 |
	| Dünyadaki | +0.24 | | yapıp | −0.20 |
	| yansıtır | +0.23 | | üç | −0.16 |

	![XAI word heatmap — clear-easy case](xai_easy.png)

	Case 2 — clear-hard (true grade 7, predicted 7).

	| Increases difficulty | value | | Decreases difficulty | value |
	|---|---|---|---|---|
	| ilerlerler | +0.26 | | kıyafetlerin | −0.06 |
	| abartıya | +0.13 | | hızlı | −0.05 |
	| ufkunu | +0.07 | | Kendi | −0.05 |
	| koruyanlar | +0.06 | | Kendi | −0.04 |
	| sadeliği | +0.06 | | | |

	![XAI word heatmap — clear-hard case](xai_hard.png)

	Case 3 — boundary error (true grade 6, predicted 7).

	| Increases difficulty | value | | Decreases difficulty | value |
	|---|---|---|---|---|
	| alışkanlıklarımızda | +0.69 | | temizliğimizi | −0.66 |
	| yaşamımızdaki | +0.60 | | ve | −0.31 |
	| cildimizin | +0.51 | | seçmesi | −0.28 |
	| kolaylaştırır | +0.23 | | temelidir | −0.28 |

	![XAI word heatmap — boundary-error case](xai_boundary.png)

	The boundary case is the most informative: the words that pushed a grade-6 text up into grade 7 are long, morphologically heavy Turkish words (alışkanlıklarımızda, yaşamımızdaki, cildimizin). This is consistent with the model reading morphological and lexical density as a difficulty signal, and its single-grade error falls on exactly the kind of neighbouring boundary where the difficulty distinction is genuinely soft.