---
language: tr
license: apache-2.0
tags:
  - text-classification
  - educational-content
  - turkish
  - fineweb-edu
  - qwen3
datasets:
  - YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: Qwen/Qwen3-0.6B-Base
pipeline_tag: text-classification
---

# TurkWeb-Edu Classifier V3 🇹🇷

A **Turkish educational content classifier** that predicts educational quality scores (0-5) for Turkish web text.
This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).

## Model Details

| Component | Details |
|---|---|
| **Base Model** | `Qwen/Qwen3-0.6B-Base` |
| **Architecture** | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
| **Teacher Model** | `Qwen/Qwen3-30B-A3B-Instruct-2507` |
| **Training Data** | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| **Task** | Regression (0-5 educational quality score) |
| **Language** | Turkish (tur_Latn) |

## Scoring Rubric

| Score | Meaning |
|---|---|
| 0 | **Not Educational** — Spam, ads, NSFW, navigation-only |
| 1 | **Low Quality** — Personal chat, forum posts, low-quality news |
| 2 | **Medium** — General culture, blog, opinion pieces |
| 3 | **Educational** — Encyclopedic, how-to guides, concept explanations |
| 4 | **High Quality** — Well-structured, high pedagogical value, technical |
| 5 | **Academic** — Textbook quality, sourced, in-depth analysis |
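
As a minimal sketch, the rubric can be attached to a raw model output by clamping it to the 0-5 range and rounding. The label strings are copied from the table above; the helper name `score_to_label` is hypothetical and not part of the released model.

```python
from typing import Tuple

# Labels copied from the rubric table; the mapping itself is illustrative.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def score_to_label(score: float) -> Tuple[int, str]:
    """Clamp a raw regression output to [0, 5], round it, and look up its label."""
    clamped = max(0.0, min(float(score), 5.0))
    int_score = int(round(clamped))
    return int_score, RUBRIC[int_score]


print(score_to_label(3.7))   # (4, 'High Quality')
print(score_to_label(-0.4))  # (0, 'Not Educational')
```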

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

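# Tokenize, truncating inputs longer than 512 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    # The regression head emits a single logit: the raw educational score
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
# Clamp to the 0-5 rubric range, then round to the nearest integer score
print(f"Int Score: {int(round(max(0, min(score, 5))))}")
# Expected: High score (4-5) for this educational text about photosynthesis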
```
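
For scoring many documents at once, a batched variant of the snippet above might look like the following sketch. It assumes the model exposes a single regression logit (as the single-text example implies) and that the tokenizer defines a padding token; the function names `score_batch` and `to_int_scores` are hypothetical.

```python
def to_int_scores(raw_scores):
    """Clamp raw scores to [0, 5] and round them to integer rubric scores."""
    return [int(round(max(0.0, min(s, 5.0)))) for s in raw_scores]


def score_batch(texts, model_name="YsK-dev/TurkWeb-Edu-Classifier-V3", max_length=512):
    """Return one raw score per input text (sketch; downloads the model on first use)."""
    # Imported lazily so the pure helper above works without these packages installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Assumption: batched sequence classification needs a pad token id in the
    # model config; fall back to the tokenizer's pad token if it is unset.
    if model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id

    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # assumed shape: (batch_size, 1)
    return logits.squeeze(-1).tolist()
```

Calling `to_int_scores(score_batch(list_of_texts))` would then yield one integer score per document. For large corpora, the model and tokenizer should be loaded once and reused rather than reloaded on every call.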

## Evaluation

| Metric | Value |
|---|---|
| MSE | 1.1642 |
| RMSE | 1.0790 |
| MAE | 0.8374 |
| F1 (edu≥3) | 0.7147 |
| F1 (weighted) | 0.3956 |
| Accuracy | 0.3769 |
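
The sketch below shows one way metrics of this kind could be computed from teacher scores and model predictions. It is not the evaluation script used for this model: the toy data is invented for illustration, and binarizing the rounded prediction at 3 for the `edu≥3` F1 is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Mean squared error, its square root, and mean absolute error."""
    n = len(y_true)
    mse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def binary_f1(y_true, y_pred, threshold=3):
    """F1 for the 'educational' class after rounding predictions (assumed protocol)."""
    rounded = [int(round(max(0.0, min(p, 5.0)))) for p in y_pred]
    tp = sum(1 for t, p in zip(y_true, rounded) if t >= threshold and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, rounded) if t < threshold and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, rounded) if t >= threshold and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, made up for illustration only.
teacher = [0, 1, 3, 4, 5, 2]
student = [0.4, 1.9, 2.6, 3.8, 4.1, 3.2]

print(regression_metrics(teacher, student))
print(binary_f1(teacher, student))
```

As a consistency check on the table, the reported RMSE of 1.0790 matches the square root of the reported MSE of 1.1642 to four decimal places.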

## Training Pipeline

1. **Teacher Annotation**: `Qwen/Qwen3-30B-A3B-Instruct-2507` annotated 840K Turkish web samples from FineWeb-2 (`tur_Latn`)
2. **Deduplication**: SHA256 text dedup → 660K unique samples
3. **Student Training**: Qwen3-0.6B-Base + LoRA (r=16) fine-tuned for 3 epochs
4. **Merging**: LoRA weights merged into base model for efficient inference
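
The deduplication step (step 2) can be sketched as exact-match filtering on SHA256 digests. This assumes the hash is taken over the UTF-8 bytes of the unmodified text and that the first occurrence is kept; the actual pipeline may normalize text differently. The function name `dedup_by_sha256` is hypothetical.

```python
import hashlib


def dedup_by_sha256(texts):
    """Keep the first occurrence of each text, comparing by SHA256 digest."""
    seen = set()
    unique = []
    for text in texts:
        # Assumption: no whitespace or case normalization before hashing.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


samples = ["Merhaba dünya", "Fotosentez nedir?", "Merhaba dünya"]
print(dedup_by_sha256(samples))  # ['Merhaba dünya', 'Fotosentez nedir?']
```

Storing digests instead of full texts keeps the memory footprint of the `seen` set fixed per document, which matters at the scale of hundreds of thousands of samples.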

## Recommended Threshold

For filtering educational Turkish content, use `score >= 3` (following the FineWeb-Edu methodology).
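
A minimal filtering sketch under this threshold follows. The record layout (dicts with `text` and `score` keys) and the `use_int_score` switch are assumptions for illustration. The switch exists because thresholding the raw score and thresholding the rounded score give different results for values between 2.5 and 3.

```python
def filter_educational(scored_docs, threshold=3, use_int_score=False):
    """Keep documents whose (optionally rounded) score meets the threshold."""
    kept = []
    for doc in scored_docs:
        score = doc["score"]
        if use_int_score:
            # Clamp to the rubric range and round before comparing.
            score = int(round(max(0.0, min(score, 5.0))))
        if score >= threshold:
            kept.append(doc)
    return kept


# Hypothetical scored records, not real model outputs.
docs = [
    {"text": "a", "score": 3.4},
    {"text": "b", "score": 2.6},
    {"text": "c", "score": 0.2},
]
print([d["text"] for d in filter_educational(docs)])                      # ['a']
print([d["text"] for d in filter_educational(docs, use_int_score=True)])  # ['a', 'b']
```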