---
language: tr
license: apache-2.0
tags:
  - text-classification
  - educational-content
  - turkish
  - fineweb-edu
  - encoder
  - regression
datasets:
  - YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: boun-tabilab/TabiBERT
pipeline_tag: text-classification
---

# TurkWeb-Edu Classifier V4 🇹🇷

A fast, accurate classifier for Turkish educational content. It predicts an educational quality score (0-5) for Turkish web text.

**This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).**
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YsK-dev/TurkWeb-Edu-Classifier-V5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

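# Tokenize, truncating long documents to at most 1024 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    # The regression head returns a single logit: the predicted score
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Educational: {score >= 3}")  # recommended filtering threshold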
```

## Model Details

| Component | Details |
|-----------|---------|
| Base Model | `boun-tabilab/TabiBERT` |
| Architecture | Encoder + Regression Head |
| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| Teacher | Qwen3-30B-A3B-Instruct-2507 |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |

## Evaluation

| Metric | Value |
|--------|-------|
| Pearson | 0.8407 |
| RMSE | 0.8725 |
| MAE | 0.6240 |
| F1 (edu≥3) | 0.7221 |
| Exact Accuracy | 0.5152 |
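
For readers who want to compute comparable numbers on their own held-out set, here is a minimal sketch of the five metrics in plain Python. It is not the evaluation script used for this model. The function name `regression_report` is hypothetical, and two details are assumptions: the F1 score binarizes the raw predicted score at `>= 3`, and exact accuracy rounds each prediction to the nearest integer and clips it to 0-5 before comparing with the label.

```python
import math

def regression_report(preds, labels, threshold=3.0):
    """Compute Pearson, RMSE, MAE, binary F1 and exact accuracy.

    preds: predicted scores (floats); labels: integer rubric scores 0-5.
    Assumption: F1 thresholds the raw score; exact accuracy rounds and clips.
    """
    n = len(preds)
    if n == 0 or n != len(labels):
        raise ValueError("preds and labels must be non-empty and the same length")

    mean_p = sum(preds) / n
    mean_l = sum(labels) / n
    cov = sum((p - mean_p) * (l - mean_l) for p, l in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    denom = math.sqrt(var_p * var_l)
    pearson = cov / denom if denom else float("nan")

    rmse = math.sqrt(sum((p - l) ** 2 for p, l in zip(preds, labels)) / n)
    mae = sum(abs(p - l) for p, l in zip(preds, labels)) / n

    # Binary "educational" decision on predictions and labels
    tp = sum(p >= threshold and l >= threshold for p, l in zip(preds, labels))
    fp = sum(p >= threshold and l < threshold for p, l in zip(preds, labels))
    fn = sum(p < threshold and l >= threshold for p, l in zip(preds, labels))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    # Round half up, then clip to the 0-5 rubric range
    rounded = [min(5, max(0, math.floor(p + 0.5))) for p in preds]
    exact = sum(r == l for r, l in zip(rounded, labels)) / n

    return {"pearson": pearson, "rmse": rmse, "mae": mae,
            "f1": f1, "exact_accuracy": exact}

report = regression_report([0.2, 2.9, 3.4, 4.6], [0, 3, 3, 5])
print({k: round(v, 4) for k, v in report.items()})
```

If the binary decision is instead taken on rounded scores, the F1 value can differ for predictions between 2.5 and 3.0.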

## Scoring Rubric

| Score | Meaning |
|-------|---------|
| 0 | Not Educational — Spam, ads, NSFW, navigation-only |
| 1 | Low Quality — Personal chat, forum posts, low-quality news |
| 2 | Medium — General culture, blog, opinion pieces |
| 3 | Educational — Encyclopedic, how-to guides, concept explanations |
| 4 | High Quality — Well-structured, high pedagogical value, technical |
| 5 | Academic — Textbook quality, sourced, in-depth analysis |
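
Because the model outputs a continuous value, it can be useful to map a prediction back onto the rubric. The sketch below rounds to the nearest integer and clips to the 0-5 range. The helper name `rubric_label` and the rounding rule are assumptions for illustration, not part of the released model.

```python
import math

# Short names taken from the rubric table above
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}

def rubric_label(score):
    """Map a continuous score to (level, name) by rounding and clipping."""
    level = min(5, max(0, math.floor(score + 0.5)))  # round half up, clip to 0-5
    return level, RUBRIC[level]

for s in (-0.3, 2.5, 3.4, 5.7):
    print(s, rubric_label(s))
```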

## Recommended Threshold

For filtering educational Turkish content, use `score >= 3` (following FineWeb-Edu methodology).
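
A minimal sketch of applying this threshold to a collection of documents follows. The names `filter_educational` and `score_fn` are hypothetical: `score_fn` stands for any callable that returns the model's score for one text, for example a small wrapper around the tokenizer and model calls from the Usage section. The toy scorer at the bottom is a stand-in so the snippet runs without downloading the model.

```python
from typing import Callable, Iterable, Iterator, Tuple

def filter_educational(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,
) -> Iterator[Tuple[str, float]]:
    """Yield (text, score) for every text scoring at or above the threshold."""
    for text in texts:
        score = score_fn(text)
        if score >= threshold:
            yield text, score

# Toy stand-in scorer (NOT the model): fixed scores for two sample texts
toy_scores = {
    "Fotosentez, bitkilerin besin ürettiği süreçtir.": 3.6,
    "Hemen tıkla, büyük indirim!": 0.4,
}
kept = list(filter_educational(toy_scores, toy_scores.get))
print(kept)
```

Writing the filter as a generator keeps memory use flat when streaming a large crawl; the threshold is a parameter so stricter cut-offs such as `4.0` can be tried without changing the code.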