---
language: tr
license: apache-2.0
tags:
- text-classification
- educational-content
- turkish
- fineweb-edu
- encoder
- regression
datasets:
- YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: boun-tabilab/TabiBERT
pipeline_tag: text-classification
---
# TurkWeb-Edu Classifier V4 🇹🇷
An encoder-based classifier for Turkish educational content. It predicts an educational quality score (0-5) for Turkish web text.
**This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).**
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the regression model from the Hub
model_name = "YsK-dev/TurkWeb-Edu-Classifier-V4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

# Inference only: disable gradient tracking
with torch.no_grad():
    # The model emits a single regression logit; read it as a Python float
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Educational: {score >= 3}")
```
## Model Details
| Component | Details |
|-----------|---------|
| Base Model | `boun-tabilab/TabiBERT` |
| Architecture | Encoder + Regression Head |
| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| Teacher | Qwen3-30B-A3B-Instruct-2507 |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |
## Evaluation
| Metric | Value |
|--------|-------|
| Pearson | 0.8312 |
| RMSE | 0.8874 |
| MAE | 0.6416 |
| F1 (edu≥3) | 0.7197 |
| Exact Accuracy | 0.5044 |
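
The card does not state how these metrics were computed. As a rough sketch only, the function below shows one plausible way to derive the same five numbers from model predictions and integer teacher labels. It assumes that F1 is the binary F1 after thresholding both predictions and labels at 3, and that exact accuracy compares the prediction (clamped to 0-5 and rounded half-up) with the label. The name `regression_report` and the toy data are hypothetical.

```python
import math

def regression_report(preds, labels, threshold=3):
    """Compute Pearson, RMSE, MAE, binary F1 and exact accuracy (assumed definitions)."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_y = sum(labels) / n

    # Pearson correlation between raw predictions and labels
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    pearson = cov / math.sqrt(var_p * var_y)

    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / n)
    mae = sum(abs(p - y) for p, y in zip(preds, labels)) / n

    # Assumption: F1 is computed on the binarized task "score >= threshold"
    tp = sum(p >= threshold and y >= threshold for p, y in zip(preds, labels))
    fp = sum(p >= threshold and y < threshold for p, y in zip(preds, labels))
    fn = sum(p < threshold and y >= threshold for p, y in zip(preds, labels))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    # Assumption: exact accuracy clamps to [0, 5] and rounds half-up before comparing
    rounded = [int(min(5.0, max(0.0, p)) + 0.5) for p in preds]
    exact = sum(r == y for r, y in zip(rounded, labels)) / n

    return {"pearson": pearson, "rmse": rmse, "mae": mae, "f1": f1, "exact": exact}


# Toy values for illustration only; these are not the evaluation data.
report = regression_report([0.2, 1.4, 2.6, 3.1, 4.8], [0, 1, 2, 3, 5])
```

If the original evaluation used different definitions (for example macro F1 or a different rounding rule), the numbers in the table would not be reproduced by this sketch.
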
## Scoring Rubric
| Score | Meaning |
|-------|---------|
| 0 | Not Educational — Spam, ads, NSFW, navigation-only |
| 1 | Low Quality — Personal chat, forum posts, low-quality news |
| 2 | Medium — General culture, blog, opinion pieces |
| 3 | Educational — Encyclopedic, how-to guides, concept explanations |
| 4 | High Quality — Well-structured, high pedagogical value, technical |
| 5 | Academic — Textbook quality, sourced, in-depth analysis |
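
Because the model outputs a continuous value, it can be convenient to map a score back to a rubric row. The sketch below is one way to do that, assuming the score is clamped to the 0-5 range and rounded half-up to the nearest integer; the card does not prescribe a bucketing rule, and `score_to_label` is a hypothetical helper.

```python
from typing import Tuple

# Short labels copied from the rubric table above
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}

def score_to_label(score: float) -> Tuple[int, str]:
    """Clamp a raw score to [0, 5], round half-up, and look up the rubric label."""
    # Assumption: nearest-integer bucketing; not specified by the model card
    bucket = int(min(5.0, max(0.0, score)) + 0.5)
    return bucket, RUBRIC[bucket]
```

For example, `score_to_label(3.4)` falls in the "Educational" row, while any negative output is clamped into "Not Educational".
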
## Recommended Threshold
For filtering educational Turkish content, use `score >= 3` (following FineWeb-Edu methodology).
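
A minimal sketch of applying this threshold to a collection of documents follows. `filter_educational` is a hypothetical helper, and `score_fn` stands for any callable that returns the model's score for one text (for instance, a thin wrapper around the snippet in the Usage section). The scores in the example are stand-ins, not real model outputs.

```python
from typing import Callable, Iterable, List, Tuple

def filter_educational(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score meets or exceeds the threshold."""
    kept = []
    for text in texts:
        score = score_fn(text)
        if score >= threshold:
            kept.append((text, score))
    return kept


# Stand-in scores for illustration only; these are NOT real model outputs.
fake_scores = {"ders notu": 3.6, "reklam": 0.4, "blog yazısı": 2.9}
kept = filter_educational(fake_scores, fake_scores.get)
```

Passing the scorer as a parameter keeps the filtering logic independent of how scores are produced, so the same function works with batched inference or cached scores.
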