+---
+language: tr
+license: apache-2.0
+tags:
+  - text-classification
+  - educational-content
+  - turkish
+  - fineweb-edu
+  - encoder
+  - regression
+datasets:
+  - YsK-dev/TurkWeb-Edu-AnnotationsV3
+base_model: boun-tabilab/TabiBERT
+pipeline_tag: text-classification
+---
+# TurkWeb-Edu Classifier V5 🇹🇷
+An encoder-based classifier that predicts an educational quality score (0-5) for Turkish web text.
+**This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).**
+Change in this version: the training learning rate was adjusted.
+## Usage
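+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+model_name = "YsK-dev/TurkWeb-Edu-Classifier-V5"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+# English: "Photosynthesis is a biochemical process in which plants use sunlight
+# to convert carbon dioxide and water into glucose and oxygen."
+text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."
+
+# Truncate long documents to at most 1024 tokens.
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
+
+# The regression head returns a single value: the predicted quality score.
+with torch.no_grad():
+    score = model(**inputs).logits.squeeze().item()
+
+print(f"Score: {score:.2f}")
+print(f"Educational: {score >= 3}")  # 3 is the recommended filtering threshold
+```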
+## Model Details
+| Component | Details |
+|-----------|---------|
+| Base Model | `boun-tabilab/TabiBERT` |
+| Architecture | Encoder + Regression Head |
+| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
+| Teacher | Qwen3-30B-A3B-Instruct-2507 |
+| Task | Regression (0-5 educational quality score) |
+| Language | Turkish (tur_Latn) |
+## Evaluation
+| Metric | Value |
+|--------|-------|
+| Pearson | 0.8407 |
+| RMSE | 0.8725 |
+| MAE | 0.6240 |
+| F1 (edu≥3) | 0.7221 |
+| Exact Accuracy | 0.5152 |
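The card does not state how these metrics were computed. The sketch below shows one plausible way to reproduce them from model predictions and reference scores, using only the standard library. It assumes that F1 binarizes both predictions and labels at 3, and that exact accuracy compares predictions clipped to [0, 5] and rounded to the nearest integer against integer labels. Both conventions, and all function names, are assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Pearson correlation between predictions and reference scores."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_l = sum(labels) / n
    cov = sum((p - mean_p) * (l - mean_l) for p, l in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov / math.sqrt(var_p * var_l)


def rmse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(preds))


def mae(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - l) for p, l in zip(preds, labels)) / len(preds)


def f1_at_threshold(preds: Sequence[float], labels: Sequence[float],
                    threshold: float = 3.0) -> float:
    """F1 for the binary task 'score >= threshold' (assumed convention)."""
    pairs = list(zip(preds, labels))
    tp = sum(p >= threshold and l >= threshold for p, l in pairs)
    fp = sum(p >= threshold and l < threshold for p, l in pairs)
    fn = sum(p < threshold and l >= threshold for p, l in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_accuracy(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Share of predictions whose clipped, rounded value equals the label (assumed convention)."""
    hits = sum(int(round(min(max(p, 0.0), 5.0))) == l for p, l in zip(preds, labels))
    return hits / len(preds)


# Toy values for illustration only; these are not the model's real outputs.
toy_preds = [0.2, 1.4, 2.6, 3.1, 4.8]
toy_labels = [0, 1, 3, 3, 5]
print(f"Pearson: {pearson(toy_preds, toy_labels):.4f}")
print(f"RMSE: {rmse(toy_preds, toy_labels):.4f}")
print(f"MAE: {mae(toy_preds, toy_labels):.4f}")
print(f"F1 (edu>=3): {f1_at_threshold(toy_preds, toy_labels):.4f}")
print(f"Exact accuracy: {exact_accuracy(toy_preds, toy_labels):.4f}")
```

Note that thresholding the raw score and thresholding the rounded score give different F1 values for predictions between 2.5 and 3.0, so the choice matters when comparing against the table.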
+## Scoring Rubric
+| Score | Meaning |
+|-------|---------|
+| 0 | Not Educational — Spam, ads, NSFW, navigation-only |
+| 1 | Low Quality — Personal chat, forum posts, low-quality news |
+| 2 | Medium — General culture, blog, opinion pieces |
+| 3 | Educational — Encyclopedic, how-to guides, concept explanations |
+| 4 | High Quality — Well-structured, high pedagogical value, technical |
+| 5 | Academic — Textbook quality, sourced, in-depth analysis |
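Because the model outputs a continuous value, mapping a prediction back to a rubric row needs a rounding rule. A minimal sketch, assuming the score is clipped to [0, 5] and rounded to the nearest integer; the short labels come from the table above, while `RUBRIC`, `to_rubric_level`, and `describe` are hypothetical names introduced here.

```python
# Short labels taken from the scoring rubric table.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def to_rubric_level(score: float) -> int:
    """Clip a raw regression output to [0, 5] and round to the nearest integer.

    Python's round() sends exact halves to the nearest even integer,
    so 2.5 maps to 2 and 3.5 maps to 4.
    """
    return int(round(max(0.0, min(score, 5.0))))


def describe(score: float) -> tuple:
    """Return (integer level, rubric label) for a raw score."""
    level = to_rubric_level(score)
    return level, RUBRIC[level]


for raw in (-0.3, 2.6, 3.4, 7.2):
    print(raw, describe(raw))
```

Clipping matters because a regression head is not constrained to the 0-5 range and can produce values slightly outside it.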
+## Recommended Threshold
+For filtering educational Turkish content, use `score >= 3` (following FineWeb-Edu methodology).
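As an illustration of the threshold in use, the sketch below filters a stream of documents. It takes the scoring step as a parameter: `score_fn` is a hypothetical callable standing in for the model call from the Usage section, so the example runs without downloading the model. `filter_educational` is likewise a name invented for this sketch.

```python
from typing import Callable, Iterable, Iterator, Tuple


def filter_educational(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,
) -> Iterator[Tuple[str, float]]:
    """Yield (document, score) pairs whose score meets the threshold."""
    for doc in docs:
        score = score_fn(doc)
        if score >= threshold:
            yield doc, score


# Stand-in scores for illustration only; a real pipeline would call the model.
fake_scores = {
    "Fotosentez nedir?": 3.6,
    "Büyük indirim! Hemen tıkla.": 0.2,
    "Türev kuralları ve örnekler": 4.1,
}

for doc, score in filter_educational(fake_scores, fake_scores.get):
    print(f"{score:.1f}  {doc}")
```

Raising `threshold` trades recall for precision; the value of 3 is the one recommended above.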