Upload README.md with huggingface_hub
README.md
ADDED
---
language: tr
license: apache-2.0
tags:
- text-classification
- educational-content
- turkish
- fineweb-edu
- encoder
- regression
datasets:
- YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: boun-tabilab/TabiBERT
pipeline_tag: text-classification
---

# TurkWeb-Edu Classifier V4 🇹🇷

A fast, accurate classifier for Turkish educational content. It predicts an educational quality score (0-5) for Turkish web text.

**This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).**

Change note: the learning rate (lr) was changed.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the regression model from the Hub.
model_name = "YsK-dev/TurkWeb-Edu-Classifier-V5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

# Tokenize a single document, truncating inputs longer than 1024 tokens.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    # The head emits one logit per document; read it as the raw 0-5 score.
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Educational: {score >= 3}")
```

## Model Details

| Component | Details |
|-----------|---------|
| Base Model | `boun-tabilab/TabiBERT` |
| Architecture | Encoder + Regression Head |
| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| Teacher | Qwen3-30B-A3B-Instruct-2507 |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |

## Evaluation

| Metric | Value |
|--------|-------|
| Pearson | 0.8407 |
| RMSE | 0.8725 |
| MAE | 0.6240 |
| F1 (edu≥3) | 0.7221 |
| Exact Accuracy | 0.5152 |

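The metrics above could be computed along the following lines. This is a minimal sketch, not the author's evaluation script: the function name `evaluate` is hypothetical, and it assumes that "Exact Accuracy" compares predictions rounded to the nearest integer (clamped to 0-5) against integer gold labels, and that F1 is taken on the binary "score ≥ 3" decision after the same rounding. The README does not state either detail.

```python
import math


def evaluate(preds, golds, threshold=3):
    """Regression and thresholded classification metrics (hypothetical helper).

    preds: raw float scores from the model; golds: integer labels in 0..5.
    Assumes both lists are non-empty and neither is constant (Pearson is
    undefined for a constant series).
    """
    n = len(preds)
    mean_p = sum(preds) / n
    mean_g = sum(golds) / n

    # Pearson correlation between raw predictions and gold labels.
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(preds, golds))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_g = sum((g - mean_g) ** 2 for g in golds)
    pearson = cov / math.sqrt(var_p * var_g)

    # Error metrics on the raw (unrounded) predictions.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / n)
    mae = sum(abs(p - g) for p, g in zip(preds, golds)) / n

    # Assumption: round to the nearest integer and clamp to the 0-5 rubric.
    # Python's round() sends exact .5 ties to the nearest even integer.
    rounded = [min(5, max(0, round(p))) for p in preds]
    exact = sum(r == g for r, g in zip(rounded, golds)) / n

    # Binary F1 for the "educational" class (score >= threshold).
    tp = sum(r >= threshold and g >= threshold for r, g in zip(rounded, golds))
    fp = sum(r >= threshold and g < threshold for r, g in zip(rounded, golds))
    fn = sum(r < threshold and g >= threshold for r, g in zip(rounded, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"pearson": pearson, "rmse": rmse, "mae": mae,
            "f1": f1, "exact_accuracy": exact}
```

Calling `evaluate(model_scores, gold_labels)` on a held-out split returns a dictionary with the same five metric names as the table; the numbers will only match the table if the assumptions above agree with how the reported values were produced.
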
## Scoring Rubric

| Score | Meaning |
|-------|---------|
| 0 | Not Educational: spam, ads, NSFW, navigation-only |
| 1 | Low Quality: personal chat, forum posts, low-quality news |
| 2 | Medium: general culture, blog, opinion pieces |
| 3 | Educational: encyclopedic, how-to guides, concept explanations |
| 4 | High Quality: well-structured, high pedagogical value, technical |
| 5 | Academic: textbook quality, sourced, in-depth analysis |

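Because the model is a regressor, its raw output is a float that can fall between rubric levels or slightly outside 0-5. One way to map it back onto the rubric is sketched below; the `RUBRIC` dictionary and `score_to_label` helper are hypothetical names, and clamping followed by rounding is an assumed convention rather than something the README specifies.

```python
# Short names for the six rubric levels, taken from the table above.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def score_to_label(score: float) -> tuple[int, str]:
    """Clamp a raw score to [0, 5], round it, and look up the rubric name."""
    level = int(round(max(0.0, min(score, 5.0))))
    return level, RUBRIC[level]
```

For example, `score_to_label(3.4)` yields level 3 ("Educational"), and out-of-range outputs such as `-0.7` or `7.2` are pulled back to levels 0 and 5.
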
## Recommended Threshold

To filter for educational Turkish content, keep documents with `score >= 3`, following the FineWeb-Edu methodology.
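Applied to a corpus, the threshold might look like the sketch below. It assumes the scores have already been computed, one per document; `filter_educational` and its `round_first` flag are hypothetical. The flag is there because the README's usage snippet compares the raw score against 3, while rounding first would also admit scores from just above 2.5, and the README does not say which variant is intended for filtering.

```python
def filter_educational(docs, scores, threshold=3.0, round_first=False):
    """Keep the documents whose score meets the threshold.

    docs and scores are parallel sequences. With round_first=True the score
    is clamped to [0, 5] and rounded to the nearest integer before comparing.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    kept = []
    for doc, score in zip(docs, scores):
        value = round(min(5.0, max(0.0, score))) if round_first else score
        if value >= threshold:
            kept.append(doc)
    return kept
```

With scores `[2.6, 3.0, 4.5]`, the default keeps the last two documents, while `round_first=True` keeps all three, so the choice has a visible effect on how much text survives the filter.
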