---
language: tr
license: apache-2.0
tags:
  - text-classification
  - educational-content
  - turkish
  - fineweb-edu
  - qwen3
datasets:
  - YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: Qwen/Qwen3-0.6B-Base
pipeline_tag: text-classification
---

# TurkWeb-Edu Classifier V3 🇹🇷

A **Turkish educational content classifier** that predicts an educational quality score from 0 to 5 for Turkish web text.
It is the Turkish counterpart of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).

## Model Details

| Component | Details |
|---|---|
| **Base Model** | `Qwen/Qwen3-0.6B-Base` |
| **Architecture** | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
| **Teacher Model** | `Qwen/Qwen3-30B-A3B-Instruct-2507` |
| **Training Data** | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| **Task** | Regression (0-5 educational quality score) |
| **Language** | Turkish (tur_Latn) |

## Scoring Rubric

| Score | Meaning |
|---|---|
| 0 | **Not Educational** — Spam, ads, NSFW, navigation-only |
| 1 | **Low Quality** — Personal chat, forum posts, low-quality news |
| 2 | **Medium** — General culture, blog, opinion pieces |
| 3 | **Educational** — Encyclopedic, how-to guides, concept explanations |
| 4 | **High Quality** — Well-structured, high pedagogical value, technical |
| 5 | **Academic** — Textbook quality, sourced, in-depth analysis |
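
The model outputs a continuous score, so mapping it back to a rubric row takes a clamp and a round. The snippet below is a minimal sketch of that mapping; the names `RUBRIC`, `to_int_score`, and `score_to_label` are illustrative and not part of the released model.

```python
# Rubric labels copied from the table above (names are illustrative).
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def to_int_score(score: float) -> int:
    """Clamp a raw regression output to [0, 5] and round it to an integer.

    Note: Python's round() sends exact halves to the nearest even integer,
    so 2.5 becomes 2 while 3.5 becomes 4.
    """
    return int(round(max(0.0, min(score, 5.0))))


def score_to_label(score: float) -> str:
    """Return the rubric label for a raw model score."""
    return RUBRIC[to_int_score(score)]


if __name__ == "__main__":
    for raw in (-0.7, 1.2, 3.4, 7.2):
        print(f"{raw:+.1f} -> {to_int_score(raw)} ({score_to_label(raw)})")
```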

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

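# Tokenize, truncating long documents to the first 512 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    # The regression head emits a single logit: the raw (unclamped) score
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
# Clamp to the rubric range [0, 5], then round to the nearest integer label
print(f"Int Score: {int(round(max(0, min(score, 5))))}")
# Expected: High score (4-5) for this educational text about photosynthesis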
```

## Evaluation

| Metric | Value |
|---|---|
| MSE | 1.1642 |
| RMSE | 1.0790 |
| MAE | 0.8374 |
| F1 (edu≥3) | 0.7147 |
| F1 (weighted) | 0.3956 |
| Accuracy | 0.3769 |
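
RMSE is the square root of MSE, which matches the table (√1.1642 ≈ 1.0790). The card does not describe the evaluation script, so the following is only a sketch of how metrics of this kind are commonly computed. It assumes accuracy and the `edu≥3` F1 are taken on clamped, rounded integer scores; weighted F1 is omitted for brevity, and all function names are hypothetical.

```python
import math


def clamp_round(score: float) -> int:
    """Clamp to [0, 5] and round, mirroring the usage snippet."""
    return int(round(max(0.0, min(score, 5.0))))


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and MAE on the raw (unrounded) predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def accuracy(y_true, y_pred):
    """Exact-match accuracy after rounding predictions (assumption)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if clamp_round(p) == t)
    return hits / len(y_true)


def binary_f1(y_true, y_pred, threshold=3):
    """F1 for the 'educational' class: integer score >= threshold (assumption)."""
    gold = [t >= threshold for t in y_true]
    pred = [clamp_round(p) >= threshold for p in y_pred]
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the evaluation set.
    gold_scores = [0, 2, 3, 5]
    model_scores = [0.4, 2.9, 3.2, 3.6]
    print(regression_metrics(gold_scores, model_scores))
    print("accuracy:", accuracy(gold_scores, model_scores))
    print("f1 (edu>=3):", binary_f1(gold_scores, model_scores))
```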

## Training Pipeline

1. **Teacher Annotation**: `Qwen/Qwen3-30B-A3B-Instruct-2507` annotated 840K Turkish web samples from FineWeb-2 (tur_Latn)
2. **Deduplication**: SHA-256 hashing of the text removed duplicates, reducing 840K samples to 660K unique samples
3. **Student Training**: Qwen3-0.6B-Base + LoRA (r=16) fine-tuned for 3 epochs
4. **Merging**: LoRA weights merged into base model for efficient inference
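
The deduplication step can be sketched as follows. This is an illustration rather than the pipeline's actual code: it assumes hashing is applied to the raw UTF-8 text with no normalization and that the first occurrence of each text is kept. The function name `dedup_by_sha256` is hypothetical.

```python
import hashlib


def dedup_by_sha256(texts):
    """Keep the first occurrence of each text, comparing SHA-256 digests.

    Assumption: texts are hashed as-is (no lowercasing or whitespace
    normalization), so only byte-identical duplicates are dropped.
    """
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    samples = ["Merhaba dünya", "Fotosentez nedir?", "Merhaba dünya"]
    print(dedup_by_sha256(samples))
```

Storing digests instead of full texts keeps the memory footprint of the `seen` set fixed per document, which matters at the scale of several hundred thousand samples.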

## Recommended Threshold

For filtering educational Turkish content, use `score >= 3` (following the FineWeb-Edu methodology).
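
A minimal sketch of applying this threshold to already-scored documents is shown below. The record layout (dicts with `text` and `score` keys) and the function name `filter_educational` are assumptions made for illustration. The comparison here uses the raw score; comparing the rounded integer score instead would also admit raw scores slightly below 3.

```python
def filter_educational(scored_docs, threshold=3.0):
    """Return the documents whose raw model score meets the threshold.

    scored_docs: iterable of dicts with 'text' and 'score' keys
    (a hypothetical layout, not a format defined by the model card).
    """
    return [doc for doc in scored_docs if doc["score"] >= threshold]


if __name__ == "__main__":
    # Scores below are made up for illustration, not real model outputs.
    docs = [
        {"text": "Kampanya! Hemen tıklayın.", "score": 0.3},
        {"text": "Fotosentez, bitkilerin ışık enerjisini kullandığı süreçtir.", "score": 3.8},
        {"text": "Bugün hava çok güzeldi.", "score": 1.4},
    ]
    for doc in filter_educational(docs):
        print(f"{doc['score']:.1f}  {doc['text']}")
```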