---
language: tr
license: apache-2.0
tags:
- text-classification
- educational-content
- turkish
- fineweb-edu
- qwen3
datasets:
- YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: Qwen/Qwen3-0.6B-Base
pipeline_tag: text-classification
---
# TurkWeb-Edu Classifier V3 🇹🇷
A **Turkish educational content classifier** that predicts educational quality scores (0-5) for Turkish web text.
This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
## Model Details
| Component | Details |
|---|---|
| **Base Model** | `Qwen/Qwen3-0.6B-Base` |
| **Architecture** | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
| **Teacher Model** | `Qwen/Qwen3-30B-A3B-Instruct-2507` |
| **Training Data** | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| **Task** | Regression (0-5 educational quality score) |
| **Language** | Turkish (tur_Latn) |
## Scoring Rubric
| Score | Meaning |
|---|---|
| 0 | **Not Educational** — Spam, ads, NSFW, navigation-only |
| 1 | **Low Quality** — Personal chat, forum posts, low-quality news |
| 2 | **Medium** — General culture, blog, opinion pieces |
| 3 | **Educational** — Encyclopedic, how-to guides, concept explanations |
| 4 | **High Quality** — Well-structured, high pedagogical value, technical |
| 5 | **Academic** — Textbook quality, sourced, in-depth analysis |
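The rubric above can be encoded as a small helper that clamps a raw regression output to the 0-5 range and maps it to its level label. This is a minimal sketch; the function name and English label strings below are illustrative and not part of the model's API:

```python
# Hypothetical helper mirroring the rubric table; not shipped with the model.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}

def score_to_label(raw_score: float) -> tuple[int, str]:
    """Clamp a raw regression output to [0, 5], round to the
    nearest integer, and look up the rubric label."""
    clamped = max(0.0, min(raw_score, 5.0))
    level = int(round(clamped))
    return level, RUBRIC[level]

print(score_to_label(3.7))   # a raw score of 3.7 rounds to level 4
print(score_to_label(-0.4))  # out-of-range outputs are clamped to 0
```

The clamp-then-round step matches the `int(round(max(0, min(score, 5))))` expression used in the usage snippet below.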
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic inference
text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Score: {score:.2f}")
print(f"Int Score: {int(round(max(0, min(score, 5))))}")
# Expected: High score (4-5) for this educational text about photosynthesis
```
## Evaluation
| Metric | Value |
|---|---|
| MSE | 1.1642 |
| RMSE | 1.0790 |
| MAE | 0.8374 |
| F1 (edu≥3) | 0.7147 |
| F1 (weighted) | 0.3956 |
| Accuracy | 0.3769 |
## Training Pipeline
1. **Teacher Annotation**: Qwen3-30B-A3B annotated 840K Turkish web samples from FineWeb-2 (tur_Latn)
2. **Deduplication**: SHA256 text dedup → 660K unique samples
3. **Student Training**: Qwen3-0.6B-Base + LoRA (r=16) fine-tuned for 3 epochs
4. **Merging**: LoRA weights merged into base model for efficient inference
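Step 2 of the pipeline (SHA-256 text deduplication) can be sketched as follows. The exact text normalization used before hashing is not documented, so this snippet simply hashes the raw UTF-8 bytes:

```python
import hashlib

def dedup_by_sha256(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each exact-duplicate text,
    keyed by the SHA-256 digest of its UTF-8 bytes."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["merhaba dünya", "fotosentez nedir", "merhaba dünya"]
print(len(dedup_by_sha256(docs)))  # → 2
```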
## Recommended Threshold
For filtering educational Turkish content, use `score >= 3` (following the FineWeb-Edu methodology).
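Applied to a batch of (text, score) pairs, this filter keeps only texts at or above the threshold. The scores in the example below are illustrative; in practice they come from the classifier's regression head:

```python
EDU_THRESHOLD = 3.0  # FineWeb-Edu methodology: keep score >= 3

def filter_educational(scored_texts, threshold=EDU_THRESHOLD):
    """Keep texts whose predicted quality score meets the threshold."""
    return [text for text, score in scored_texts if score >= threshold]

# Illustrative scores; real values come from the classifier above.
scored = [
    ("Fotosentez, biyokimyasal bir süreçtir...", 4.3),
    ("Forumdan merhaba, nasılsınız?", 1.2),
    ("Ansiklopedik bir madde özeti.", 3.1),
]
print(filter_educational(scored))  # keeps the two texts scoring >= 3
```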