---
language: tr
license: apache-2.0
tags:
  - text-classification
  - educational-content
  - turkish
  - fineweb-edu
  - qwen3
datasets:
  - YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: Qwen/Qwen3-0.6B-Base
pipeline_tag: text-classification
---

TurkWeb-Edu Classifier V3 🇹🇷

A Turkish educational-content classifier that predicts an educational quality score (0-5) for Turkish web text. It is intended as a Turkish counterpart to the English-language HuggingFaceFW/fineweb-edu-classifier.

Model Details

| Component | Details |
|---|---|
| Base Model | `Qwen/Qwen3-0.6B-Base` |
| Architecture | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
| Teacher Model | `Qwen/Qwen3-30B-A3B-Instruct-2507` |
| Training Data | YsK-dev/TurkWeb-Edu-AnnotationsV3 (660K samples) |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |

Scoring Rubric

| Score | Label | Typical content |
|---|---|---|
| 0 | Not Educational | Spam, ads, NSFW, navigation-only |
| 1 | Low Quality | Personal chat, forum posts, low-quality news |
| 2 | Medium | General culture, blog, opinion pieces |
| 3 | Educational | Encyclopedic, how-to guides, concept explanations |
| 4 | High Quality | Well-structured, high pedagogical value, technical |
| 5 | Academic | Textbook quality, sourced, in-depth analysis |
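
The rubric can be mirrored in code when turning a raw regression output into a label. The following is a minimal sketch: the `RUBRIC` dictionary and the `score_to_label` helper are illustrative names and not part of the model's API. It uses the same clamp-then-round step as the usage snippet in this card.

```python
# Hypothetical helper (not shipped with the model): maps a raw regression
# score to the integer rubric score and its label.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def score_to_label(score: float) -> tuple[int, str]:
    """Clamp the raw score to [0, 5], round to the nearest integer, look up the label."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    int_score = int(round(max(0.0, min(score, 5.0))))
    return int_score, RUBRIC[int_score]
```

For example, `score_to_label(3.7)` returns `(4, "High Quality")`, and out-of-range outputs such as `-0.4` or `7.2` are clamped to 0 and 5 respectively.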

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Int Score: {int(round(max(0, min(score, 5))))}")
# Expected: High score (4-5) for this educational text about photosynthesis

Evaluation

| Metric | Value |
|---|---|
| MSE | 1.1642 |
| RMSE | 1.0790 |
| MAE | 0.8374 |
| F1 (edu≥3) | 0.7147 |
| F1 (weighted) | 0.3956 |
| Accuracy | 0.3769 |
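
For readers who want to reproduce this kind of table on their own labeled data, the sketch below computes the regression metrics and a binary F1 at the educational threshold. The card does not state how its metrics were computed, so the details here are assumptions: predictions are clamped and rounded to integers before being binarized at 3, and the function names are illustrative.

```python
import math


def regression_metrics(preds, labels):
    """MSE, RMSE and MAE between raw predicted scores and reference scores."""
    n = len(preds)
    mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / n
    mae = sum(abs(p - y) for p, y in zip(preds, labels)) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def binary_f1(preds, labels, threshold=3):
    """F1 for the 'educational' class (score >= threshold).

    Assumption: raw predictions are clamped to [0, 5] and rounded first.
    """
    pred_pos = [int(round(max(0, min(p, 5)))) >= threshold for p in preds]
    true_pos = [y >= threshold for y in labels]
    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(t and not p for p, t in zip(pred_pos, true_pos))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The reported RMSE is consistent with the reported MSE, since the square root of 1.1642 is approximately 1.0790.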

Training Pipeline

1. Teacher Annotation: `Qwen/Qwen3-30B-A3B-Instruct-2507` annotated 840K Turkish web samples from FineWeb-2 (tur_Latn).
2. Deduplication: duplicate texts were removed by SHA256 hash, leaving 660K unique samples.
3. Student Training: Qwen3-0.6B-Base + LoRA (r=16) was fine-tuned for 3 epochs.
4. Merging: the LoRA weights were merged into the base model for efficient inference.
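
The deduplication step can be sketched as follows. This is an illustration under stated assumptions, not the author's script: the card does not say whether text was normalized before hashing or which duplicate was kept, so this version hashes the UTF-8 text as-is and keeps the first occurrence. The `dedup_by_sha256` name and the `text` field are hypothetical.

```python
import hashlib


def dedup_by_sha256(samples, text_key="text"):
    """Keep the first sample for each distinct SHA256 digest of its text."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique
```

Hashing keeps memory use bounded by the number of distinct texts rather than their total length, which matters at the scale of hundreds of thousands of samples.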

Recommended Threshold

For filtering educational Turkish content, use score >= 3 (following the FineWeb-Edu methodology).
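
A minimal filtering sketch, assuming scores have already been computed for each document. The card does not specify whether the threshold applies to the raw or the rounded score; `keep_educational` (a hypothetical helper) defaults to the rounded integer score and exposes a flag for the raw value.

```python
def keep_educational(scored_docs, threshold=3, use_int_score=True):
    """Return the texts whose score meets the threshold.

    scored_docs: iterable of (text, raw_score) pairs.
    use_int_score: if True, clamp to [0, 5] and round before comparing
                   (an assumption; set to False to compare raw scores).
    """
    kept = []
    for text, score in scored_docs:
        value = int(round(max(0, min(score, 5)))) if use_int_score else score
        if value >= threshold:
            kept.append(text)
    return kept
```

The choice matters near the boundary: a raw score of 2.6 passes when rounded but fails when compared directly against 3.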