metadata
language: tr
license: apache-2.0
tags:
  - text-classification
  - educational-content
  - turkish
  - fineweb-edu
  - qwen3
datasets:
  - YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: Qwen/Qwen3-0.6B-Base
pipeline_tag: text-classification

TurkWeb-Edu Classifier V3 🇹🇷

A Turkish educational-content classifier that predicts a continuous educational quality score (0-5) for Turkish web text. It is intended as the Turkish counterpart to HuggingFaceFW/fineweb-edu-classifier.

Model Details

| Component | Details |
|---|---|
| Base Model | Qwen/Qwen3-0.6B-Base |
| Architecture | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
| Teacher Model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Training Data | YsK-dev/TurkWeb-Edu-AnnotationsV3 (660K samples) |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |

Scoring Rubric

| Score | Meaning |
|---|---|
| 0 | Not Educational: spam, ads, NSFW, navigation-only |
| 1 | Low Quality: personal chat, forum posts, low-quality news |
| 2 | Medium: general culture, blog, opinion pieces |
| 3 | Educational: encyclopedic, how-to guides, concept explanations |
| 4 | High Quality: well-structured, high pedagogical value, technical |
| 5 | Academic: textbook quality, sourced, in-depth analysis |
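Because the model emits a continuous value, it can help to map a raw output back to a rubric level. The sketch below is a minimal illustration, not part of the released model: the names `RUBRIC` and `to_rubric` are hypothetical, and the clamp-then-round rule is an assumption that mirrors the integer conversion in the usage snippet.

```python
from typing import Tuple

# Rubric labels taken from the scoring rubric; the dict itself is illustrative.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def to_rubric(raw_score: float) -> Tuple[int, str]:
    """Clamp a raw regression output to [0, 5], round it, and look up its label."""
    clamped = max(0.0, min(raw_score, 5.0))
    level = int(round(clamped))  # Python rounds exact halves to the nearest even integer
    return level, RUBRIC[level]


if __name__ == "__main__":
    for value in (-0.7, 3.4, 7.2):
        print(value, to_rubric(value))
```

Out-of-range predictions are possible with a regression head, so the clamp keeps the lookup from failing on values below 0 or above 5.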

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Int Score: {int(round(max(0, min(score, 5))))}")
# Expected: High score (4-5) for this educational text about photosynthesis

Evaluation

| Metric | Value |
|---|---|
| MSE | 1.1642 |
| RMSE | 1.0790 |
| MAE | 0.8374 |
| F1 (edu≥3) | 0.7147 |
| F1 (weighted) | 0.3956 |
| Accuracy | 0.3769 |
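For readers who want to reproduce metrics of this kind on their own held-out data, here is a minimal pure-Python sketch. It is not the author's evaluation script: the function names are hypothetical, the toy values are made up for illustration, and applying the edu≥3 threshold to raw (unrounded) scores is an assumption the card does not confirm.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE for paired lists of true and predicted scores."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def binary_f1(y_true, y_pred, threshold=3.0):
    """F1 for the binary task 'score >= threshold' (assumed to use raw scores)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t >= threshold and p >= threshold)
    fp = sum(1 for t, p in pairs if t < threshold and p >= threshold)
    fn = sum(1 for t, p in pairs if t >= threshold and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data, for illustration only; not drawn from the real evaluation set.
    y_true = [0, 1, 3, 4, 5]
    y_pred = [0.5, 1.0, 2.5, 4.5, 4.0]
    print(regression_metrics(y_true, y_pred))
    print(binary_f1(y_true, y_pred))
```

As a consistency check, the reported RMSE of 1.0790 matches the square root of the reported MSE of 1.1642.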

Training Pipeline

  1. Teacher Annotation: Qwen3-30B-A3B-Instruct-2507 annotated 840K Turkish web samples from FineWeb-2 (tur_Latn)
  2. Deduplication: SHA-256 hashing of the text removed duplicates, leaving 660K unique samples
  3. Student Training: Qwen3-0.6B-Base + LoRA (r=16) fine-tuned for 3 epochs
  4. Merging: LoRA weights merged into base model for efficient inference
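The deduplication step in the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the author's script: the function name `dedup_by_sha256` is hypothetical, and the card does not say whether texts were normalized (whitespace, case) before hashing, so this version hashes the exact UTF-8 bytes.

```python
import hashlib


def dedup_by_sha256(texts):
    """Keep the first occurrence of each exact text, keyed by its SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    samples = ["Merhaba dünya", "Fotosentez nedir?", "Merhaba dünya"]
    print(dedup_by_sha256(samples))
```

Storing digests instead of full texts keeps memory use bounded per sample, which matters at the scale of hundreds of thousands of documents.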

Recommended Threshold

For filtering educational Turkish content, use score >= 3 (following the FineWeb-Edu methodology).
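A minimal sketch of applying this threshold to already-scored documents follows. The function name `filter_educational` and the `use_rounded` flag are hypothetical. The card does not state whether the threshold applies to the raw score or to the rounded integer score, so both options are shown.

```python
def filter_educational(scored_docs, threshold=3.0, use_rounded=False):
    """Return the texts from (text, score) pairs whose score meets the threshold.

    With use_rounded=True, each score is first clamped to [0, 5] and rounded,
    mirroring the integer conversion in the usage snippet.
    """
    kept = []
    for text, score in scored_docs:
        value = int(round(max(0, min(score, 5)))) if use_rounded else score
        if value >= threshold:
            kept.append(text)
    return kept


if __name__ == "__main__":
    docs = [("doc_a", 2.6), ("doc_b", 3.2), ("doc_c", 1.0)]
    print(filter_educational(docs))
    print(filter_educational(docs, use_rounded=True))
```

The two modes differ only for scores between 2.5 and 3.0, so the choice mainly affects borderline documents.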