	---
	language: tr
	license: apache-2.0
	tags:
	- text-classification
	- educational-content
	- turkish
	- fineweb-edu
	- qwen3
	datasets:
	- YsK-dev/TurkWeb-Edu-AnnotationsV3
	base_model: Qwen/Qwen3-0.6B-Base
	pipeline_tag: text-classification
	---

	# TurkWeb-Edu Classifier V3 🇹🇷

	A Turkish educational content classifier that predicts educational quality scores (0-5) for Turkish web text.
	This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).

	## Model Details

	| Component | Details |
	|---|---|
	| Base Model | `Qwen/Qwen3-0.6B-Base` |
	| Architecture | Qwen3 + Regression Head (LoRA fine-tuned, merged) |
	| Teacher Model | `Qwen/Qwen3-30B-A3B-Instruct-2507` |
	| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
	| Task | Regression (0-5 educational quality score) |
	| Language | Turkish (tur_Latn) |

	## Scoring Rubric

	| Score | Meaning |
	|---|---|
	| 0 | Not Educational: spam, ads, NSFW, navigation-only |
	| 1 | Low Quality: personal chat, forum posts, low-quality news |
	| 2 | Medium: general culture, blog, opinion pieces |
	| 3 | Educational: encyclopedic, how-to guides, concept explanations |
	| 4 | High Quality: well-structured, high pedagogical value, technical |
	| 5 | Academic: textbook quality, sourced, in-depth analysis |
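
Because the model is a regressor, its raw output is a float that can fall slightly outside 0-5. The sketch below shows one way to map such a value onto the rubric above by clamping and rounding. The `RUBRIC` dictionary and the `score_to_label` helper are illustrative names, not part of the released model or its repository.

```python
# Short rubric labels copied from the table above (hypothetical helper, not shipped with the model).
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}


def score_to_label(score: float) -> tuple[int, str]:
    """Clamp a raw regression output to [0, 5], round it, and look up the rubric label."""
    clamped = max(0.0, min(float(score), 5.0))
    bucket = int(round(clamped))  # Python rounds exact halves to the nearest even integer
    return bucket, RUBRIC[bucket]


if __name__ == "__main__":
    for raw in (-0.4, 2.2, 3.7, 7.2):
        print(raw, score_to_label(raw))
```

Clamping before rounding keeps out-of-range predictions inside the rubric instead of raising a lookup error.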

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "YsK-dev/TurkWeb-Edu-Classifier-V3"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():
	    # The regression head returns a single logit: the predicted quality score
	    score = model(**inputs).logits.squeeze().item()

	print(f"Score: {score:.2f}")
	print(f"Int Score: {int(round(max(0, min(score, 5))))}")
	# Expected: High score (4-5) for this educational text about photosynthesis
	```

	## Evaluation

	| Metric | Value |
	|---|---|
	| MSE | 1.1642 |
	| RMSE | 1.0790 |
	| MAE | 0.8374 |
	| F1 (edu≥3) | 0.7147 |
	| F1 (weighted) | 0.3956 |
	| Accuracy | 0.3769 |
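
For readers who want to reproduce numbers of this kind on their own held-out data, the following is a minimal sketch of the regression metrics and the binary F1 at the educational cut-off. It assumes that "F1 (edu≥3)" is computed by binarizing both the teacher labels and the predictions at 3; the card does not spell this out, and the evaluation script itself is not shown here. The function names and the toy values are hypothetical. As a consistency check, the reported RMSE is the square root of the reported MSE (√1.1642 ≈ 1.0790).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAE for paired lists of reference and predicted scores."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def binary_f1(y_true, y_pred, threshold=3.0):
    """F1 for the 'educational' class, assuming both sides are binarized at `threshold`."""
    true_pos = [t >= threshold for t in y_true]
    pred_pos = [p >= threshold for p in y_pred]
    tp = sum(1 for t, p in zip(true_pos, pred_pos) if t and p)
    fp = sum(1 for t, p in zip(true_pos, pred_pos) if not t and p)
    fn = sum(1 for t, p in zip(true_pos, pred_pos) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy values for illustration only; these are not the model's evaluation data.
    teacher = [0, 1, 3, 4, 5]
    student = [0.5, 2.0, 2.5, 4.0, 4.5]
    print(regression_metrics(teacher, student))
    print(binary_f1(teacher, student))
```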

	## Training Pipeline

	1. Teacher Annotation: `Qwen/Qwen3-30B-A3B-Instruct-2507` annotated 840K Turkish web samples from FineWeb-2 (`tur_Latn`)
	2. Deduplication: SHA-256 deduplication on the text reduced these to 660K unique samples
	3. Student Training: Qwen3-0.6B-Base + LoRA (r=16) fine-tuned for 3 epochs
	4. Merging: LoRA weights merged into base model for efficient inference
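
The deduplication step can be sketched as exact-match hashing. This is a minimal illustration under stated assumptions: it hashes the raw UTF-8 bytes of each text and keeps the first occurrence. Whether the original pipeline normalized whitespace or case before hashing is not stated, and `dedup_by_sha256` is a hypothetical helper name.

```python
import hashlib


def dedup_by_sha256(texts):
    """Keep the first occurrence of each text, comparing texts by their SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        # Assumption: the raw text is hashed with no normalization.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    samples = ["Fotosentez nedir?", "Hücre zarı", "Fotosentez nedir?"]
    print(dedup_by_sha256(samples))
```

Storing digests rather than full texts keeps the memory footprint of the `seen` set small and constant per sample, which matters at the scale of hundreds of thousands of documents.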

	## Recommended Threshold

	For filtering educational Turkish content, use `score >= 3` (following the FineWeb-Edu methodology).
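
Applied to a corpus, the threshold amounts to a simple filter over already-computed scores. The sketch below assumes the scores were obtained separately (for example with the usage snippet above); `filter_educational` and the sample values are hypothetical.

```python
def filter_educational(docs, scores, threshold=3.0):
    """Return the documents whose predicted score is at or above `threshold`."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    return [doc for doc, score in zip(docs, scores) if score >= threshold]


if __name__ == "__main__":
    # Made-up documents and scores, for illustration only.
    docs = ["ders notu", "reklam", "ansiklopedi maddesi"]
    scores = [3.4, 0.2, 3.0]
    print(filter_educational(docs, scores))
```

Comparing the raw float against the threshold, rather than the rounded integer score, keeps the filter strict: a prediction of 2.6 would round to 3 but is excluded here. Which of the two the FineWeb-Edu pipeline uses is worth checking before matching it exactly.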