	---
	language: tr
	license: apache-2.0
	tags:
	- text-classification
	- educational-content
	- turkish
	- fineweb-edu
	- encoder
	- regression
	datasets:
	- YsK-dev/TurkWeb-Edu-AnnotationsV3
	base_model: boun-tabilab/TabiBERT
	pipeline_tag: text-classification
	---

	# TurkWeb-Edu Classifier V4 🇹🇷

	A Turkish educational content classifier fine-tuned from `boun-tabilab/TabiBERT`. It predicts an educational quality score on a 0-5 scale for Turkish web text.

	This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "YsK-dev/TurkWeb-Edu-Classifier-V4"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
	with torch.no_grad():
	    # The regression head emits one value per input: the predicted quality score.
	    score = model(**inputs).logits.squeeze().item()

	print(f"Score: {score:.2f}")
	print(f"Educational: {score >= 3}")
	```

	## Model Details

	| Component | Details |
	|-----------|---------|
	| Base Model | `boun-tabilab/TabiBERT` |
	| Architecture | Encoder + Regression Head |
	| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
	| Teacher | Qwen3-30B-A3B-Instruct-2507 |
	| Task | Regression (0-5 educational quality score) |
	| Language | Turkish (tur_Latn) |

	## Evaluation

	| Metric | Value |
	|--------|-------|
	| Pearson | 0.8312 |
	| RMSE | 0.8874 |
	| MAE | 0.6416 |
	| F1 (edu≥3) | 0.7197 |
	| Exact Accuracy | 0.5044 |
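
For readers who want comparable numbers on their own held-out split, the following is a minimal sketch of how metrics of this kind are commonly computed. The card does not describe the exact evaluation procedure, so several conventions here are assumptions: F1 treats `score >= 3` on the raw prediction as the positive class, and Exact Accuracy compares predictions clipped to 0-5 and rounded to the nearest integer against integer labels. The function name `regression_metrics` is illustrative.

```python
import math

def regression_metrics(preds, labels, threshold=3.0):
    """Compute Pearson, RMSE, MAE, binary F1 and exact accuracy.

    preds  : list of raw float scores from the model
    labels : list of integer reference scores in 0..5
    """
    n = len(preds)
    mean_p = sum(preds) / n
    mean_y = sum(labels) / n

    # Pearson correlation between predictions and reference scores.
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    pearson = cov / math.sqrt(var_p * var_y)

    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / n)
    mae = sum(abs(p - y) for p, y in zip(preds, labels)) / n

    # Binary F1 for the "educational" class (assumed: raw score >= threshold).
    tp = sum(1 for p, y in zip(preds, labels) if p >= threshold and y >= threshold)
    fp = sum(1 for p, y in zip(preds, labels) if p >= threshold and y < threshold)
    fn = sum(1 for p, y in zip(preds, labels) if p < threshold and y >= threshold)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    # Exact accuracy (assumed: clip to [0, 5], then round to nearest integer).
    exact = sum(
        1 for p, y in zip(preds, labels)
        if int(round(min(max(p, 0.0), 5.0))) == y
    ) / n

    return {"pearson": pearson, "rmse": rmse, "mae": mae, "f1": f1, "exact": exact}
```

Note that Python's built-in `round` rounds halves to the nearest even integer, so a prediction of exactly 2.5 maps to 2 under this convention.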

	## Scoring Rubric

	| Score | Meaning |
	|-------|---------|
	| 0 | Not Educational: spam, ads, NSFW, navigation-only |
	| 1 | Low Quality: personal chat, forum posts, low-quality news |
	| 2 | Medium: general culture, blog, opinion pieces |
	| 3 | Educational: encyclopedic, how-to guides, concept explanations |
	| 4 | High Quality: well-structured, high pedagogical value, technical |
	| 5 | Academic: textbook quality, sourced, in-depth analysis |
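
The model returns a continuous score, whereas the rubric is defined on integers. Below is a minimal sketch of mapping one to the other. It assumes that clipping to the 0-5 range and rounding to the nearest integer is an acceptable convention; the card does not prescribe one. `RUBRIC` and `rubric_level` are illustrative names, not part of this repository.

```python
from typing import Tuple

# Short labels taken from the rubric table.
RUBRIC = {
    0: "Not Educational",
    1: "Low Quality",
    2: "Medium",
    3: "Educational",
    4: "High Quality",
    5: "Academic",
}

def rubric_level(score: float) -> Tuple[int, str]:
    """Clip a raw score to [0, 5], round it, and look up the rubric label."""
    level = int(round(max(0.0, min(score, 5.0))))
    return level, RUBRIC[level]
```

The regression head is unbounded, so predictions slightly below 0 or above 5 can occur; the clipping step keeps the lookup from failing on them.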

	## Recommended Threshold

	For filtering educational Turkish content, use `score >= 3` (following FineWeb-Edu methodology).
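
As an illustration of applying this threshold to a corpus, the sketch below filters documents by a precomputed score. It assumes the scores were already obtained as in the Usage section. The helper `filter_educational` and the `(text, score)` pair layout are hypothetical choices, and the scores in the usage example are made up.

```python
from typing import Iterable, List, Tuple

def filter_educational(
    scored_docs: Iterable[Tuple[str, float]],
    threshold: float = 3.0,
) -> List[str]:
    """Keep only the texts whose score meets the threshold (inclusive)."""
    return [text for text, score in scored_docs if score >= threshold]

# Usage with placeholder documents and invented scores:
docs = [("doc_a", 3.6), ("doc_b", 0.4), ("doc_c", 3.0)]
kept = filter_educational(docs)
```

A higher threshold trades recall for precision; the default here matches the `score >= 3` recommendation.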