YsK-dev committed · Commit 9d1d392 · verified · 1 Parent(s): 822510c

Upload README.md with huggingface_hub

  1. README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
---
language: tr
license: apache-2.0
tags:
- text-classification
- educational-content
- turkish
- fineweb-edu
- encoder
- regression
datasets:
- YsK-dev/TurkWeb-Edu-AnnotationsV3
base_model: boun-tabilab/TabiBERT
pipeline_tag: text-classification
---

# TurkWeb-Edu Classifier V4 🇹🇷

A fast, accurate classifier for Turkish educational content. It predicts an educational quality score (0-5) for Turkish web text.

**This is the Turkish equivalent of [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).**

Change in this revision: the learning rate (lr) was changed.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the classifier (encoder + regression head) from the Hub.
model_name = "YsK-dev/TurkWeb-Edu-Classifier-V5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "Fotosentez, bitkilerin güneş ışığını kullanarak karbondioksit ve suyu glikoz ve oksijene dönüştürdüğü biyokimyasal bir süreçtir."

# Tokenize a single text, truncating long inputs to 1024 tokens.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    # The regression head yields one value; convert it to a Python float.
    score = model(**inputs).logits.squeeze().item()

print(f"Score: {score:.2f}")
print(f"Educational: {score >= 3}")
```

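The score printed above is a raw regression output, so it can land between rubric levels or slightly outside the 0-5 range. Below is a minimal sketch of one way to turn it into an integer level, assuming the clamp-and-round convention shown in the FineWeb-Edu classifier model card. `to_int_score` is a hypothetical helper name, not part of this model's API.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round it to an integer level.

    Assumption: clamp-then-round, as in the FineWeb-Edu classifier model card.
    Python's round() rounds exact halves to the nearest even integer.
    """
    clamped = max(0.0, min(score, 5.0))
    return int(round(clamped))


if __name__ == "__main__":
    # Illustrative raw scores only; these are not outputs of the real model.
    for raw in (-0.4, 2.2, 3.7, 6.2):
        print(f"raw={raw:.1f} -> level={to_int_score(raw)}")
```

The raw float is usually the better value to store, since the threshold can then be changed later without re-scoring.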
## Model Details

| Component | Details |
|-----------|---------|
| Base Model | `boun-tabilab/TabiBERT` |
| Architecture | Encoder + Regression Head |
| Training Data | [YsK-dev/TurkWeb-Edu-AnnotationsV3](https://huggingface.co/datasets/YsK-dev/TurkWeb-Edu-AnnotationsV3) (660K samples) |
| Teacher | Qwen3-30B-A3B-Instruct-2507 |
| Task | Regression (0-5 educational quality score) |
| Language | Turkish (tur_Latn) |

## Evaluation

| Metric | Value |
|--------|-------|
| Pearson | 0.8407 |
| RMSE | 0.8725 |
| MAE | 0.6240 |
| F1 (edu≥3) | 0.7221 |
| Exact Accuracy | 0.5152 |

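For readers who want to compute comparable numbers on their own held-out data, here is a minimal pure-Python sketch of the five metrics. The model card does not document how they were computed, so two things are assumptions: F1 treats `score >= 3` on the raw prediction as the positive class, and exact accuracy compares the clamped, rounded prediction with the integer label. `evaluate`, `pearson`, and `clamp_round` are hypothetical names.

```python
import math


def clamp_round(score: float) -> int:
    """Clamp to [0, 5] and round to the nearest integer level."""
    return int(round(max(0.0, min(score, 5.0))))


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(preds, labels, threshold: float = 3.0) -> dict:
    """Compute Pearson, RMSE, MAE, binary F1, and exact accuracy."""
    n = len(preds)
    errors = [p - y for p, y in zip(preds, labels)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # Binary view: a document is "educational" if its score is >= threshold.
    pred_pos = [p >= threshold for p in preds]
    true_pos = [y >= threshold for y in labels]
    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(t and not p for p, t in zip(pred_pos, true_pos))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    # Exact accuracy: rounded prediction equals the integer label.
    exact = sum(clamp_round(p) == y for p, y in zip(preds, labels)) / n

    return {
        "pearson": pearson(preds, labels),
        "rmse": rmse,
        "mae": mae,
        "f1": f1,
        "exact_accuracy": exact,
    }


if __name__ == "__main__":
    # Toy values for illustration only; not the model's evaluation data.
    print(evaluate([0.2, 1.4, 2.8, 3.6, 4.9], [0, 1, 3, 4, 5]))
```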
## Scoring Rubric

| Score | Meaning |
|-------|---------|
| 0 | Not Educational: spam, ads, NSFW, navigation-only |
| 1 | Low Quality: personal chat, forum posts, low-quality news |
| 2 | Medium: general culture, blog, opinion pieces |
| 3 | Educational: encyclopedic, how-to guides, concept explanations |
| 4 | High Quality: well-structured, high pedagogical value, technical |
| 5 | Academic: textbook quality, sourced, in-depth analysis |

## Recommended Threshold

To filter for educational Turkish content, keep documents with `score >= 3`, following the FineWeb-Edu methodology.
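
The threshold rule can be applied to a stream of documents as sketched below. This is a sketch under assumptions, not a documented pipeline: `filter_educational` and `score_fn` are hypothetical names, and `score_fn` stands for any callable that returns the model's raw score for a text (for example, a wrapper around the snippet in the Usage section).

```python
from typing import Callable, Iterable, Iterator, Tuple


def filter_educational(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,
) -> Iterator[Tuple[str, float]]:
    """Yield (text, score) pairs whose score meets the threshold."""
    for text in texts:
        score = score_fn(text)
        if score >= threshold:
            yield text, score


if __name__ == "__main__":
    # Stand-in scorer with made-up scores, so the sketch runs without the model.
    fake_scores = {"ders notu": 3.4, "reklam": 0.3, "ansiklopedi": 4.1}
    for text, score in filter_educational(fake_scores, fake_scores.get):
        print(f"{score:.1f}  {text}")
```

Because the function is a generator, it can process a large corpus without holding every document in memory.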