Update README.md
README.md
CHANGED
@@ -4,6 +4,17 @@ tags:
- pytorch_model_hub_mixin
---

# FineWeb2-RoEdu-Classifier

**FineWeb2-RoEdu-Classifier** is a lightweight quality classifier for Romanian text. It is designed to distinguish high-quality educational content from generic web text. The model was trained on data annotated by [Gemma3 12B](https://huggingface.co/google/gemma-3-12b-it). More details can be found in the [accompanying paper](https://arxiv.org/abs/2511.01090).

## Key Features

* **Educational Quality Scoring**: The model assigns a scalar score (typically 0-5) to text, reflecting its educational value and coherence.
* **Topic, Format, and Educational Level**: The model also predicts a topic, a format, and an educational level for each text; these signals can be used for diversity filtering.
* **Distilled Knowledge**: It is trained on Romanian web samples annotated by **Gemma3 12B**, distilling the larger model's judgment into a more efficient architecture.
* **Proven Effectiveness**: We showed that using data curated by this classifier improved results on several benchmarks (ARC, HellaSwag).

## Usage

A demo is available in the [FineWeb2-RoEdu-ClassifierDemo repository](https://github.com/VladNegoita/FineWeb2-RoEdu-ClassifierDemo/).
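
Once documents have been scored, the educational score and the predicted topic could be combined in a downstream filtering step. The following is only a minimal sketch of that idea, not the classifier's own API: the `ScoredDoc` fields, the `filter_docs` helper, the threshold of 3.0, and the per-topic cap are all hypothetical, and the sample scores are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    """A document with classifier outputs attached (field names are hypothetical)."""
    text: str
    edu_score: float  # educational quality score, typically in the 0-5 range
    topic: str        # predicted topic label


def filter_docs(docs, min_score=3.0, max_per_topic=2):
    """Keep docs scoring at least min_score, with at most max_per_topic per topic."""
    kept = []
    per_topic = defaultdict(int)
    # Visit the highest-scoring documents first so the cap keeps the best ones.
    for doc in sorted(docs, key=lambda d: d.edu_score, reverse=True):
        if doc.edu_score < min_score:
            break  # sorted descending, so all remaining docs are below the threshold
        if per_topic[doc.topic] < max_per_topic:
            kept.append(doc)
            per_topic[doc.topic] += 1
    return kept


# Toy data: texts, scores, and topics are made up for this example.
docs = [
    ScoredDoc("Teorema lui Pitagora ...", 4.2, "math"),
    ScoredDoc("Fotosinteza este ...", 3.7, "biology"),
    ScoredDoc("Reduceri de weekend ...", 0.8, "shopping"),
    ScoredDoc("Ecuații de gradul doi ...", 3.1, "math"),
    ScoredDoc("Numere prime ...", 3.0, "math"),
]
kept = filter_docs(docs)
for doc in kept:
    print(f"{doc.edu_score:.1f}  {doc.topic:8s}  {doc.text}")
```

The threshold controls how strict the quality filter is, while the per-topic cap is one simple way to use the topic signal for diversity; both values would need tuning on real data.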