---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# FineWeb2-RoEdu-Classifier

**FineWeb2-RoEdu-Classifier** is a lightweight quality classifier for the Romanian language. It is designed to distinguish high-quality educational content from generic web text. The model was trained on data annotated by [Gemma3 12B](https://huggingface.co/google/gemma-3-12b-it). More details can be found in the [paper](https://arxiv.org/abs/2511.01090).

## Key Features

* **Educational Quality Scoring**: The model assigns a scalar score (typically 0-5) to a text, reflecting its educational value and coherence.
* **Topic, Format and Educational Level**: The model also predicts additional signals that can be used for diversity filtering.
* **Distilled Knowledge**: It is trained on Romanian web samples annotated by **Gemma3 12B**, effectively distilling the larger model's judgment into a more efficient architecture.
* **Proven Effectiveness**: We showed that using data curated by this classifier improved several benchmark metrics (ARC, HellaSwag).
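
As a rough illustration of how the scalar score might be used for data curation, the sketch below keeps only texts whose score reaches a threshold. Everything named here is an assumption made for the example: `filter_by_edu_score`, the threshold of `3.0`, and `toy_score` are hypothetical, and `toy_score` is a stand-in that does not reflect how the classifier scores text. In practice, `score_fn` would wrap a call to the model.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_edu_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 3.0,  # assumed cutoff on the 0-5 scale, not a recommended value
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score is at least `threshold`."""
    kept = []
    for text in docs:
        score = score_fn(text)
        if score >= threshold:
            kept.append((text, score))
    return kept


def toy_score(text: str) -> float:
    # Placeholder scorer, NOT the classifier: half a point per word, capped at 5.
    return min(5.0, len(text.split()) / 2)


docs = ["a b", "a b c d e f g h"]
kept = filter_by_edu_score(docs, toy_score)
```

Passing the scorer in as a function keeps the filtering logic independent of how the model is loaded, so the same loop works with batched or cached scores.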

## Usage

You can find a demo [here](https://github.com/VladNegoita/FineWeb2-RoEdu-ClassifierDemo/).