Update README.md
README.md
@@ -18,26 +18,22 @@ datasets:
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
+
 # scandi-fine-web-cleaner
 
-This model is a
+This model is a demo classifier for identifying problematic content (incorrect language, garbled text) in Danish and Swedish web text. It was created as part of a [blog post](https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html) exploring how to filter web data using community annotations. The model was created by fine-tuning [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the [data-is-better-together/fineweb-c](https://huggingface.co/datasets/data-is-better-together/fineweb-c) dataset.
+
 It achieves the following results on the evaluation set:
--
--
-- Recall: 0.7018
+- Precision: 0.9524 (95.2%)
+- Recall: 0.7018 (70.2%)
 - F1: 0.8081
--
-- Balanced Accuracy: 0.8480
-- Average Precision: 0.8906
-
-## Model description
-
-More information needed
+- AUC-ROC: 0.9648
 
 ## Intended uses & limitations
 
-
+The model is intended to be used as a preliminary filter for web text to help improve annotation efficiency. It has only been tested on Danish and Swedish content. The high precision (95.2%) means false positives are rare, while the recall (70.2%) indicates it catches most problematic content.
 
+[blog]: <link-to-blog-post>
 ## Training and evaluation data
 
 More information needed
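As a quick sanity check on the metrics in the updated card, the reported F1 should equal the harmonic mean of the reported precision and recall. The sketch below recomputes it from the two figures in the diff. The helper name `f1_score` is illustrative only and is not taken from the model's training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the updated model card.
precision = 0.9524
recall = 0.7018

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.8081, matching the card

# Share of problematic documents the filter would miss at this recall.
print(f"Missed = {1 - recall:.1%}")  # Missed = 29.8%
```

The recomputed value agrees with the F1 of 0.8081 listed in the card. The second figure makes the recall concrete: at 70.2% recall, roughly three in ten problematic documents would pass through the filter.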