davanstrien HF Staff committed on
Commit a5c0962 · verified · 1 Parent(s): 842ea78

Update README.md

  1. README.md +8 -12
README.md CHANGED
@@ -18,26 +18,22 @@ datasets:
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

+
  # scandi-fine-web-cleaner

- This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the data-is-better-together/fineweb-c dataset.
+ This model is a demo classifier for identifying problematic content (incorrect language, garbled text) in Danish and Swedish web text. It was created as part of a [blog post](https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html) exploring how to filter web data using community annotations. The model was created by fine-tuning [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the [data-is-better-together/fineweb-c](https://huggingface.co/datasets/data-is-better-together/fineweb-c) dataset.
+
  It achieves the following results on the evaluation set:
- - Loss: 0.1816
- - Precision: 0.9524
- - Recall: 0.7018
+ - Precision: 0.9524 (95.2%)
+ - Recall: 0.7018 (70.2%)
  - F1: 0.8081
- - Auc Roc: 0.9648
- - Balanced Accuracy: 0.8480
- - Average Precision: 0.8906
-
- ## Model description
-
- More information needed
+ - AUC-ROC: 0.9648

  ## Intended uses & limitations

- More information needed
+ The model is intended to be used as a preliminary filter for web text to help improve annotation efficiency. It has only been tested on Danish and Swedish content. The high precision (95.2%) means false positives are rare, while the recall (70.2%) indicates it catches most problematic content.

+ [blog]: <link-to-blog-post>
  ## Training and evaluation data

  More information needed
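
The F1 value kept in the card can be cross-checked against the precision and recall listed next to it. The following is a minimal sketch, assuming F1 here is the standard harmonic mean of precision and recall (the card does not state how it was computed). `f1_score` is a hypothetical helper written for this check, not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation results in the model card.
precision = 0.9524
recall = 0.7018

f1 = f1_score(precision, recall)
miss_rate = 1 - recall  # share of problematic texts the filter does not flag

print(round(f1, 4))         # 0.8081
print(round(miss_rate, 4))  # 0.2982
```

The rounded result agrees with the F1 of 0.8081 in the card. The second number restates the recall from the other side: under these figures, roughly 30% of problematic texts would pass through the filter unflagged.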