davanstrien/scandi-fine-web-cleaner

davanstrien HF Staff committed on Jan 13, 2025

Commit a5c0962 · verified · 1 Parent(s): 842ea78

Update README.md

Files changed (1): README.md (+8 -12)

@@ -18,26 +18,22 @@ datasets:
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 # scandi-fine-web-cleaner
-This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the data-is-better-together/fineweb-c dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.1816
-- Precision: 0.9524
-- Recall: 0.7018
 - F1: 0.8081
-- Auc Roc: 0.9648
-- Balanced Accuracy: 0.8480
-- Average Precision: 0.8906
-## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
 More information needed

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 # scandi-fine-web-cleaner
+This model is a demo classifier for identifying problematic content (incorrect language, garbled text) in Danish and Swedish web text. It was created as part of a [blog post](https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html) exploring how to filter web data using community annotations. The model was created by fine-tuning [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on the [data-is-better-together/fineweb-c](https://huggingface.co/datasets/data-is-better-together/fineweb-c) dataset.
 It achieves the following results on the evaluation set:
+- Precision: 0.9524 (95.2%)
+- Recall: 0.7018 (70.2%)
 - F1: 0.8081
+- AUC-ROC: 0.9648
 ## Intended uses & limitations
+The model is intended to be used as a preliminary filter for web text to help improve annotation efficiency. It has only been tested on Danish and Swedish content. The high precision (95.2%) means false positives are rare, while the recall (70.2%) indicates it catches most problematic content.
+[blog]: <link-to-blog-post>
 ## Training and evaluation data
 More information needed
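
As a rough sanity check on the metrics in the updated card, the sketch below recomputes F1 from the reported precision and recall and translates the two rates into counts. The batch of 1,000 problematic documents is a hypothetical figure chosen for illustration; it does not come from the model card or the evaluation set.

```python
# Metrics copied from the updated model card.
precision = 0.9524
recall = 0.7018
reported_f1 = 0.8081

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical batch (assumption): 1,000 documents that are truly problematic.
n_problematic = 1000
caught = round(recall * n_problematic)   # true positives
missed = n_problematic - caught          # false negatives
flagged = round(caught / precision)      # everything the model flags
false_alarms = flagged - caught          # false positives

print(f"recomputed F1: {f1:.4f} (reported: {reported_f1})")
print(f"caught: {caught}, missed: {missed}")
print(f"flagged: {flagged}, of which false alarms: {false_alarms}")
```

Under these assumptions the recomputed F1 agrees with the reported 0.8081, and the counts show the trade-off behind the card's wording: few clean documents are flagged by mistake, while roughly three in ten problematic documents pass through unflagged, which is why the card positions the model as a preliminary filter.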