Aleph-Alpha/Aleph-Alpha-GermanWeb-Quality-Classifier-fastText


LetiP committed on Apr 30, 2025

Commit d8c2469 (verified) · 1 parent: f813a3e

Update README.md

Files changed (1)

README.md +1 -1


@@ -16,7 +16,7 @@ To train Aleph-Alpha-GermanWeb-Quality-Classifier-fastText, we used an LLM-as-a-
 For each document, we calculated a combined educational quality score by taking the minimum over the three criteria rated by the LLM-as-a-judge. We then used these educational quality scores as the training signal for the quality classification model. The Aleph-Alpha-GermanWeb-Quality-Classifier-fastText model was tasked with distinguishing between texts with educational quality scores of one or two (“low quality”) vs. four or five (“high quality”) given the document's text.
-We trained Aleph-Alpha-GermanWeb-Quality-Classifier-fastText using 185,403 documents in each class. We used 95% of the data (and the remaining 5% for validation) to train a fastText model to classify between high and low quality text data. It reached 92% precision and 91.5% recall on the validation set.
+We trained Aleph-Alpha-GermanWeb-Quality-Classifier-fastText using 185,403 documents in each class. We used 95% of the data (and the remaining 5% for validation) to train a fastText model to classify between high and low quality text data. It reached 77% precision and 77% recall on the validation set.
 Further details, including our LLM judging prompt, can be found in our accompanying paper (link to paper coming soon).
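
The labeling and split described in the README text above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the criterion names, the document layout (`text` plus a `ratings` dict), the helper names, and the choice to leave out documents with a combined score of three are all hypothetical. The only fastText-specific convention used is the `__label__` prefix of its supervised training format. Class balancing (185,403 documents per class in the README) is not shown.

```python
import random

# Hypothetical criterion names; the README only says there are three criteria.
CRITERIA = ("criterion_a", "criterion_b", "criterion_c")


def combined_score(ratings):
    """Combined educational quality score: the minimum over the three criteria."""
    return min(ratings[name] for name in CRITERIA)


def quality_label(score):
    """Map a combined score to a class label.

    Scores of 1-2 are "low", 4-5 are "high". Returning None for a score of 3
    is an assumption: the README only names the two classes being separated.
    """
    if score <= 2:
        return "low"
    if score >= 4:
        return "high"
    return None


def to_fasttext_line(label, text):
    """Format one example in fastText's supervised format (one example per line)."""
    one_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"__label__{label} {one_line}"


def build_splits(docs, train_fraction=0.95, seed=0):
    """Label documents, shuffle them, and split into training and validation lines."""
    lines = []
    for doc in docs:
        label = quality_label(combined_score(doc["ratings"]))
        if label is not None:
            lines.append(to_fasttext_line(label, doc["text"]))
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_fraction)
    return lines[:cut], lines[cut:]
```

Writing the two lists to text files would allow training with `fasttext.train_supervised(input=...)` and evaluating with `model.test(...)`, which returns the number of examples together with precision and recall. The hyperparameters used for the released model are not stated in this diff, so none are assumed here.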