Taja Kuzman
committed on
Update README.md
README.md CHANGED
## Training data

The model was fine-tuned on a training dataset consisting of 15,000 news articles in four languages (Croatian, Slovenian, Catalan and Greek). The news texts were extracted from the [MaCoCu-Genre web corpora](http://hdl.handle.net/11356/1969) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The training dataset was automatically annotated with the IPTC Media Topic labels by the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (yielding 0.72 micro-F1 and 0.73 macro-F1 on the test dataset).
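The genre-based filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual extraction code: `filter_news` and `predict_genre` are hypothetical names, and the commented-out `transformers` pipeline usage is an assumption based on the standard text-classification API (the model name and the "News" label come from the text above).

```python
def filter_news(texts, predict_genre):
    """Keep only the texts whose predicted genre label is 'News'."""
    return [t for t in texts if predict_genre(t) == "News"]

# With Hugging Face transformers, predict_genre could be built like this
# (assumption: standard text-classification pipeline usage, truncating
# long inputs to the model's context window):
#
# from transformers import pipeline
# clf = pipeline(
#     "text-classification",
#     model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
# )
# predict_genre = lambda text: clf(text, truncation=True)[0]["label"]

if __name__ == "__main__":
    # Stand-in classifier for demonstration, so the sketch runs offline.
    sample = [
        "Elections were held across the country on Sunday.",
        "Add two cups of flour and mix well.",
    ]
    fake_predict = lambda t: "News" if "Elections" in t else "Instruction"
    print(filter_news(sample, fake_predict))
```

In the actual corpus construction this filter would run over the MaCoCu web documents, retaining only those predicted as "News" before the GPT-4o annotation step.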