---
tags:
- sentence-transformers
- transformers
- SetFit
- News
datasets: KnutJaegersberg/News_topics_IPTC_codes_long
pipeline_tag: text-classification
---
|
|
| # IPTC topic classifier (multilingual) |
|
|
A SetFit model fitted on 166 downsampled multilingual IPTC Subject labels (keywords from the lowest hierarchy level concatenated into artificial sentences) to predict the mid-level news categories.
The classifier is intended as a weak labeler to support exploring corpora, since the representations of these keyword descriptions are only approximations of real documents from those topics.
The dataset used to train the model is based on this file:
| https://huggingface.co/datasets/KnutJaegersberg/News_topics_IPTC_codes_long |
|
|
Eval metrics:
- Accuracy on the highest-level labels: 0.9779412
- Accuracy / F1 / MCC on the mid-level labels: 0.6992481 / 0.6666667 / 0.6992617
|
|
More interestingly, I used the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate the performance:
| https://www.kaggle.com/datasets/rmisra/news-category-dataset |
|
|
While an MCC of 0.1968043 on this dataset does not sound as good as before, the mistakes can usually also be read as a re-interpretation: e.g., news on arrests that was categorized as entertainment in the Huffington Post dataset was put into the crime category by the classifier.
My current impression is that the system is useful for its intended purpose.
|
|
|
|
|
|
| The numeric categories can be joined with the labels by using this table: |
| https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels |
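As a sketch of that join (the table excerpt and the column names `code` and `label` below are assumptions, not the actual schema of the linked dataset), using pandas:

```python
import pandas as pd

# Hypothetical excerpt of the label table; the real one lives in the
# linked dataset repo, and the column names ("code", "label") are assumed.
label_table = pd.DataFrame(
    {
        "code": [1000000, 2000000],
        "label": ["arts, culture and entertainment", "crime, law and justice"],
    }
)

# Numeric predictions from the classifier
preds = pd.DataFrame({"code": [2000000, 1000000]})

# Join the predictions with human-readable labels
labeled = preds.merge(label_table, on="code", how="left")
print(labeled)
```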
|
|
|
|
The Hugging Face inference API box on the right does not yet handle SetFit models; I can't do anything about that.
|
|
|
|
Use it like any other SetFit model:

```python
from setfit import SetFitModel

# Download from the Hub and run inference
model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")
preds = model(["Rachel Dolezal Faces Felony Charges For Welfare Fraud", "Elon Musk just got lucky", "The hype on AI is different from the hype on other tech topics"])
```