TajaKuzmanPungersek committed on
Commit 86f6671
0 Parent(s):

Duplicate from classla/multilingual-IPTC-news-topic-classifier


Co-authored-by: Taja Kuzman Pungeršek <TajaKuzmanPungersek@users.noreply.huggingface.co>

.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- IPTC
- news
- news topic
- IPTC topic
- IPTC NewsCode
- topic categorization
widget:
- text: >-
    Moment dog sparks house fire after chewing power bank An indoor monitoring
    camera shows the moment a dog unintentionally caused a house fire after
    chewing on a portable lithium-ion battery power bank.
  example_title: English
- text: >-
    Ministarstvo unutarnjih poslova posljednjih mjeseci radilo je na izradi
    Nacrta prijedloga Zakona o strancima. Naime, važeći Zakon o strancima
    usklađen je s 22 direktive, preporuke, odluke i rezolucije, te s obzirom da
    je riječ o velikom broju odredaba potrebno ih je jasnije propisati, a sve u
    cilju poboljšanja transparentnosti i preglednosti.
  example_title: Croatian
- text: >-
    V okviru letošnjega praznovanja spominskega dneva občine Trebnje Baragov dan
    je v soboto, 28. junija 2014, na obvezni god Marijinega Srca v župnijski
    cerkvi v Trebnjem daroval mašo za domovino apostolski nuncij v Republiki
    Sloveniji Njegova ekselenca Nadškof msgr. Juliusz Janusz.
  example_title: Slovenian
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
---

# Multilingual IPTC Media Topic Classifier

News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
and fine-tuned on a [news corpus in 4 languages](http://hdl.handle.net/11356/1991) (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
The development and evaluation of the model are described in the paper
[LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification](https://doi.org/10.1109/ACCESS.2025.3544814) (Kuzman and Ljubešić, 2025).

The model classifies news texts into topic labels from the
[IPTC NewsCodes schema](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) and can be
applied to news text in any language supported by `xlm-roberta-large`.

On a manually annotated test set (in Croatian, Slovenian, Catalan and Greek),
the model achieves a macro-F1 score of 0.746, a micro-F1 score of 0.734 and an accuracy of 0.734,
outperforming the GPT-4o model (version `gpt-4o-2024-05-13`) used in a zero-shot setting.
If only labels predicted with a confidence score equal to or higher than 0.90 are used,
the model achieves micro-F1 and macro-F1 scores of 0.80.

## Intended use and limitations

For reliable results, the classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words).

Use example:

```python
from transformers import pipeline

# Load the classification pipeline; to run on a CPU, remove the "device" argument
classifier = pipeline("text-classification", model="classla/multilingual-IPTC-news-topic-classifier", device=0, max_length=512, truncation=True)

# Example texts to classify
texts = [
    """Slovenian handball team makes it to Paris Olympics semifinal Lille, 8 August - Slovenia defeated Norway 33:28 in the Olympic men's handball tournament in Lille late on Wednesday to advance to the semifinal where they will face Denmark on Friday evening. This is the best result the team has so far achieved at the Olympic Games and one of the best performances in the history of Slovenia's team sports squads.""",
    """Moment dog sparks house fire after chewing power bank An indoor monitoring camera shows the moment a dog unintentionally caused a house fire after chewing on a portable lithium-ion battery power bank. In the video released by Tulsa Fire Department in Oklahoma, two dogs and a cat can be seen in the living room before a spark started the fire that spread within minutes. Tulsa Fire Department public information officer Andy Little said the pets escaped through a dog door, and according to local media the family was also evacuated safely. "Had there not been a dog door, they very well could have passed away," he told CBS affiliate KOTV."""]

# Classify the texts
results = classifier(texts)

# Output the results
for result in results:
    print(result)

## Output
## {'label': 'sport', 'score': 0.9985264539718628}
## {'label': 'disaster, accident and emergency incident', 'score': 0.9957459568977356}
```

The code for annotating massive corpora with topic labels is available [here](https://colab.research.google.com/drive/1IYDNsr8E5Nj6ehxLaa4v4RfOqwGWPYB7?usp=sharing).

## IPTC Media Topic categories

The classifier uses the top level of the [IPTC Media Topic NewsCodes](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) schema, consisting of 17 labels.

### List of labels

```
labels_list = ['education', 'human interest', 'society', 'sport', 'crime, law and justice',
               'disaster, accident and emergency incident', 'arts, culture, entertainment and media', 'politics',
               'economy, business and finance', 'lifestyle and leisure', 'science and technology',
               'health', 'labour', 'religion', 'weather', 'environment', 'conflict, war and peace']

labels_map = {0: 'education', 1: 'human interest', 2: 'society', 3: 'sport', 4: 'crime, law and justice',
              5: 'disaster, accident and emergency incident', 6: 'arts, culture, entertainment and media',
              7: 'politics', 8: 'economy, business and finance', 9: 'lifestyle and leisure', 10: 'science and technology',
              11: 'health', 12: 'labour', 13: 'religion', 14: 'weather', 15: 'environment', 16: 'conflict, war and peace'}
```
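
Since the pipeline returns label names as strings, a reverse map is handy when integer IDs are needed downstream. A minimal sketch (the names `id2label` and `label2id` are illustrative; the IDs match the mapping above and the model's `config.json`):

```python
# Label IDs follow the order of labels_list (as stored in the model's config.json)
labels_list = ['education', 'human interest', 'society', 'sport', 'crime, law and justice',
               'disaster, accident and emergency incident', 'arts, culture, entertainment and media',
               'politics', 'economy, business and finance', 'lifestyle and leisure',
               'science and technology', 'health', 'labour', 'religion', 'weather',
               'environment', 'conflict, war and peace']

id2label = dict(enumerate(labels_list))                        # e.g. 3 -> 'sport'
label2id = {label: i for i, label in enumerate(labels_list)}   # e.g. 'sport' -> 3

print(label2id['disaster, accident and emergency incident'])   # 5
```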

### Description of labels

The descriptions of the labels are based on the descriptions provided in the [IPTC Media Topic NewsCodes schema](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html)
and enriched with information on which specific subtopics belong to the top-level topics, based on the IPTC Media Topic label hierarchy.

| Label | Description |
|:------------------------------------------|:------------|
| disaster, accident and emergency incident | Man-made or natural events resulting in injuries, death or damage, e.g., explosions, transport accidents, famine, drowning, natural disasters, emergency planning and response. |
| human interest | News about the life and behavior of royalty and celebrities, news about obtaining awards, ceremonies (graduation, wedding, funeral, celebration of launching something), birthdays and anniversaries, and news about silly or stupid human errors. |
| politics | News about the local, regional, national and international exercise of power, including news about elections, fundamental rights, government, non-governmental organisations, political crises, non-violent international relations, public employees, government policies. |
| education | All aspects of furthering knowledge, formally or informally, including news about schools, curricula, grading, remote learning, teachers and students. |
| crime, law and justice | News about committed crimes and illegal activities, the system of courts, law and law enforcement (e.g., judges, lawyers, trials, punishments of offenders). |
| economy, business and finance | News about companies, products and services, any kind of industries, national economy, international trading, banks, (crypto)currency, business and trade societies, economic trends and indicators (inflation, employment statistics, GDP, mortgages, ...), international economic institutions, utilities (electricity, heating, waste management, water supply). |
| conflict, war and peace | News about terrorism, wars, war victims, cyber warfare, civil unrest (demonstrations, riots, rebellions), peace talks and other peace activities. |
| arts, culture, entertainment and media | News about cinema, dance, fashion, hairstyle, jewellery, festivals, literature, music, theatre, TV shows, painting, photography, woodworking, art exhibitions, libraries and museums, language, cultural heritage, news media, radio and television, social media, influencers, and disinformation. |
| labour | News about employment, employment legislation, employees and employers, commuting, parental leave, volunteering, wages, social security, labour market, retirement, unemployment, unions. |
| weather | News about weather forecasts, weather phenomena and weather warnings. |
| religion | News about religions, cults, religious conflicts, relations between religion and government, churches, religious holidays and festivals, religious leaders and rituals, and religious texts. |
| society | News about social interactions (e.g., networking), demographic analyses, population census, discrimination, efforts for inclusion and equity, emigration and immigration, communities of people and minorities (LGBTQ, older people, children, indigenous people, etc.), homelessness, poverty, societal problems (addictions, bullying), ethical issues (suicide, euthanasia, sexual behavior) and social services and charity, relationships (dating, divorce, marriage), family (family planning, adoption, abortion, contraception, pregnancy, parenting). |
| health | News about diseases, injuries, mental health problems, health treatments, diets, vaccines, drugs, government health care, hospitals, medical staff, health insurance. |
| environment | News about climate change, energy saving, sustainability, pollution, population growth, natural resources, forests, mountains, bodies of water, ecosystem, animals, flowers and plants. |
| lifestyle and leisure | News about hobbies, clubs and societies, games, lottery, enthusiasm about food or drinks, car/motorcycle lovers, public holidays, leisure venues (amusement parks, cafes, bars, restaurants, etc.), exercise and fitness, outdoor recreational activities (e.g., fishing, hunting), travel and tourism, mental well-being, parties, maintaining and decorating house and garden. |
| science and technology | News about natural sciences and social sciences, mathematics, technology and engineering, scientific institutions, scientific research, scientific publications and innovation. |
| sport | News about sports that can be executed in competitions, e.g., basketball, football, swimming, athletics, chess, dog racing, diving, golf, gymnastics, martial arts, climbing, etc.; sport achievements, sport events, sport organisation, sport venues (stadiums, gymnasiums, ...), referees, coaches, sport clubs, drug use in sport. |

## Training data

The model was fine-tuned on the training split of the [EMMediaTopic 1.0 dataset](http://hdl.handle.net/11356/1991), consisting of 15,000 news articles in four languages (Croatian, Slovenian, Catalan and Greek).
The news texts were extracted from the [MaCoCu-Genre web corpora](http://hdl.handle.net/11356/1969) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
The training dataset was automatically annotated with the IPTC Media Topic labels by
the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (which yields 0.72 micro-F1 and 0.73 macro-F1 on the test dataset).

The code for the development and evaluation of the model is available in [this GitHub repository](https://github.com/TajaKuzman/IPTC-Media-Topic-Classification).

Label distribution in the training dataset:

| labels | count | proportion |
|:------------------------------------------|--------:|-------------:|
| sport | 2300 | 0.153333 |
| arts, culture, entertainment and media | 2117 | 0.141133 |
| politics | 2018 | 0.134533 |
| economy, business and finance | 1670 | 0.111333 |
| human interest | 1152 | 0.0768 |
| education | 990 | 0.066 |
| crime, law and justice | 884 | 0.0589333 |
| health | 675 | 0.045 |
| disaster, accident and emergency incident | 610 | 0.0406667 |
| society | 481 | 0.0320667 |
| environment | 472 | 0.0314667 |
| lifestyle and leisure | 346 | 0.0230667 |
| science and technology | 340 | 0.0226667 |
| conflict, war and peace | 311 | 0.0207333 |
| labour | 288 | 0.0192 |
| religion | 258 | 0.0172 |
| weather | 88 | 0.00586667 |

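The proportion column is simply count / total over the 15,000 training instances; a quick check on a selection of the counts above (the `counts` dict is illustrative, taken from the table):

```python
# Label counts from the training-set distribution table (selection)
counts = {
    "sport": 2300,
    "arts, culture, entertainment and media": 2117,
    "politics": 2018,
    "weather": 88,
}
total = 15000  # size of the EMMediaTopic training split

proportions = {label: count / total for label, count in counts.items()}
print(round(proportions["sport"], 6))    # 0.153333
print(round(proportions["weather"], 8))  # 0.00586667
```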
## Performance

The model was evaluated on a manually annotated test set in four languages (Croatian, Slovenian, Catalan and Greek),
consisting of 1,129 instances.
The test set contains similar amounts of text from each of the four languages and is roughly balanced across labels.

The model achieves a micro-F1 score of 0.734 and a macro-F1 score of 0.746. The results for the entire test set and per language:

| | Micro-F1 | Macro-F1 | Accuracy | No. of instances |
|:---|-----------:|-----------:|-----------:|-----------:|
| All (combined) | 0.734278 | 0.745864 | 0.734278 | 1129 |
| Croatian | 0.728522 | 0.733725 | 0.728522 | 291 |
| Catalan | 0.715356 | 0.722304 | 0.715356 | 267 |
| Slovenian | 0.758865 | 0.764784 | 0.758865 | 282 |
| Greek | 0.733564 | 0.747129 | 0.733564 | 289 |

Performance per label:

| | precision | recall | f1-score | support |
|:------------------------------------------|------------:|---------:|-----------:|------------:|
| arts, culture, entertainment and media | 0.602151 | 0.875 | 0.713376 | 64 |
| conflict, war and peace | 0.611111 | 0.916667 | 0.733333 | 36 |
| crime, law and justice | 0.861538 | 0.811594 | 0.835821 | 69 |
| disaster, accident and emergency incident | 0.691176 | 0.886792 | 0.77686 | 53 |
| economy, business and finance | 0.779221 | 0.508475 | 0.615385 | 118 |
| education | 0.847458 | 0.735294 | 0.787402 | 68 |
| environment | 0.589041 | 0.754386 | 0.661538 | 57 |
| health | 0.79661 | 0.79661 | 0.79661 | 59 |
| human interest | 0.552239 | 0.672727 | 0.606557 | 55 |
| labour | 0.855072 | 0.830986 | 0.842857 | 71 |
| lifestyle and leisure | 0.773585 | 0.476744 | 0.589928 | 86 |
| politics | 0.568182 | 0.735294 | 0.641026 | 68 |
| religion | 0.842105 | 0.941176 | 0.888889 | 51 |
| science and technology | 0.637681 | 0.8 | 0.709677 | 55 |
| society | 0.918033 | 0.5 | 0.647399 | 112 |
| sport | 0.824324 | 0.968254 | 0.890511 | 63 |
| weather | 0.953488 | 0.931818 | 0.942529 | 44 |

For downstream tasks, **we advise using only labels predicted with a confidence score
equal to or higher than 0.90, which further improves the performance**.

When instances predicted with lower confidence are removed (229 instances, 20% of the test set), the model yields a micro-F1 of 0.798 and a macro-F1 of 0.802.

| | Micro-F1 | Macro-F1 | Accuracy |
|:---|-----------:|-----------:|-----------:|
| All (combined) | 0.797777 | 0.802403 | 0.797777 |
| Croatian | 0.773504 | 0.772084 | 0.773504 |
| Catalan | 0.811224 | 0.806885 | 0.811224 |
| Slovenian | 0.805085 | 0.804491 | 0.805085 |
| Greek | 0.803419 | 0.809598 | 0.803419 |

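The confidence filtering can be applied directly to the pipeline output. A minimal sketch (the `results` list mimics the `text-classification` pipeline output format; the scores and the helper name `filter_by_confidence` are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.90

def filter_by_confidence(results, threshold=CONFIDENCE_THRESHOLD):
    """Keep only labels predicted at or above the threshold; mark the rest
    as None so they can be dropped or routed to manual review."""
    return [r["label"] if r["score"] >= threshold else None for r in results]

# Predictions in the format returned by the pipeline (scores are illustrative)
results = [
    {"label": "sport", "score": 0.9985},
    {"label": "politics", "score": 0.55},
]
print(filter_by_confidence(results))  # ['sport', None]
```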
## Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`.
Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:

```python
from simpletransformers.classification import ClassificationArgs

model_args = ClassificationArgs(
    num_train_epochs=5,
    learning_rate=8e-06,
    train_batch_size=32,
    max_seq_length=512,
)
```

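The warmup schedule recorded in the repository's `model_args.json` (`warmup_ratio` 0.06, `warmup_steps` 141) is consistent with these hyperparameters and the 15,000-instance training split; a quick back-of-the-envelope check (the rounding with `math.ceil` is an assumption about how the steps were derived):

```python
import math

num_examples = 15000    # EMMediaTopic training split
train_batch_size = 32
num_train_epochs = 5
warmup_ratio = 0.06     # from model_args.json

steps_per_epoch = math.ceil(num_examples / train_batch_size)  # 469
total_steps = steps_per_epoch * num_train_epochs              # 2345
warmup_steps = math.ceil(total_steps * warmup_ratio)          # 141
print(warmup_steps)
```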
## Citation

If you use the model, please cite [this paper](https://doi.org/10.1109/ACCESS.2025.3544814) and the model itself:

```
@ARTICLE{10900365,
  author={Kuzman, Taja and Ljubešić, Nikola},
  journal={IEEE Access},
  title={LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification},
  year={2025},
  pages={1-1},
  keywords={Data models;Annotations;Media;Manuals;Multilingual;Computational modeling;Training;Training data;Transformers;Text categorization;Multilingual text classification;IPTC;large language models;LLMs;news topic;topic classification;training data preparation;data annotation},
  doi={10.1109/ACCESS.2025.3544814}}

@misc{iptc_topic_model,
  author = {Kuzman, Taja and Ljube{\v s}i{\'c}, Nikola},
  title = {{Multilingual IPTC News Topic Classifier}},
  year = 2025,
  url = {https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier},
  doi = {10.57967/hf/4709},
  publisher = {Hugging Face}
}
```

## Funding

This work was supported by the Slovenian Research and Innovation Agency research project [Embeddings-based techniques for Media Monitoring Applications](https://emma.ijs.si/en/about-project/) (L2-50070, co-funded by the Kliping d.o.o. agency).
config.json ADDED
{
  "_name_or_path": "xlm-roberta-large",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "education",
    "1": "human interest",
    "2": "society",
    "3": "sport",
    "4": "crime, law and justice",
    "5": "disaster, accident and emergency incident",
    "6": "arts, culture, entertainment and media",
    "7": "politics",
    "8": "economy, business and finance",
    "9": "lifestyle and leisure",
    "10": "science and technology",
    "11": "health",
    "12": "labour",
    "13": "religion",
    "14": "weather",
    "15": "environment",
    "16": "conflict, war and peace"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "education": 0,
    "human interest": 1,
    "society": 2,
    "sport": 3,
    "crime, law and justice": 4,
    "disaster, accident and emergency incident": 5,
    "arts, culture, entertainment and media": 6,
    "politics": 7,
    "economy, business and finance": 8,
    "lifestyle and leisure": 9,
    "science and technology": 10,
    "health": 11,
    "labour": 12,
    "religion": 13,
    "weather": 14,
    "environment": 15,
    "conflict, war and peace": 16
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.42.4",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:357862230f8b370be969e8fdeecc58d3cc7776b71c7d74c3619609669c0b54d7
size 2239680172
model_args.json ADDED
{"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_betas": [0.9, 0.999], "adam_epsilon": 1e-08, "best_model_dir": "outputs/best_model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 100, "evaluate_during_training": false, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": true, "gradient_accumulation_steps": 1, "learning_rate": 8e-06, "local_rank": -1, "logging_steps": 50, "loss_type": null, "loss_args": {}, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 512, "model_name": "xlm-roberta-large", "model_type": "xlmroberta", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 5, "optimizer": "AdamW", "output_dir": "15k-model-v2", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 254, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": true, "save_model_every_epoch": false, "save_optimizer_and_scheduler": true, "save_steps": -1, "scheduler": "linear_schedule_with_warmup", "silent": true, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": "xlm-roberta-large", "tokenizer_type": null, "train_batch_size": 32, "train_custom_parameters_only": false, "trust_remote_code": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": false, "use_multiprocessing_for_evaluation": false, "wandb_kwargs": {}, "wandb_project": "IPTC", "warmup_ratio": 0.06, "warmup_steps": 141, "weight_decay": 0.0, "model_class": "ClassificationModel", "labels_list": ["education", "human interest", "society", "sport", "crime, law and justice", "disaster, accident and emergency incident", "arts, culture, entertainment and media", "politics", "economy, business and finance", "lifestyle and leisure", "science and technology", "health", "labour", "religion", "weather", "environment", "conflict, war and peace"], "labels_map": {"education": 0, "human interest": 1, "society": 2, "sport": 3, "crime, law and justice": 4, "disaster, accident and emergency incident": 5, "arts, culture, entertainment and media": 6, "politics": 7, "economy, business and finance": 8, "lifestyle and leisure": 9, "science and technology": 10, "health": 11, "labour": 12, "religion": 13, "weather": 14, "environment": 15, "conflict, war and peace": 16}, "lazy_delimiter": "\t", "lazy_labels_column": 1, "lazy_loading": false, "lazy_loading_start_line": 1, "lazy_text_a_column": null, "lazy_text_b_column": null, "lazy_text_column": 0, "onnx": false, "regression": false, "sliding_window": false, "special_tokens_list": [], "stride": 0.8, "tie_value": 1}
sentencepiece.bpe.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json ADDED
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3ffb37461c391f096759f4a9bbbc329da0f36952f88bab061fcf84940c022e98
size 17082999
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:bd99a34b1d5fd589a01c16d0ce04bca02b5b11815025e32be03b75b088a15264
size 3695