---
license: apache-2.0
datasets:
- SBB/BK-Training-Dataset
pipeline_tag: text-classification
tags:
- annif
- subject_indexing
- glam
- lam
---

# Model Card for bk-omikuji

An [Annif](https://annif.org/) model trained on bibliographic metadata for automatic subject indexing. It classifies a given text, such as a title or description, into one or more subjects from the classification system *Nederlandse basisclassificatie* / [Basisklassifikation](http://uri.gbv.de/terminology/bk/) (BK). The model was developed in the research project [Human.Machine.Culture](https://mmk.sbb.berlin/?lang=en) at Staatsbibliothek zu Berlin – Berlin State Library (SBB).

Questions and comments about the model can be directed to Sophie Schneider at sophie.schneider@sbb.spk-berlin.de.

# Table of Contents

- [Model Card for bk-omikuji](#model-card-for-bk-omikuji)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
  - [Recommendations](#recommendations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
    - [Training hyperparameters](#training-hyperparameters)
    - [Training results](#training-results)
- [Evaluation](#evaluation)
  - [Testing Data and Metrics](#testing-data-and-metrics)
    - [Testing Data](#testing-data)
    - [Metrics](#metrics)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture and Objective](#model-architecture-and-objective)
    - [Software](#software)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

An [Annif](https://annif.org/) model trained on title/catalogue metadata for automatic subject indexing. Subject indexing is a classical library task that aims to describe the content of a resource. The model is intended to automatically classify titles or texts that have not yet been classified manually. For each input, it outputs one or more subjects from the [BK](http://uri.gbv.de/terminology/bk/) classification system. It is part of a collection of three models, created with the Annif toolkit, that address the task of automated subject indexing.

- **Developed by:** [Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de)
- **Shared by [Optional]:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
- **Model type:** tree-based
- **Language(s) (NLP):** multilingual
- **License:** apache-2.0

# Uses

## Direct Use

This model can be used directly to classify texts automatically with the BK classification scheme. It is intended to be used with version 1.1.0 of the Annif automated subject indexing toolkit.

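As a sketch of direct use, a title can be classified from the command line; the project ID `bk-omikuji` is an assumption and must match the project entry in your `projects.cfg`:

```bash
# Suggest BK classes for a single title read from standard input;
# "bk-omikuji" is an assumed project ID from projects.cfg.
echo "Einführung in die Quantenmechanik" | annif suggest bk-omikuji

# Restrict the output to the two best suggestions above a confidence
# of 0.05, matching the evaluation parameters reported below.
echo "Einführung in die Quantenmechanik" | annif suggest --limit 2 --threshold 0.05 bk-omikuji
```
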
## Downstream Use

Other/downstream uses outside of the Annif setting described above are not intended, but they are also not excluded.

## Out-of-Scope Use

The model's primary task is the subject indexing of bibliographic metadata. In principle, it can be applied to any kind of text describing publications; however, it might not be suitable for colloquial texts from the internet.

# Bias, Risks, and Limitations

The BK was introduced in the 1980s, and its structure therefore partly reflects outdated ways of thinking. The BK is biased in itself; for some examples, see [https://verbundkonferenz.gbv.de/wp-content/uploads/2024/09/2024-08-20_VK_beckmann_Kunst-oder-Krempel-Potenziale-der-Basisklassifikation.pdf](https://verbundkonferenz.gbv.de/wp-content/uploads/2024/09/2024-08-20_VK_beckmann_Kunst-oder-Krempel-Potenziale-der-Basisklassifikation.pdf), slide 17. Typical biases include a binary understanding of gender roles and outmoded designations for geographical regions or social movements. The classification system is under constant revision to keep it up to date; nevertheless, the classes suggested for an input text may not match today's understanding and may not conform to contemporary values.

## Recommendations

This BK model has been trained solely on title data. We recommend being aware of this limitation and, where available, using additional training data such as full texts or tables of contents to enhance the existing model (e.g. by running `annif learn`, as sketched below, or by combining models trained on different kinds of training data into an ensemble model).

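As a sketch of the first option, assuming additional examples in Annif's TSV corpus format (the file name `fulltext-corpus.tsv` is hypothetical) and a backend configuration that supports incremental learning:

```bash
# Continue training the existing model on additional examples,
# e.g. full texts or tables of contents paired with BK notations.
# "fulltext-corpus.tsv" is a hypothetical TSV corpus file with one
# "<text> TAB <subject>" pair per line.
annif learn bk-omikuji fulltext-corpus.tsv
```
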
# Training Details

## Training Data

The **vocabulary file** for the BK was downloaded directly from [https://api.dante.gbv.de/export/download/bk/default/](https://api.dante.gbv.de/export/download/bk/default/) (`bk__default.turtle.ttl` as of 05/2024).

For the **subject data**, several versions were created for each model with slight adaptations.

A dataset was generated following these steps (a sketch of the title-query step follows below):
- downloading `kxp-subjects.tsv.gz` from Voß, J., & Verbundzentrale des GBV. (2024). Normalized subject indexing data of K10plus library union catalog (2024-02-26) [Data set]. VZG. [https://doi.org/10.5281/zenodo.10933926](https://doi.org/10.5281/zenodo.10933926)
- cleaning this file so that it only contains PPNs together with their BK notations
- querying the corresponding titles (Pica3 field [4000](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4000), subfields a and d) for all PPNs from the previous step via [unAPI](http://unapi.k10plus.de/)
- merging the subject and title data based on PPN and filtering out duplicates (identical titles)

This led to a dataset of 6,114,950 entries overall, split into 80% training, 10% test, and 10% validation subsets.

The training dataset has been published on [Zenodo](https://doi.org/10.5281/zenodo.15690227).

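A rough sketch of the title-query step; the unAPI identifier scheme (`gvk:ppn:<PPN>`) and the format name `picajson` are assumptions based on common K10plus usage, and the PPN shown is a placeholder:

```bash
# Fetch the catalogue record for one PPN via unAPI; the title
# (Pica3 field 4000) can then be extracted from the response.
curl "http://unapi.k10plus.de/?id=gvk:ppn:123456789&format=picajson"
```
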
## Training Procedure

The training procedure consists of loading the BK vocabulary into Annif and training the [Omikuji backend](https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji) on the respective training data, as sketched below. Further technical details can be found in the section [Training hyperparameters](#training-hyperparameters).

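A minimal sketch of these two steps, assuming the vocabulary file named under [Training Data](#training-data) and a hypothetical training corpus file `bk-train.tsv`:

```bash
# Load the BK vocabulary (SKOS Turtle file) under the vocab ID "bk",
# as referenced by the vocab= setting in the project configuration.
annif load-vocab bk bk__default.turtle.ttl

# Train the Omikuji project on the training split;
# "bk-train.tsv" is an assumed file name for the TSV training corpus.
annif train bk-omikuji bk-train.tsv
```
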
### Preprocessing

Besides merging and transforming the data as described under [Training Data](#training-data), no further preprocessing of natural language or similar was performed during dataset creation. For training the model, we applied the [simple analyzer](https://github.com/NatLibFi/Annif/wiki/Analyzers#simple-analyzer) provided by Annif.

### Speeds, Sizes, Times

Training takes from several minutes to an hour on a 2.8 GHz CPU with 32 cores, depending on the choice of dataset and algorithm as well as hyperparameter settings.

### Training hyperparameters

    name=BK Omikuji
    language=de
    backend=omikuji
    analyzer=simple
    vocab=bk
    cluster_balanced=False
    cluster_k=100
    max_depth=3

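These settings form the body of the project's section in `projects.cfg`. A sketch of a complete entry, where the section header (the project ID `bk-omikuji`) is an assumption:

```bash
# Append the project configuration shown above to projects.cfg;
# the section name "bk-omikuji" (the project ID) is an assumption.
cat >> projects.cfg <<'EOF'
[bk-omikuji]
name=BK Omikuji
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
EOF
```
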
### Training results

Documents evaluated: 5000

* Precision (`--limit` 2, `--threshold` 0.05): 0.4664
* Recall (`--limit` 2, `--threshold` 0.05): 0.5382
* F1 (`--limit` 2, `--threshold` 0.05): 0.4775
* NDCG (`--limit` 2, `--threshold` 0.05): 0.5385
* F1@5: 0.3232
* NDCG@5: 0.6283

# Evaluation

## Testing Data and Metrics

### Testing Data

The dataset is described under [Training Data](#training-data). It was split into subsets used for training, testing, and validation (80%/10%/10%).

### Metrics

Model performance has been evaluated with the following metrics: Precision, Recall, F1, and NDCG. These are standard metrics for machine learning, and more specifically for automatic subject indexing; Annif provides them directly via the `annif eval` command. The evaluation parameters (limit = maximum number of results to return; threshold = minimum confidence for a suggestion to be considered) were optimized beforehand on the test subset, chosen based on the best F1 score reached, and affect the final results accordingly. We also report the F1@5 and NDCG@5 scores reached without any evaluation parameters.

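A sketch of the corresponding evaluation call; the test corpus file name `bk-test.tsv` is an assumption:

```bash
# Evaluate the project on the held-out test split, using the
# optimized parameters reported under "Training results".
annif eval --limit 2 --threshold 0.05 bk-omikuji bk-test.tsv
```
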
# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100.
- **Hours used:** 0.5 to 1 hour.
- **Cloud Provider:** No cloud.
- **Compute Region:** Germany.
- **Carbon Emitted:** More information needed.

# Technical Specifications

## Model Architecture and Objective

See the [Annif wiki on the Omikuji algorithm](https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji).

### Software

To run this model, Annif version 1.1.0 must be installed.

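A sketch of the installation, assuming a Python environment with `pip`; the Omikuji backend is an optional Annif dependency:

```bash
# Install the pinned Annif version together with the Omikuji extra.
pip install "annif[omikuji]==1.1.0"
```
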
# Model Card Authors

[Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de) and [Jörg Lehmann](mailto:joerg.lehmann@sbb.spk-berlin.de)

# Model Card Contact

Questions and comments about the model can be directed to Sophie Schneider at sophie.schneider@sbb.spk-berlin.de; questions and comments about the model card can be directed to Jörg Lehmann at joerg.lehmann@sbb.spk-berlin.de.

# How to Get Started with the Model

Follow the Annif [Getting Started](https://github.com/NatLibFi/Annif/wiki/Getting-started) page to set up and run Annif. Create a `projects.cfg` file (see the section [Training hyperparameters](#training-hyperparameters) for details on the specific project configuration), load the BK vocabulary via the `annif load-vocab` command, and copy the model folder over to `data/projects`.

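Condensed into commands, these steps might look as follows; the project ID `bk-omikuji` and the local file names are assumptions, and `data/projects` is relative to Annif's working directory:

```bash
# Load the BK vocabulary under the vocab ID referenced in projects.cfg.
annif load-vocab bk bk__default.turtle.ttl

# Place the downloaded model folder where Annif expects trained projects.
cp -r bk-omikuji data/projects/

# Sanity check: classify a title read from standard input.
echo "Geschichte der Berliner Bibliotheken" | annif suggest bk-omikuji
```
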
Model Card as of October 20th, 2025