---
license: apache-2.0
datasets:
- SBB/BK-Training-Dataset
pipeline_tag: text-classification
tags:
- annif
- subject_indexing
- glam
- lam
---

# Model Card for bk-omikuji

An [Annif](https://annif.org/) model trained on bibliographic metadata for automatic subject indexing tasks. It classifies a given text, such as a title or description, into one or multiple subjects from the classification system *Nederlandse basisclassificatie* / [Basisklassifikation](http://uri.gbv.de/terminology/bk/) (BK). The model was developed in the research project [Human.Machine.Culture](https://mmk.sbb.berlin/?lang=en) at Staatsbibliothek zu Berlin – Berlin State Library (SBB).

Questions and comments about the model can be directed to Sophie Schneider at sophie.schneider@sbb.spk-berlin.de.

# Table of Contents

- [Model Card for bk-omikuji](#model-card-for-bk-omikuji)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Downstream Use](#downstream-use)
- [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Recommendations](#recommendations)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Speeds, Sizes, Times](#speeds-sizes-times)
- [Training hyperparameters](#training-hyperparameters)
- [Training results](#training-results)
- [Evaluation](#evaluation)
- [Testing Data, Factors and Metrics](#testing-data-factors-and-metrics)
- [Testing Data](#testing-data)
- [Metrics](#metrics)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Model Architecture and Objective](#model-architecture-and-objective)
- [Software](#software)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

An [Annif](https://annif.org/) model trained on title/catalogue metadata for automatic subject indexing tasks. Subject indexing is a classical library task that aims to describe the content of a resource. The model is intended to automatically classify titles or texts that have not yet been classified manually. For each input, the model outputs one or multiple subjects from the [BK](http://uri.gbv.de/terminology/bk/) classification system. It is part of a collection of three models, created with the Annif toolkit, that address the task of automated subject indexing.

- **Developed by:** [Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de)
- **Shared by:** [Staatsbibliothek zu Berlin / Berlin State Library](https://huggingface.co/SBB)
- **Model type:** tree-based
- **Language(s) (NLP):** multilingual
- **License:** apache-2.0

# Uses

## Direct Use

This model can be used directly to automatically classify texts with the BK classification scheme. It is intended to be used together with the Annif automated subject indexing toolkit, version 1.1.0.
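As a minimal usage sketch: assuming Annif 1.1.0 is installed and this model is registered under the hypothetical project id `bk-omikuji`, a title can be piped to `annif suggest`.

```
# Sketch only: requires a working Annif 1.1.0 installation with this model
# set up as the project "bk-omikuji" (the project id is an assumption).
echo "Einführung in die Quantenfeldtheorie" | annif suggest bk-omikuji
```

Each output line contains a suggested BK class URI, its label and a confidence score.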
## Downstream Use

Other or downstream uses outside of the Annif setting described above are not intended, but also not excluded.

## Out-of-Scope Use

The task of the model is primarily the subject indexing of bibliographic metadata. In principle, the model can be applied to any kind of text describing publications. However, it might not be suitable for colloquial texts from the internet.

# Bias, Risks, and Limitations

The BK was introduced in the 1980s, and its structure therefore partly reflects outdated ways of thinking. The BK is biased in itself; for some examples, see [this presentation](https://verbundkonferenz.gbv.de/wp-content/uploads/2024/09/2024-08-20_VK_beckmann_Kunst-oder-Krempel-Potenziale-der-Basisklassifikation.pdf), slide 17. Typical biases include a binary understanding of gender roles and outmoded designators for geographical regions or social movements. The classification system is under constant revision in order to keep it up to date. Nevertheless, the classes suggested for an input text might not be suitable for today's understanding and might not conform to contemporary values.

## Recommendations

This BK model has been trained solely on title data. We recommend being aware of this limitation and, if available, using additional training data such as full texts or tables of contents to enhance the existing model (e.g. by running `annif learn`, or by combining models trained on different kinds of training data into an ensemble model).

# Training Details

## Training Data

The **vocabulary file** for the BK was downloaded directly from [https://api.dante.gbv.de/export/download/bk/default/](https://api.dante.gbv.de/export/download/bk/default/) (`bk__default.turtle.ttl` as of 05/2024).

For the **subject data**, several versions were created for each model, with slight adaptations.

A dataset was generated following these steps:
- downloading the `kxp-subjects.tsv.gz` of Voß, J., & Verbundzentrale des GBV. (2024). Normalized subject indexing data of K10plus library union catalog (2024-02-26) [Data set]. VZG. [https://doi.org/10.5281/zenodo.10933926](https://doi.org/10.5281/zenodo.10933926)
- cleaning this file so that it only contains PPNs together with their BK notations
- querying the corresponding titles (Pica3 field [4000](https://swbtools.bsz-bw.de/cgi-bin/k10plushelp.pl?cmd=kat&val=4000), subfields a and d) for all PPNs from the previous step via [unapi](http://unapi.k10plus.de/)
- merging the subject and title data based on PPN and filtering out duplicates (identical titles)

This led to a dataset of 6,114,950 entries overall, split into an 80% training subset and 10% each for the test and validation subsets.

The training dataset has been published on [Zenodo](https://doi.org/10.5281/zenodo.15690227).
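The merge-and-deduplicate step above can be sketched with standard Unix tools; the file names, toy rows and column layout below are illustrative assumptions, not the original pipeline:

```shell
# Toy input: subjects.tsv maps PPN -> BK notation, titles.tsv maps PPN -> title.
printf '010000011\t17.73\n010000023\t44.40\n010000035\t17.73\n' > subjects.tsv
printf '010000011\tLinguistik heute\n010000023\tInnere Medizin\n010000035\tLinguistik heute\n' > titles.tsv

# join requires sorted input; merge on PPN (field 1).
sort subjects.tsv > subjects.sorted.tsv
sort titles.tsv > titles.sorted.tsv

# Keep only the first row for each title (field 3 after the join).
join -t "$(printf '\t')" subjects.sorted.tsv titles.sorted.tsv \
  | awk -F '\t' '!seen[$3]++' > dataset.tsv

cat dataset.tsv   # two rows survive; the row with the duplicate title is dropped
```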
## Training Procedure

The training procedure consists of loading the BK vocabulary into Annif and training the [Omikuji backend](https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji) on the respective training data. Further technical specifications can be found in the section [Training hyperparameters](#training-hyperparameters).
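Sketched as Annif CLI calls (the corpus file name and the project id `bk-omikuji` are assumptions; this requires an Annif 1.1.0 installation with a matching projects.cfg):

```
# Load the BK vocabulary, then train the Omikuji project on the TSV corpus.
annif load-vocab bk bk__default.turtle.ttl
annif train bk-omikuji bk-train.tsv
```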
### Preprocessing

Besides the merging and transformation of the data described under [Training Data](#training-data), no further preprocessing (e.g. of the natural language) was performed during dataset creation. For training the model, we applied the [simple analyzer](https://github.com/NatLibFi/Annif/wiki/Analyzers#simple-analyzer) provided by Annif.
### Speeds, Sizes, Times

Training takes from several minutes to an hour on a 2.8 GHz CPU with 32 cores, depending on the choice of dataset and algorithm as well as on the hyperparameter settings.
### Training hyperparameters

```
name=BK Omikuji
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
```
### Training results

Documents evaluated: 5000

* Precision (`--limit` 2, `--threshold` 0.05): 0.4664
* Recall (`--limit` 2, `--threshold` 0.05): 0.5382
* F1 (`--limit` 2, `--threshold` 0.05): 0.4775
* NDCG (`--limit` 2, `--threshold` 0.05): 0.5385
* F1@5: 0.3232
* NDCG@5: 0.6283

# Evaluation

## Testing Data, Factors and Metrics

### Testing Data

The dataset is described under [Training Data](#training-data). It was split into subsets used for training, testing and validation (80%/10%/10%).

### Metrics

Model performance was evaluated with the following metrics: Precision, Recall, F1 and NDCG. These are standard metrics for machine learning and, more specifically, for automatic subject indexing tasks, and they are provided directly in Annif via the `annif eval` command. The evaluation parameters (limit = maximum number of results to return; threshold = minimum confidence for a suggestion to be considered) were optimized beforehand on the test subset and chosen based on the best F1 score reached; they affect the final results accordingly. We also state the F1@5 and NDCG@5 scores reached without any evaluation parameters.
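The evaluation described above can be sketched as an `annif eval` call (the project id and test-file name are assumptions; this requires an Annif 1.1.0 installation with the model set up):

```
# Evaluate against the held-out test split with the optimized parameters.
annif eval --limit 2 --threshold 0.05 bk-omikuji bk-test.tsv
```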
# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100
- **Hours used:** 0.5-1 hour
- **Cloud Provider:** No cloud
- **Compute Region:** Germany
- **Carbon Emitted:** More information needed

# Technical Specifications

## Model Architecture and Objective

See the [Annif Wiki on the Omikuji algorithm](https://github.com/NatLibFi/Annif/wiki/Backend%3A-Omikuji).

## Software

To run this model, Annif version 1.1.0 must be installed.

# Model Card Authors

[Sophie Schneider](mailto:sophie.schneider@sbb.spk-berlin.de) and [Jörg Lehmann](mailto:joerg.lehmann@sbb.spk-berlin.de)

# Model Card Contact

Questions and comments about the model can be directed to Sophie Schneider at sophie.schneider@sbb.spk-berlin.de; questions and comments about the model card can be directed to Jörg Lehmann at joerg.lehmann@sbb.spk-berlin.de.

# How to Get Started with the Model

Follow the Annif [Getting Started](https://github.com/NatLibFi/Annif/wiki/Getting-started) page to set up and run Annif. Create a `projects.cfg` file (see section [Training hyperparameters](#training-hyperparameters) for details on the specific project configuration), load the BK vocabulary via the `annif load-vocab` command, and copy the model folder over to `data/projects`.
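A minimal `projects.cfg` sketch assembling the hyperparameters listed under [Training hyperparameters](#training-hyperparameters); the section name `bk-omikuji` is an assumption:

```ini
[bk-omikuji]
name=BK Omikuji
language=de
backend=omikuji
analyzer=simple
vocab=bk
cluster_balanced=False
cluster_k=100
max_depth=3
```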

Model Card as of October 20th, 2025