iPBL – Subject Heading Classification (HerBERT)

Overview

This model implements the subject heading assignment component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.

It supports bibliographic description of Polish web-based literary and cultural texts by assigning controlled subject heading sections aligned with the Polish Literary Bibliography (PBL) classification system.

The model predicts specific PBL subject heading sections, not general-purpose thematic categories.

Task Formulation

Single-label multi-class text classification.

Each document is assigned exactly one label, chosen from the 14 PBL subject heading sections retained after frequency filtering.

Only subject headings with at least 100 occurrences in the dataset were included in the final supervised model.
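The frequency filter described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the `(text, label)` record format and the function name `filter_by_label_frequency` are assumptions made for the example.

```python
from collections import Counter

MIN_COUNT = 100  # threshold stated in this model card


def filter_by_label_frequency(records, min_count=MIN_COUNT):
    """Keep only records whose label occurs at least `min_count` times.

    ASSUMPTION: `records` is a list of (text, label) pairs; the real
    iPBL data format is not described in this card.
    """
    counts = Counter(label for _, label in records)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in records if label in kept_labels]


# Toy data: one frequent section and one rare section.
toy = [("a", "2.14. Hasła osobowe")] * 120 + [("b", "rare section")] * 5
filtered = filter_by_label_frequency(toy)
print(len(filtered), sorted({label for _, label in filtered}))
```

Counting is done over the whole dataset before any split, which matches the description here; whether the original pipeline filtered before or after splitting is not stated.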

Training Data

Raw subject heading annotations (before filtering): 17,678

After filtering (frequency ≥ 100): 15,185 samples (2,493 annotations in less frequent sections removed)

Final number of labels: 14 PBL subject heading sections

Data split:

70% Training
10% Validation
20% Test
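A 70/10/20 split of this kind could be produced as in the sketch below. It assumes a plain shuffled split with a fixed seed; the card does not say whether the original split was stratified by label or which seed was used, and the function name `split_70_10_20` is hypothetical.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle `items` and cut them into 70% / 10% / 20% parts.

    ASSUMPTION: a simple random split; the actual iPBL split procedure
    (e.g. stratification by label) is not documented in this card.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 10 // 100
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # remainder goes to the test set
    return train_set, val_set, test_set


train_set, val_set, test_set = split_70_10_20(range(15185))
print(len(train_set), len(val_set), len(test_set))
```

With 15,185 samples and this rounding, the parts contain 10,629, 1,518 and 3,038 items; the real split sizes may differ slightly depending on how rounding was handled.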

Annotations originate from curated bibliographic work conducted within iPBL.

Distribution of Retained Classes

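Subject heading section	English gloss	Number of samples
2.14. Hasła osobowe	Personal headings	8399
4.4.9.1. W kraju	In the country (domestic)	1980
2.8.10.5. Nagrody	Awards	913
3.9.11. Hasła osobowe	Personal headings	795
2.8.10.2. Festiwale	Festivals	520
4.3. Hasła osobowe	Personal headings	512
3.29.11. Hasła osobowe	Personal headings	438
4.5.5. Filmy polskie	Polish films	394
2.8.10.4. Konkursy	Competitions	303
2.8.2. Życie literackie w ośrodkach	Literary life in centres	270
3.55.11. Hasła osobowe	Personal headings	241
4.4.6.3.2. Festiwale	Festivals	168
3.149.11. Hasła osobowe	Personal headings	146
2.8.10.8. Spotkania autorskie	Author meetings	106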

Categories with fewer than 100 instances were excluded from the model.
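To make the class imbalance concrete, the short calculation below derives the total, the share of the largest section, and the ratio between the largest and smallest retained sections. The counts are copied from the table in this card; the majority share is computed over the whole filtered dataset, so the share within the test split may differ slightly.

```python
# Sample counts copied from the class distribution table in this card.
counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(counts.values())
majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])
majority_share = majority_count / total  # accuracy of always guessing this label
imbalance_ratio = majority_count / min(counts.values())

print(f"total samples:   {total}")
print(f"majority label:  {majority_label} ({majority_share:.1%})")
print(f"largest/smallest ratio: {imbalance_ratio:.1f}")
```

A classifier that always predicted the largest section would be right on roughly 55% of the filtered data, which is a useful reference point when reading an overall accuracy figure.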

Base Model

Base architecture: allegro/herbert-base-cased
Model type: BertForSequenceClassification
Tokenizer: HerbertTokenizerFast
Number of labels: 14
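For readers who want to reproduce a comparable setup, the sketch below shows how a 14-way classification head could be attached to the base checkpoint with the `transformers` Auto classes. It is an illustration under assumptions, not the authors' training script: the label order is taken from the frequency table and may not match the ids stored in the published checkpoint, and `load_base_for_finetuning` is a hypothetical helper that downloads weights when called.

```python
# Section labels copied from the class distribution table in this card.
# ASSUMPTION: this ordering (by frequency) is illustrative only; the ids
# stored in the published checkpoint may be assigned differently.
LABELS = [
    "2.14. Hasła osobowe",
    "4.4.9.1. W kraju",
    "2.8.10.5. Nagrody",
    "3.9.11. Hasła osobowe",
    "2.8.10.2. Festiwale",
    "4.3. Hasła osobowe",
    "3.29.11. Hasła osobowe",
    "4.5.5. Filmy polskie",
    "2.8.10.4. Konkursy",
    "2.8.2. Życie literackie w ośrodkach",
    "3.55.11. Hasła osobowe",
    "4.4.6.3.2. Festiwale",
    "3.149.11. Hasła osobowe",
    "2.8.10.8. Spotkania autorskie",
]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}


def load_base_for_finetuning():
    """Load HerBERT with a freshly initialised 14-way classification head.

    Requires the `transformers` package and network access, so it is
    defined here but not called.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "allegro/herbert-base-cased",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


print(len(LABELS), "labels")
```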

Performance

Instance-level evaluation on the test set:

Overall Accuracy: 89.96%

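Per-class performance tracks class frequency.
Dominant sections (e.g., 2.14. Hasła osobowe, 8,399 samples) are predicted more consistently, while low-support sections (e.g., 2.8.10.8. Spotkania autorskie, 106 samples) are less robust, so overall accuracy mainly reflects the largest classes.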
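The relation between class frequency and performance can be inspected by computing recall per section next to its support. The helper below is a minimal sketch in plain Python; the function name and the toy labels are illustrative, and no per-class results for this model are implied.

```python
from collections import defaultdict


def accuracy_and_per_class_recall(y_true, y_pred):
    """Return overall accuracy, per-class recall, and per-class support."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = defaultdict(int)
    support = defaultdict(int)
    for true_label, pred_label in zip(y_true, y_pred):
        support[true_label] += 1
        if true_label == pred_label:
            correct[true_label] += 1
    accuracy = sum(correct.values()) / len(y_true)
    recall = {label: correct[label] / support[label] for label in support}
    return accuracy, recall, dict(support)


# Toy example: a frequent class "A" and a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
accuracy, recall, support = accuracy_and_per_class_recall(y_true, y_pred)
for label in sorted(support, key=support.get, reverse=True):
    print(f"{label}: support={support[label]}, recall={recall[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In the toy data the rare class has much lower recall than the overall accuracy suggests, which is the pattern to look for in low-support sections.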

Interpretation

The label set contains six “Hasła osobowe” (personal headings) sections (2.14, 3.9.11, 3.29.11, 3.55.11, 3.149.11, 4.3), which reflects the internal hierarchical structure of PBL.
They represent distinct bibliographic classification contexts, not redundant labels.

Model uncertainty should not be read purely as technical error: within domain-specific bibliographic indexing it is analytically meaningful, and a low-confidence prediction can indicate a text that plausibly belongs to more than one section.
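One way to act on this is to route uncertain predictions to a bibliographer instead of accepting the top label automatically. The sketch below assumes the per-label scores come as a list of `{"label", "score"}` dictionaries, which is the shape the `transformers` text-classification pipeline returns when all scores are requested. The function name and both thresholds are hypothetical and would need tuning on validation data.

```python
def needs_review(scores, min_top=0.8, min_margin=0.2):
    """Return (top_label, flag) where flag is True if a human should check.

    ASSUMPTIONS: `min_top` and `min_margin` are illustrative values,
    not thresholds used by the iPBL team.
    """
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    # Margin between the best and second-best label (or the top score
    # itself when only one label is given).
    runner_up = ranked[1]["score"] if len(ranked) > 1 else 0.0
    margin = top["score"] - runner_up
    flagged = top["score"] < min_top or margin < min_margin
    return top["label"], flagged


confident = [
    {"label": "2.14. Hasła osobowe", "score": 0.95},
    {"label": "4.3. Hasła osobowe", "score": 0.03},
]
ambiguous = [
    {"label": "2.8.10.2. Festiwale", "score": 0.48},
    {"label": "4.4.6.3.2. Festiwale", "score": 0.45},
]
print(needs_review(confident))
print(needs_review(ambiguous))
```

The second example is flagged because two festival sections from different branches of the hierarchy receive almost the same score, the kind of case where the ambiguity lies in the classification context rather than in the model.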

How to Use

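from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
clf = pipeline(
    "text-classification",
    model="darekpe79/Subject_Heading_Classification",
    tokenizer="darekpe79/Subject_Heading_Classification"
)

# Polish input text: "Article title. Article body..."
text = "Tytuł artykułu. Treść artykułu..."

# Prints the top predicted PBL section with its score.
print(clf(text))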
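For several documents at once, or for articles longer than the model's input limit, a small wrapper such as the one below may be convenient. It is a sketch, not part of the published model: `classify_batch` is a hypothetical helper that accepts any pipeline-like callable. `top_k` and `truncation` are keyword arguments accepted by the `transformers` text-classification pipeline; `truncation=True` is forwarded to the tokenizer so that long texts are cut to the maximum input length.

```python
def classify_batch(clf, texts, top_k=3):
    """Return the `top_k` (label, score) pairs for each text.

    `clf` is expected to behave like a transformers text-classification
    pipeline: called with a list of strings, it returns one list of
    {"label", "score"} dicts per input.
    """
    results = clf(list(texts), top_k=top_k, truncation=True)
    return [
        [(item["label"], round(item["score"], 4)) for item in per_text]
        for per_text in results
    ]


# Usage with the real model (requires `transformers` and network access):
#
#   from transformers import pipeline
#   clf = pipeline("text-classification",
#                  model="darekpe79/Subject_Heading_Classification")
#   for preds in classify_batch(clf, ["Tytuł artykułu. Treść artykułu..."]):
#       print(preds)
```

Returning several candidates per text makes it easier to see when two sections compete, which a single top label would hide.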