
iPBL – Subject Heading Classification (HerBERT)

Overview

This model implements the subject heading assignment component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.

It supports bibliographic description of Polish web-based literary and cultural texts by assigning controlled subject heading sections aligned with the Polish Literary Bibliography (PBL) classification system.

The model predicts specific PBL subject heading sections, not general-purpose thematic categories.


Task Formulation

Single-label multi-class text classification.

Each document is assigned exactly one of the 14 PBL subject heading sections retained after frequency filtering.

Only subject headings with at least 100 occurrences in the dataset were included in the final supervised model.
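
The frequency filter described above could be reproduced roughly as follows. This is a minimal sketch, not the project's actual preprocessing code: the function name, the (text, label) pair layout, and the toy data are assumptions made for illustration; only the threshold of 100 comes from this README.

```python
from collections import Counter

MIN_COUNT = 100  # threshold stated in this README


def filter_by_label_frequency(samples, min_count=MIN_COUNT):
    """Keep only (text, label) pairs whose label occurs at least min_count times."""
    counts = Counter(label for _, label in samples)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in samples if label in kept_labels]


# Toy data (hypothetical): one frequent label and one rare label.
toy = [("tekst a", "2.14. Hasła osobowe")] * 120 + [("tekst b", "rare section")] * 5
filtered = filter_by_label_frequency(toy)
print(len(filtered))  # 120
```

Labels that fall below the threshold are dropped together with their samples, which is why the sample count shrinks after filtering.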


Training Data

Raw subject heading annotations (before filtering): 17,678

After filtering (frequency ≥ 100): 15,185 samples

Final number of labels: 14 PBL subject heading sections

Data split:

  • 70% Training
  • 10% Validation
  • 20% Test
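
A 70/10/20 split like the one listed above can be sketched with the standard library alone. The README does not say whether the original split was random or stratified by label, so the plain shuffled split below, the function name, and the seed are assumptions.

```python
import random


def split_70_10_20(items, seed=42):
    """Shuffle items and cut them into train/validation/test parts (70/10/20)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical choice
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 20%
    return train, val, test


train, val, test = split_70_10_20(range(1000))
print(len(train), len(val), len(test))  # 700 100 200
```

With classes as imbalanced as those in this dataset, a stratified split would keep the rare sections represented in every part; that variant is not shown here.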

Annotations originate from curated bibliographic work conducted within iPBL.


Distribution of Retained Classes

Subject heading section                 Number of samples
2.14. Hasła osobowe                     8399
4.4.9.1. W kraju                        1980
2.8.10.5. Nagrody                       913
3.9.11. Hasła osobowe                   795
2.8.10.2. Festiwale                     520
4.3. Hasła osobowe                      512
3.29.11. Hasła osobowe                  438
4.5.5. Filmy polskie                    394
2.8.10.4. Konkursy                      303
2.8.2. Życie literackie w ośrodkach     270
3.55.11. Hasła osobowe                  241
4.4.6.3.2. Festiwale                    168
3.149.11. Hasła osobowe                 146
2.8.10.8. Spotkania autorskie           106

Categories with fewer than 100 instances were excluded from the model.
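
To make the class imbalance concrete, the counts listed above can be summarized in a few lines. The numbers are copied from this README; the variable names are illustrative only.

```python
# Sample counts per retained PBL section, as listed in this README.
class_counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(class_counts.values())
largest = max(class_counts.values())
smallest = min(class_counts.values())

majority_share = largest / total      # share of the most frequent section
imbalance_ratio = largest / smallest  # largest class vs. smallest class

print(total)                     # 15185
print(f"{majority_share:.1%}")   # 55.3%
print(f"{imbalance_ratio:.1f}")  # 79.2
```

If the test set follows roughly the same distribution (an assumption; the README does not report it), always predicting 2.14. Hasła osobowe would score about 55% accuracy, which is a useful reference point for the accuracy reported in the Performance section.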


Base Model

  • Base architecture: allegro/herbert-base-cased
  • Model type: BertForSequenceClassification
  • Tokenizer: HerbertTokenizerFast
  • Number of labels: 14
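
For readers who want to set up a comparable model before fine-tuning, the configuration above roughly corresponds to the sketch below. It assumes the transformers library is installed and that the checkpoint can be downloaded; the helper name build_model is hypothetical, and the training procedure itself is not described in this README.

```python
BASE_CHECKPOINT = "allegro/herbert-base-cased"  # base architecture named in this README
NUM_LABELS = 14                                 # number of retained PBL sections


def build_model():
    """Load the HerBERT tokenizer and a sequence classifier with a fresh 14-way head."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=NUM_LABELS,  # the classification head is randomly initialized
    )
    return tokenizer, model
```

Calling build_model() downloads the base weights; the resulting classifier is untrained and would still need fine-tuning on labeled data.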

Performance

Instance-level evaluation on the test set:

Overall Accuracy: 89.96%

Per-class performance tracks category frequency.
Dominant categories (e.g., 2.14. Hasła osobowe) are predicted more reliably, while low-support categories are less robust.
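
Because a single accuracy figure hides differences between frequent and rare sections, a per-class breakdown is worth computing on your own data. The sketch below is not the evaluation script behind the reported number; the function name and the toy labels "A" and "B" are hypothetical.

```python
from collections import defaultdict


def accuracy_and_per_class_recall(y_true, y_pred):
    """Return overall accuracy and recall for each gold label."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for gold, predicted in zip(y_true, y_pred):
        support[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    accuracy = sum(correct.values()) / len(y_true)
    recall = {label: correct[label] / support[label] for label in support}
    return accuracy, recall


# Toy example: a frequent class "A" and a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
acc, rec = accuracy_and_per_class_recall(y_true, y_pred)
print(acc)  # 0.9
print(rec)  # {'A': 1.0, 'B': 0.5}
```

In the toy example overall accuracy is 90% even though half of the rare class is misclassified, which is the pattern to watch for with low-support sections.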


Interpretation

The presence of multiple “Hasła osobowe” sections reflects the internal hierarchical structure of PBL.
These represent distinct bibliographic classification contexts rather than redundant labels.

Model uncertainty should be read as analytically meaningful in the context of domain-specific bibliographic indexing, not as a purely technical error.
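
One way to act on this in practice is to route uncertain predictions to a bibliographer instead of accepting the top label automatically. The sketch below works on a list of label/score dicts, the shape a transformers text-classification pipeline returns for one text when all scores are requested (in recent versions, via top_k=None). The function name, both thresholds, and the example scores are hypothetical.

```python
def review_decision(scores, min_confidence=0.8, min_margin=0.2):
    """Flag a prediction for manual review when the model is unsure.

    scores: list of {"label": str, "score": float} dicts for one text.
    The thresholds are illustrative defaults, not tuned values.
    """
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    runner_up_score = ranked[1]["score"] if len(ranked) > 1 else 0.0
    margin = top["score"] - runner_up_score
    needs_review = top["score"] < min_confidence or margin < min_margin
    return {
        "label": top["label"],
        "score": top["score"],
        "margin": margin,
        "needs_review": needs_review,
    }


# Made-up scores: the model hesitates between two "Hasła osobowe" sections.
ambiguous = [
    {"label": "2.14. Hasła osobowe", "score": 0.52},
    {"label": "3.9.11. Hasła osobowe", "score": 0.41},
    {"label": "4.3. Hasła osobowe", "score": 0.07},
]
confident = [
    {"label": "2.8.10.5. Nagrody", "score": 0.97},
    {"label": "2.8.10.4. Konkursy", "score": 0.02},
    {"label": "2.8.10.2. Festiwale", "score": 0.01},
]

print(review_decision(ambiguous)["needs_review"])  # True
print(review_decision(confident)["needs_review"])  # False
```

A small margin between two sections often signals a text that sits between classification contexts, which is the case where an indexer's judgment adds the most.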


How to Use

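from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
clf = pipeline(
    "text-classification",
    model="darekpe79/Subject_Heading_Classification",
    tokenizer="darekpe79/Subject_Heading_Classification",
)

# Input: title and body of a Polish web text, passed as a single string.
text = "Tytuł artykułu. Treść artykułu..."

# Truncate long inputs; 512 tokens is assumed here as the usual limit of BERT-style encoders.
result = clf(text, truncation=True, max_length=512)
print(result)  # a list with one dict holding the predicted "label" and its "score"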