iPBL – Subject Heading Classification (HerBERT)
Overview
This model implements the subject heading assignment component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.
It supports bibliographic description of Polish web-based literary and cultural texts by assigning controlled subject heading sections aligned with the Polish Literary Bibliography (PBL) classification system.
The model predicts specific PBL subject heading sections, not general-purpose thematic categories.
Task Formulation
Single-label multi-class text classification.
Each document instance is assigned exactly one of the 14 PBL subject heading sections retained after frequency filtering.
Only subject headings with at least 100 occurrences in the dataset were included in the final supervised model.
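The frequency filter described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the `(text, label)` pair layout, the function name `filter_by_frequency`, and the toy data are all assumptions; only the threshold of 100 comes from this card.

```python
from collections import Counter

MIN_COUNT = 100  # threshold stated in this card


def filter_by_frequency(samples, min_count=MIN_COUNT):
    """Keep only (text, label) pairs whose label occurs at least min_count times."""
    counts = Counter(label for _, label in samples)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in samples if label in kept_labels]


# Toy data (hypothetical): one frequent label and one rare label.
samples = [("text", "2.14. Hasła osobowe")] * 120 + [("text", "rare section")] * 5
filtered = filter_by_frequency(samples)
print(len(filtered))
```

Filtering on label counts before splitting the data means every retained class has enough examples to appear in each split.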
Training Data
Raw subject heading annotations (before filtering): 17,678
After filtering (frequency ≥ 100): 15,185 samples
Final number of labels: 14 PBL subject heading sections
Data split:
- 70% Training
- 10% Validation
- 20% Test
Annotations originate from curated bibliographic work conducted within iPBL.
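A 70/10/20 split of the 15,185 filtered samples could look like the sketch below. The card does not say whether the original split was stratified or which random seed was used, so the plain shuffle, the seed, and the helper name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n, train=0.70, val=0.10, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # the remainder (about 20%) is the test set
    )


train_idx, val_idx, test_idx = split_indices(15185)
print(len(train_idx), len(val_idx), len(test_idx))
```

With classes as imbalanced as those in this dataset, a stratified split is a common alternative because it keeps the rare sections represented in every partition.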
Distribution of Retained Classes
| Subject heading section | Number of samples |
|---|---|
| 2.14. Hasła osobowe | 8399 |
| 4.4.9.1. W kraju | 1980 |
| 2.8.10.5. Nagrody | 913 |
| 3.9.11. Hasła osobowe | 795 |
| 2.8.10.2. Festiwale | 520 |
| 4.3. Hasła osobowe | 512 |
| 3.29.11. Hasła osobowe | 438 |
| 4.5.5. Filmy polskie | 394 |
| 2.8.10.4. Konkursy | 303 |
| 2.8.2. Życie literackie w ośrodkach | 270 |
| 3.55.11. Hasła osobowe | 241 |
| 4.4.6.3.2. Festiwale | 168 |
| 3.149.11. Hasła osobowe | 146 |
| 2.8.10.8. Spotkania autorskie | 106 |
Categories with fewer than 100 instances were excluded from the model.
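The counts in the table can be checked against the stated total, and they also give a reference point for reading the accuracy figure. The short sketch below sums the table and computes the share of the largest class. Whether the test split preserves this distribution is an assumption, so the resulting share is only an approximate majority-class baseline.

```python
# Sample counts copied from the table above.
class_counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda kv: kv[1])
majority_share = majority_count / total

print(total)
print(f"{majority_label}: {majority_share:.2%}")
```

The 14 counts add up to the 15,185 filtered samples, and the largest section alone accounts for a little over half of them.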
Base Model
- Base architecture: allegro/herbert-base-cased
- Model type: BertForSequenceClassification
- Tokenizer: HerbertTokenizerFast
- Number of labels: 14
Performance
Instance-level evaluation on the test set:
Overall Accuracy: 89.96%
Performance strongly correlates with category frequency.
Dominant categories (e.g., 2.14. Hasła osobowe) are predicted more reliably, while low-support categories are predicted less reliably.
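Because overall accuracy is dominated by the largest classes, per-class recall and its unweighted (macro) average are a useful complement. The sketch below shows the calculation on toy labels; the labels `"A"` and `"B"`, the predictions, and the helper name `per_class_recall` are hypothetical and are not results from this model.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's instances predicted correctly}."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / support[label] for label in support}


# Toy example (hypothetical): a frequent class "A" and a rare class "B".
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "B", "A"]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = sum(recall.values()) / len(recall)

print(recall, accuracy, macro_recall)
```

In the toy data the rare class is recalled only half the time, which pulls the macro average well below the accuracy; the same kind of gap can be expected whenever low-support sections are harder to predict.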
Interpretation
The presence of multiple “Hasła osobowe” sections reflects the internal hierarchical structure of PBL.
These represent distinct bibliographic classification contexts rather than redundant labels.
Model uncertainty should be read as analytically meaningful within domain-specific bibliographic indexing, not only as technical error.
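One way to act on this is to route ambiguous predictions to a bibliographer. The sketch below flags a prediction when the gap between the two highest scores is small. It assumes the input is a list of `{"label", "score"}` dictionaries, which is the shape the `transformers` text-classification pipeline returns when asked for all class scores; the function name `flag_for_review`, the 0.2 margin, and the example scores are hypothetical, and the label strings the model actually emits may differ.

```python
def flag_for_review(scores, min_margin=0.2):
    """Return (top_label, margin, needs_review) for one document's class scores."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    runner_up_score = ranked[1]["score"] if len(ranked) > 1 else 0.0
    margin = top["score"] - runner_up_score
    return top["label"], margin, margin < min_margin


# Hypothetical scores: two "Hasła osobowe" sections compete closely.
scores = [
    {"label": "2.14. Hasła osobowe", "score": 0.45},
    {"label": "3.9.11. Hasła osobowe", "score": 0.48},
    {"label": "4.3. Hasła osobowe", "score": 0.07},
]

label, margin, needs_review = flag_for_review(scores)
print(label, round(margin, 2), needs_review)
```

A close contest between two sections often points to a document that genuinely sits between classification contexts, so the margin is worth recording alongside the predicted label.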
How to Use
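from transformers import pipeline

# Load the classifier and its tokenizer from the Hugging Face Hub.
clf = pipeline(
    "text-classification",
    model="darekpe79/Subject_Heading_Classification",
    tokenizer="darekpe79/Subject_Heading_Classification",
)

# Polish sample input: "Article title. Article body..."
text = "Tytuł artykułu. Treść artykułu..."
clf(text)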