| # iPBL – Subject Heading Classification (HerBERT) |
|
|
| ## Overview |
|
|
| This model implements the **subject heading assignment** component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences. |
|
|
| It supports bibliographic description of Polish web-based literary and cultural texts by assigning **controlled subject heading sections aligned with the Polish Literary Bibliography (PBL)** classification system. |
|
|
| The model predicts specific **PBL subject heading sections**, not general-purpose thematic categories. |
|
|
| --- |
|
|
| ## Task Formulation |
|
|
| The task is single-label, multi-class text classification. |
|
|
| Each document instance is assigned exactly one of the 14 most frequent PBL subject heading sections, i.e. those retained after frequency filtering. |
|
|
| Only subject headings with at least **100 occurrences** in the dataset were included in the final supervised model. |
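The frequency filter described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the `(text, label)` pair layout, the function name `filter_by_frequency`, and the toy labels are assumptions made for the example. Only the threshold of 100 comes from this card.

```python
from collections import Counter

MIN_OCCURRENCES = 100  # threshold stated in this model card


def filter_by_frequency(samples, min_count=MIN_OCCURRENCES):
    """Keep only (text, label) pairs whose label occurs at least min_count times."""
    counts = Counter(label for _, label in samples)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in samples if label in kept_labels]


# Toy data with hypothetical labels and made-up counts (not the iPBL data)
toy = [("text a", "frequent_section")] * 120 + [("text b", "rare_section")] * 40
filtered = filter_by_frequency(toy)
print(len(filtered), sorted({label for _, label in filtered}))
```

Labels below the threshold are dropped together with all of their samples, which is why the sample count shrinks along with the label set.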
|
|
| --- |
|
|
| ## Training Data |
|
|
| Raw subject heading annotations (before filtering): **17,678** |
|
|
| After filtering (frequency ≥ 100): **15,185 samples** |
|
|
| Final number of labels: **14 PBL subject heading sections** |
|
|
| Data split: |
|
|
| - 70% Training |
| - 10% Validation |
| - 20% Test |
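A 70/10/20 split of this kind could be produced roughly as shown below. The card does not say whether the original split was stratified or which random seed was used, so the plain shuffle, the `seed` value, and the function name `split_70_10_20` are all assumptions made for illustration.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle and cut into 70% train, 10% validation, and the remainder (about 20%) test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (seed is hypothetical)
    n = len(items)
    n_train = n * 7 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Indices standing in for the 15,185 retained samples
train, val, test = split_70_10_20(range(15185))
print(len(train), len(val), len(test))
```

With 15,185 items this rounding scheme yields 10,629 / 1,518 / 3,038 samples. The actual split sizes are not reported in this card and may differ slightly.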
|
|
| Annotations originate from curated bibliographic work conducted within iPBL. |
|
|
| --- |
|
|
| ## Distribution of Retained Classes |
|
|
| | Subject heading section | Number of samples | |
| |-------------------------|------------------| |
| | 2.14. Hasła osobowe | 8399 | |
| | 4.4.9.1. W kraju | 1980 | |
| | 2.8.10.5. Nagrody | 913 | |
| | 3.9.11. Hasła osobowe | 795 | |
| | 2.8.10.2. Festiwale | 520 | |
| | 4.3. Hasła osobowe | 512 | |
| | 3.29.11. Hasła osobowe | 438 | |
| | 4.5.5. Filmy polskie | 394 | |
| | 2.8.10.4. Konkursy | 303 | |
| | 2.8.2. Życie literackie w ośrodkach | 270 | |
| | 3.55.11. Hasła osobowe | 241 | |
| | 4.4.6.3.2. Festiwale | 168 | |
| | 3.149.11. Hasła osobowe | 146 | |
| | 2.8.10.8. Spotkania autorskie | 106 | |
|
|
| Categories with fewer than 100 instances were excluded from the model. |
|
|
| --- |
|
|
| ## Base Model |
|
|
| - **Base architecture:** `allegro/herbert-base-cased` |
| - **Model type:** `BertForSequenceClassification` |
| - **Tokenizer:** `HerbertTokenizerFast` |
| - **Number of labels:** 14 |
|
|
| --- |
|
|
| ## Performance |
|
|
| Instance-level evaluation on the test set: |
|
|
| **Overall Accuracy: 89.96%** |
|
|
| Per-class performance tracks class frequency. |
| Dominant categories (e.g., 2.14. Hasła osobowe, with 8,399 of the 15,185 retained samples) are predicted more reliably, while low-support categories close to the 100-sample threshold are less robust. |
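Because the classes are heavily imbalanced, it helps to read the accuracy figure against a majority-class baseline. The sketch below computes that baseline from the class counts in the distribution table. It uses the counts of the full retained dataset, so the share in the test split is an approximation (assuming the split roughly preserves class proportions).

```python
# Class counts copied from the "Distribution of Retained Classes" table
counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(counts.values())
majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])

# Accuracy of a trivial classifier that always predicts the largest class
baseline_accuracy = majority_count / total
print(total, majority_label, f"{baseline_accuracy:.2%}")
```

The largest class alone covers a little over 55% of the retained samples, so the reported 89.96% accuracy should be compared with that figure rather than with chance level over 14 classes.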
|
|
| --- |
|
|
| ## Interpretation |
|
|
| The presence of multiple “Hasła osobowe” (personal-name headings) sections reflects the internal hierarchical structure of PBL. |
| They represent distinct bibliographic classification contexts rather than redundant labels. |
|
|
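| Model uncertainty should not be read purely as technical error: within domain-specific bibliographic indexing, a low-confidence prediction is often analytically meaningful, for example when a text could plausibly be placed in more than one section. |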
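One way to act on this in practice is to route low-confidence predictions to a human bibliographer instead of accepting them automatically. The sketch below is only an illustration of that idea: the function names, the toy logits, and the 0.6 threshold are hypothetical and are not part of the released model or the iPBL workflow.

```python
import math


def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def needs_review(logits, threshold=0.6):
    """Return (top class index, its probability, flag for manual review).

    The threshold is a hypothetical value and would need tuning on validation data.
    """
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return top, probs[top], probs[top] < threshold


# Toy logits over three classes (made up for illustration)
confident = needs_review([5.0, 0.0, 0.0])
ambiguous = needs_review([1.0, 0.9, 0.0])
print(confident[2], ambiguous[2])
```

Flagged items are exactly those where two or more sections receive comparable probability, which is where an indexer's judgement is most useful.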
|
|
| --- |
|
|
| ## How to Use |
|
|
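| ```python |
| from transformers import pipeline |
| |
| # Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub |
| clf = pipeline( |
|     "text-classification", |
|     model="darekpe79/Subject_Heading_Classification", |
|     tokenizer="darekpe79/Subject_Heading_Classification", |
| ) |
| |
| # Polish input, i.e. "Article title. Article body..." |
| text = "Tytuł artykułu. Treść artykułu..." |
| |
| # truncation=True cuts inputs that exceed the model's maximum sequence length |
| print(clf(text, truncation=True)) |
| ``` |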