darekpe79/Subject_Heading


darekpe79 committed on Feb 24

Commit f71cd96 · verified · 1 Parent(s): 3a6fd94

readme.md

README.md +105 -0

	@@ -0,0 +1,105 @@

+# iPBL – Subject Heading Classification (HerBERT)
+## Overview
+This model implements the **subject heading assignment** component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.
+It supports bibliographic description of Polish web-based literary and cultural texts by assigning **controlled subject heading sections aligned with the Polish Literary Bibliography (PBL)** classification system.
+The model predicts specific **PBL subject heading sections**, not general-purpose thematic categories.
+---
+## Task Formulation
+The task is single-label, multi-class text classification.
+Each document is assigned to exactly one of the 14 PBL subject heading sections retained after frequency filtering.
+Only subject headings with at least **100 occurrences** in the dataset were kept for the final supervised model.
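The frequency filter described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `filter_by_label_frequency`, the `(text, label)` pair layout, and the toy data are assumptions made for the example. Only the threshold of 100 comes from the model card.

```python
from collections import Counter

MIN_OCCURRENCES = 100  # threshold stated in the model card


def filter_by_label_frequency(samples, min_count=MIN_OCCURRENCES):
    """Keep only (text, label) pairs whose label occurs at least min_count times.

    Hypothetical helper: the (text, label) layout is an assumption.
    """
    counts = Counter(label for _, label in samples)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in samples if label in kept_labels]


# Toy data (invented for illustration): one frequent and one rare label.
toy = [("tekst", "2.14. Hasła osobowe")] * 120 + [("tekst", "rare section")] * 5
filtered = filter_by_label_frequency(toy)
print(len(filtered))  # 120: the rare label's 5 samples are dropped
```

Filtering on raw counts before splitting keeps every retained class large enough to appear in all three data splits.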
+---
+## Training Data
+- Raw subject heading annotations (before filtering): **17,678**
+- After filtering (frequency ≥ 100): **15,185 samples**
+- Final number of labels: **14 PBL subject heading sections**
+
+Data split:
+- 70% Training
+- 10% Validation
+- 20% Test
+
+Annotations originate from curated bibliographic work conducted within iPBL.
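One way to produce a 70/10/20 split like the one listed above is a per-class (stratified) shuffle, sketched below with the standard library only. The model card does not say whether the split was stratified or which seed was used, so the stratification, the `seed` value, and the helper name `stratified_split` are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(samples, train=0.7, val=0.1, seed=42):
    """Split (text, label) pairs into train/validation/test per label.

    Hypothetical helper: the remainder after train and validation
    (about 20%) goes to the test split.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    splits = {"train": [], "validation": [], "test": []}
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits


# Toy data (invented): 100 samples of one label, 50 of another.
toy = [(f"a{i}", "A") for i in range(100)] + [(f"b{i}", "B") for i in range(50)]
parts = stratified_split(toy)
print({name: len(rows) for name, rows in parts.items()})
# {'train': 105, 'validation': 15, 'test': 30}
```

Splitting per label keeps low-support sections represented in the validation and test sets, which matters for a dataset this imbalanced.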
+---
+## Distribution of Retained Classes
+| Subject heading section | Number of samples |
+|-------------------------|------------------|
+| 2.14. Hasła osobowe | 8399 |
+| 4.4.9.1. W kraju | 1980 |
+| 2.8.10.5. Nagrody | 913 |
+| 3.9.11. Hasła osobowe | 795 |
+| 2.8.10.2. Festiwale | 520 |
+| 4.3. Hasła osobowe | 512 |
+| 3.29.11. Hasła osobowe | 438 |
+| 4.5.5. Filmy polskie | 394 |
+| 2.8.10.4. Konkursy | 303 |
+| 2.8.2. Życie literackie w ośrodkach | 270 |
+| 3.55.11. Hasła osobowe | 241 |
+| 4.4.6.3.2. Festiwale | 168 |
+| 3.149.11. Hasła osobowe | 146 |
+| 2.8.10.8. Spotkania autorskie | 106 |
+
+Categories with fewer than 100 instances were excluded from the model.
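The counts in the table can be checked against the stated total, and they also give a reference point for reading accuracy figures. The sketch below sums the 14 classes and computes the share of the largest class. Treating that share as a majority-class baseline assumes the test set follows roughly the same distribution, which the model card does not state.

```python
# Sample counts copied from the table of retained classes.
class_counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda kv: kv[1])
majority_share = majority_count / total
imbalance_ratio = majority_count / min(class_counts.values())

print(len(class_counts))             # 14
print(total)                         # 15185
print(majority_label)                # 2.14. Hasła osobowe
print(f"{majority_share:.2%}")       # 55.31%
print(f"{imbalance_ratio:.1f}")      # 79.2
```

Under that assumption, a classifier that always predicts the largest section would already be right on about 55% of instances.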
+---
+## Base Model
+- **Base architecture:** `allegro/herbert-base-cased`
+- **Model type:** `BertForSequenceClassification`
+- **Tokenizer:** `HerbertTokenizerFast`
+- **Number of labels:** 14
+---
+## Performance
+Instance-level evaluation on the test set:
+**Overall Accuracy: 89.96%**
+Per-class performance correlates strongly with category frequency: dominant categories (e.g., 2.14. Hasła osobowe, 8,399 samples) are predicted more reliably, while low-support categories (e.g., 2.8.10.8. Spotkania autorskie, 106 samples) are less robust.
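Because overall accuracy is dominated by the largest class, per-class recall and its unweighted (macro) average make the frequency effect visible. The sketch below is one way to compute them from gold and predicted labels. The helper names and the toy labels are invented for illustration and do not reproduce the reported evaluation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per gold label: correct predictions divided by class support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in support.items()}


# Toy labels (invented): a dominant class "A" and a low-support class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)

print(acc)           # 0.9
print(recalls)       # {'A': 1.0, 'B': 0.5}
print(macro_recall)  # 0.75
```

In the toy case accuracy is 0.9 while macro recall is 0.75, the kind of gap that a dominant class can hide.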
+---
+## Interpretation
+The presence of multiple “Hasła osobowe” sections reflects the internal hierarchical structure of PBL.
+These represent distinct bibliographic classification contexts rather than redundant labels.
+Within domain-specific bibliographic indexing, model uncertainty should be read as analytically meaningful rather than as purely technical error.
+---
+## How to Use
+```python
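+from transformers import pipeline
+
+# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
+clf = pipeline(
+    "text-classification",
+    model="darekpe79/Subject_Heading_Classification",
+    tokenizer="darekpe79/Subject_Heading_Classification"
+)
+
+# Polish input; in English: "Article title. Article body..."
+text = "Tytuł artykułu. Treść artykułu..."
+
+# truncation=True cuts inputs that exceed the model's maximum length.
+# The pipeline returns a list of dicts with "label" and "score" keys.
+print(clf(text, truncation=True))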