	# iPBL – Subject Heading Classification (HerBERT)

	## Overview

	This model implements the subject heading assignment component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.

	It supports bibliographic description of Polish web-based literary and cultural texts by assigning controlled subject heading sections aligned with the Polish Literary Bibliography (PBL) classification system.

	The model predicts specific PBL subject heading sections, not general-purpose thematic categories.

	---

	## Task Formulation

	Single-label multi-class text classification.

	Each document instance is assigned to one of the most frequent PBL subject heading sections retained after frequency filtering.

	Only subject headings with at least 100 occurrences in the dataset were included in the final supervised model.
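
The frequency filter described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name, the `(text, label)` record layout, and the toy data are assumptions, and only the threshold of 100 comes from this card.

```python
from collections import Counter

MIN_COUNT = 100  # threshold stated in this card


def filter_by_label_frequency(records, min_count=MIN_COUNT):
    """Keep only (text, label) pairs whose label occurs at least min_count times."""
    counts = Counter(label for _, label in records)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in records if label in kept_labels]


# Toy data (hypothetical, not drawn from the iPBL dataset):
# label "Y" has 99 occurrences and falls just below the threshold.
toy = [("a", "X")] * 120 + [("b", "Y")] * 99 + [("c", "Z")] * 100
filtered = filter_by_label_frequency(toy)
print(sorted({label for _, label in filtered}))
```

Counting first and filtering second keeps the rule a function of the full dataset, so the set of retained labels does not depend on record order.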

	---

	## Training Data

	Raw subject heading annotations (before filtering): 17,678

	After filtering (frequency ≥ 100): 15,185 samples

	Final number of labels: 14 PBL subject heading sections

	Data split:

	- 70% Training
	- 10% Validation
	- 20% Test
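
A 70/10/20 split of this kind could be produced along the following lines. The sketch assumes a plain shuffled split with a fixed seed. The card does not say whether the original split was stratified by label or which seed was used, so the helper name, the seed, and the resulting subset sizes are illustrative only.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle items and cut them into train/validation/test parts (70/10/20)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed value is an assumption
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 20%
    return train, val, test


# 15,185 is the filtered sample count reported in this card.
train, val, test = split_70_10_20(range(15185))
print(len(train), len(val), len(test))
```

With classes as imbalanced as those in this dataset, a stratified split would keep the small classes represented in every subset. That is a design option, not a documented property of this model.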

	Annotations originate from curated bibliographic work conducted within iPBL.

	---

	## Distribution of Retained Classes

	| Subject heading section | Number of samples |
	|-------------------------|------------------|
	| 2.14. Hasła osobowe | 8399 |
	| 4.4.9.1. W kraju | 1980 |
	| 2.8.10.5. Nagrody | 913 |
	| 3.9.11. Hasła osobowe | 795 |
	| 2.8.10.2. Festiwale | 520 |
	| 4.3. Hasła osobowe | 512 |
	| 3.29.11. Hasła osobowe | 438 |
	| 4.5.5. Filmy polskie | 394 |
	| 2.8.10.4. Konkursy | 303 |
	| 2.8.2. Życie literackie w ośrodkach | 270 |
	| 3.55.11. Hasła osobowe | 241 |
	| 4.4.6.3.2. Festiwale | 168 |
	| 3.149.11. Hasła osobowe | 146 |
	| 2.8.10.8. Spotkania autorskie | 106 |

	Categories with fewer than 100 instances were excluded from the model.
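
To make the imbalance concrete, the short script below recomputes the total and the share of the largest class from the counts in the table. The counts are copied from this card. The variable names are illustrative, and the majority-class share is only a rough reference point, since the class distribution of the test split is not reported here.

```python
# Sample counts per retained class, copied from the table in this card.
CLASS_COUNTS = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(CLASS_COUNTS.values())
largest_label, largest_count = max(CLASS_COUNTS.items(), key=lambda kv: kv[1])
smallest_count = min(CLASS_COUNTS.values())

majority_share = largest_count / total            # share of the largest class
imbalance_ratio = largest_count / smallest_count  # largest vs. smallest class

print(f"total samples: {total}")
print(f"largest class: {largest_label} ({majority_share:.1%})")
print(f"largest/smallest ratio: {imbalance_ratio:.1f}")
```

The total matches the 15,185 filtered samples reported under Training Data, and the largest class alone covers a little over half of them.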

	---

	## Base Model

	- Base architecture: `allegro/herbert-base-cased`
	- Model type: `BertForSequenceClassification`
	- Tokenizer: `HerbertTokenizerFast`
	- Number of labels: 14
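
For orientation, the configuration listed above corresponds to a setup like the one below. It is a sketch of one plausible way to initialize the base checkpoint for fine-tuning with the `transformers` Auto classes, not the training script used for this model. The helper name `load_base_model` is hypothetical, and the import sits inside the function so the snippet can be read or imported without downloading anything.

```python
BASE_CHECKPOINT = "allegro/herbert-base-cased"  # base architecture from this card
NUM_LABELS = 14                                 # number of retained PBL sections


def load_base_model():
    """Return (tokenizer, model) for a 14-way classification head on HerBERT.

    Calling this downloads the base checkpoint from the Hugging Face Hub.
    """
    # Imported lazily so that defining the function has no side effects.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=NUM_LABELS,  # adds a freshly initialized classification head
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_base_model()` yields an untrained classification head on top of the pretrained encoder. The published model is the result of fine-tuning such a head on the iPBL annotations.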

	---

	## Performance

	Instance-level evaluation on the test set:

	Overall Accuracy: 89.96%

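	Per-class performance tracks class frequency.
	Dominant categories (e.g., 2.14. Hasła osobowe, with 8,399 of 15,185 samples) are predicted more reliably, while low-support categories (e.g., 2.8.10.8. Spotkania autorskie, with 106 samples) are less robust. Because the largest class alone accounts for roughly 55% of the retained samples, overall accuracy is weighted heavily toward it.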
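
Since a single accuracy figure can hide weak classes, it may help to report per-class recall next to it. The sketch below shows both computations in plain Python. The labels and predictions are toy values chosen for illustration and are not results from this model.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per gold label: correct predictions divided by class support."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / support[label] for label in support}


# Toy example (hypothetical): a frequent class "A" and a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

print(accuracy(y_true, y_pred))          # 0.9
print(per_class_recall(y_true, y_pred))  # {'A': 1.0, 'B': 0.5}
```

In the toy case overall accuracy is 0.9 even though half of the rare class is misclassified, which is the pattern to watch for with low-support sections.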

	---

	## Interpretation

	The presence of multiple “Hasła osobowe” (personal headings) sections reflects the internal hierarchical structure of PBL.
	Each of them represents a distinct bibliographic classification context, so they are not redundant labels.

	Model uncertainty should therefore be treated as analytically meaningful within domain-specific bibliographic indexing, and not only as technical error: a text that the model places between two sections may be genuinely ambiguous from a cataloguing perspective.
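
One way to act on this is to route uncertain predictions to a human bibliographer. The sketch below assumes per-label scores in the `{"label": ..., "score": ...}` format that a `transformers` text-classification pipeline returns when asked for all labels (for example with `top_k=None`). The function name and both thresholds are hypothetical and would need tuning on validation data.

```python
def needs_review(scores, min_top=0.6, min_margin=0.2):
    """Flag a prediction for manual review when the model is not decisive.

    scores: list of {"label": str, "score": float} dicts for one document.
    min_top, min_margin: illustrative thresholds, not values from this card.
    """
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]["score"]
    runner_up = ranked[1]["score"] if len(ranked) > 1 else 0.0
    # Review if the best score is low or the two best labels are close.
    return top < min_top or (top - runner_up) < min_margin


# Toy score lists (hypothetical, not produced by the model).
confident = [
    {"label": "2.14. Hasła osobowe", "score": 0.93},
    {"label": "4.3. Hasła osobowe", "score": 0.05},
]
ambiguous = [
    {"label": "2.8.10.2. Festiwale", "score": 0.48},
    {"label": "4.4.6.3.2. Festiwale", "score": 0.45},
]

print(needs_review(confident))  # False
print(needs_review(ambiguous))  # True
```

The margin check targets the case discussed above, where two sections with similar names from different branches of the PBL hierarchy compete for the same text.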

	---

	## How to Use

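	```python
	from transformers import pipeline

	# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
	clf = pipeline(
	    "text-classification",
	    model="darekpe79/Subject_Heading_Classification",
	    tokenizer="darekpe79/Subject_Heading_Classification",
	)

	# Polish input text; the placeholder reads "Article title. Article body..."
	text = "Tytuł artykułu. Treść artykułu..."

	# Returns a list with the top predicted label and its score.
	print(clf(text))
	```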