darekpe79 committed on
Commit
f71cd96
·
verified ·
1 Parent(s): 3a6fd94

readme.md

  1. README.md +105 -0
README.md ADDED
@@ -0,0 +1,105 @@
+ # iPBL – Subject Heading Classification (HerBERT)
+
+ ## Overview
+
+ This model implements the **subject heading assignment** component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.
+
+ It supports bibliographic description of Polish web-based literary and cultural texts by assigning **controlled subject heading sections aligned with the Polish Literary Bibliography (PBL)** classification system.
+
+ The model predicts specific **PBL subject heading sections**, not general-purpose thematic categories.
+
+ ---
+
+ ## Task Formulation
+
+ The task is single-label, multi-class text classification.
+
+ Each document is assigned exactly one of the 14 PBL subject heading sections that remain after frequency filtering.
+
+ Only subject headings with at least **100 occurrences** in the dataset were included in the final supervised model.
+
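The frequency filter described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' preprocessing code: the function name `filter_by_label_frequency` and the toy records are hypothetical, and only the threshold of 100 comes from the text.

```python
from collections import Counter

MIN_OCCURRENCES = 100  # threshold stated in the model card


def filter_by_label_frequency(records, min_count=MIN_OCCURRENCES):
    """Keep only (text, label) pairs whose label occurs at least min_count times."""
    counts = Counter(label for _, label in records)
    kept_labels = {label for label, n in counts.items() if n >= min_count}
    return [(text, label) for text, label in records if label in kept_labels]


# Toy data (hypothetical): one frequent label and one rare label.
records = (
    [("tekst a", "2.14. Hasła osobowe")] * 120
    + [("tekst b", "rzadka sekcja")] * 5
)
filtered = filter_by_label_frequency(records)
print(len(filtered), {label for _, label in filtered})
```

Counting first and filtering second keeps the rule explicit: a label is retained or dropped as a whole, so no class ends up in the training data with fewer examples than the threshold.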
+ ---
+
+ ## Training Data
+
+ Raw subject heading annotations (before filtering): **17,678**
+
+ After filtering (frequency ≥ 100): **15,185 samples**
+
+ Final number of labels: **14 PBL subject heading sections**
+
+ Data split:
+
+ - 70% Training
+ - 10% Validation
+ - 20% Test
+
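A 70/10/20 split like the one listed above could be produced as follows. This is a sketch under assumptions: the model card does not say whether the split was stratified by label or which random seed was used, so the example uses a plain seeded shuffle, and the helper name `split_70_10_20` is hypothetical. Exact split sizes in the original experiment may differ by rounding.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle items and cut them into train/validation/test parts (70/10/20)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = n * 7 // 10  # integer arithmetic avoids float rounding surprises
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, roughly 20%
    return train, val, test


# Indices stand in for the 15,185 filtered samples.
train, val, test = split_70_10_20(range(15185))
print(len(train), len(val), len(test))
```

With classes as imbalanced as the ones in this dataset, a stratified split is a common alternative because it keeps rare sections represented in every part.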
+ Annotations originate from curated bibliographic work conducted within iPBL.
+
+ ---
+
+ ## Distribution of Retained Classes
+
+ | Subject heading section | English gloss | Number of samples |
+ |-------------------------|---------------|------------------:|
+ | 2.14. Hasła osobowe | Personal entries | 8399 |
+ | 4.4.9.1. W kraju | Domestic (in the country) | 1980 |
+ | 2.8.10.5. Nagrody | Awards | 913 |
+ | 3.9.11. Hasła osobowe | Personal entries | 795 |
+ | 2.8.10.2. Festiwale | Festivals | 520 |
+ | 4.3. Hasła osobowe | Personal entries | 512 |
+ | 3.29.11. Hasła osobowe | Personal entries | 438 |
+ | 4.5.5. Filmy polskie | Polish films | 394 |
+ | 2.8.10.4. Konkursy | Competitions | 303 |
+ | 2.8.2. Życie literackie w ośrodkach | Literary life in local centres | 270 |
+ | 3.55.11. Hasła osobowe | Personal entries | 241 |
+ | 4.4.6.3.2. Festiwale | Festivals | 168 |
+ | 3.149.11. Hasła osobowe | Personal entries | 146 |
+ | 2.8.10.8. Spotkania autorskie | Author meetings | 106 |
+
+ Categories with fewer than 100 instances were excluded from the model.
+
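To put the class counts in perspective, the short calculation below derives the share of the largest class and the ratio between the largest and smallest class. The counts are copied from the table; the shares refer to the full filtered dataset, and the test split may differ slightly. The largest-class share is a useful reference point, since a classifier that always predicted that class would already reach that accuracy.

```python
# Sample counts per retained PBL section, copied from the table.
class_counts = {
    "2.14. Hasła osobowe": 8399,
    "4.4.9.1. W kraju": 1980,
    "2.8.10.5. Nagrody": 913,
    "3.9.11. Hasła osobowe": 795,
    "2.8.10.2. Festiwale": 520,
    "4.3. Hasła osobowe": 512,
    "3.29.11. Hasła osobowe": 438,
    "4.5.5. Filmy polskie": 394,
    "2.8.10.4. Konkursy": 303,
    "2.8.2. Życie literackie w ośrodkach": 270,
    "3.55.11. Hasła osobowe": 241,
    "4.4.6.3.2. Festiwale": 168,
    "3.149.11. Hasła osobowe": 146,
    "2.8.10.8. Spotkania autorskie": 106,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda kv: kv[1])
majority_share = majority_count / total  # accuracy of an always-majority baseline
imbalance_ratio = majority_count / min(class_counts.values())

print(f"total samples: {total}")
print(f"largest class: {majority_label} ({majority_share:.2%})")
print(f"largest/smallest ratio: {imbalance_ratio:.1f}")
```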
+ ---
+
+ ## Base Model
+
+ - **Base architecture:** `allegro/herbert-base-cased`
+ - **Model type:** `BertForSequenceClassification`
+ - **Tokenizer:** `HerbertTokenizerFast`
+ - **Number of labels:** 14
+
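The facts listed above can be checked against a downloaded checkpoint. The sketch below is an illustration only: `check_config` and `fetch_config` are hypothetical helper names, and it assumes the repository's configuration exposes the usual `architectures` and `id2label` fields, which the model card does not confirm. The demonstration at the end runs offline on a placeholder dictionary.

```python
# Facts listed in the "Base Model" section of the model card.
EXPECTED_ARCHITECTURE = "BertForSequenceClassification"
EXPECTED_NUM_LABELS = 14


def check_config(config):
    """Return a list of mismatches between a config dict and the model card."""
    problems = []
    architectures = config.get("architectures") or []
    if EXPECTED_ARCHITECTURE not in architectures:
        problems.append(f"unexpected architectures: {architectures}")
    n_labels = len(config.get("id2label") or {})
    if n_labels != EXPECTED_NUM_LABELS:
        problems.append(f"expected {EXPECTED_NUM_LABELS} labels, found {n_labels}")
    return problems


def fetch_config(repo_id="darekpe79/Subject_Heading_Classification"):
    """Download the checkpoint's config (needs network access and transformers)."""
    from transformers import AutoConfig  # lazy import keeps check_config usable offline
    return AutoConfig.from_pretrained(repo_id).to_dict()


# Offline demonstration with a hypothetical config (label names are placeholders).
example = {
    "architectures": ["BertForSequenceClassification"],
    "id2label": {i: f"LABEL_{i}" for i in range(14)},
}
print(check_config(example))
```

With network access, `check_config(fetch_config())` would run the same check against the published checkpoint.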
+ ---
+
+ ## Performance
+
+ Instance-level evaluation on the test set:
+
+ **Overall Accuracy: 89.96%**
+
+ Per-class performance correlates strongly with class frequency.
+ Dominant categories (e.g., 2.14. Hasła osobowe) are predicted more consistently, while low-support categories are less robust.
+
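Because overall accuracy is dominated by the largest classes, per-class metrics make the frequency effect visible. The following sketch computes precision, recall, F1 and support per label, plus macro-averaged F1, in plain Python. It is not the evaluation script behind the reported figure; the function name `per_class_report` and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return (accuracy, macro_f1, per-label metrics) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
    return accuracy, macro_f1, report


# Toy example: a model that always predicts the frequent class "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
accuracy, macro_f1, report = per_class_report(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

In the toy example accuracy stays high while macro F1 drops sharply, which is the pattern to watch for on the low-support sections.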
+ ---
+
+ ## Interpretation
+
+ The presence of multiple “Hasła osobowe” (personal entries) sections reflects the internal hierarchical structure of PBL.
+ These sections represent distinct bibliographic classification contexts rather than redundant labels.
+
+ Model uncertainty should be treated as analytically meaningful for domain-specific bibliographic indexing, not only as technical error.
+
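One way to act on this is to surface uncertain predictions for a bibliographer instead of accepting the top label silently. The sketch below assumes class scores are available as a list of `{"label", "score"}` dictionaries, the shape a `transformers` text-classification pipeline returns when all scores are requested (for example with `top_k=None`). The helper names and the margin threshold of 0.2 are hypothetical choices, not values from the model card.

```python
import math


def uncertainty(scores):
    """Summarize how decisive a score distribution is."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    margin = ranked[0]["score"] - ranked[1]["score"]  # gap between the top two labels
    entropy = -sum(s["score"] * math.log(s["score"]) for s in scores if s["score"] > 0)
    return {
        "top": ranked[0]["label"],
        "runner_up": ranked[1]["label"],
        "margin": margin,
        "normalized_entropy": entropy / math.log(len(scores)),  # 0 = certain, 1 = uniform
    }


def needs_review(scores, min_margin=0.2):
    """Flag a prediction for manual review (threshold is an illustrative assumption)."""
    return uncertainty(scores)["margin"] < min_margin


# Made-up scores: two "Hasła osobowe" sections are nearly tied.
ambiguous = [
    {"label": "2.14. Hasła osobowe", "score": 0.48},
    {"label": "3.9.11. Hasła osobowe", "score": 0.45},
    {"label": "4.3. Hasła osobowe", "score": 0.07},
]
confident = [
    {"label": "2.8.10.5. Nagrody", "score": 0.98},
    {"label": "2.8.10.4. Konkursy", "score": 0.01},
    {"label": "2.8.10.2. Festiwale", "score": 0.01},
]
print(needs_review(ambiguous), needs_review(confident))
```

A small margin between two sections of the PBL hierarchy suggests that the text plausibly belongs to more than one classification context, which is the kind of case the paragraph above treats as informative rather than as noise.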
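+ ---
+
+ ## How to Use
+
+ ```python
+ from transformers import pipeline
+
+ # Load the fine-tuned classifier and its tokenizer from the same repository.
+ clf = pipeline(
+     "text-classification",
+     model="darekpe79/Subject_Heading_Classification",
+     tokenizer="darekpe79/Subject_Heading_Classification",
+ )
+
+ # Example input (Polish for "Article title. Article body...").
+ text = "Tytuł artykułu. Treść artykułu..."
+
+ # truncation=True guards against inputs longer than the model's maximum length.
+ print(clf(text, truncation=True))
+ ```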