# iPBL – Literary Genre Classification (HerBERT)

## Overview

This model implements the **literary genre classification** component of the iPBL (Bibliography of Polish Digital Culture) system developed at the Institute of Literary Research of the Polish Academy of Sciences.

It assigns domain-specific literary form categories to Polish web-based cultural texts.
The model supports structured bibliographic description within a historically established classificatory framework derived from the Polish Literary Bibliography (PBL).

Unlike general-purpose genre classifiers, this model operates within a discipline-specific bibliographic regime.

---

## Task

Single-label multi-class text classification.

Each document is assigned one dominant literary genre category.
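
To illustrate what single-label classification means here, the sketch below turns one raw score per genre into probabilities with a softmax and keeps only the top-scoring genre. The `logits` values and the three-genre subset are invented for the example; the function names are hypothetical and not part of the model's code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_single_label(logits, labels):
    """Return the single most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical scores for three of the genres (illustrative values only).
labels = ["recenzja", "wiersz", "wywiad"]
logits = [2.1, 0.3, -1.0]
label, prob = predict_single_label(logits, labels)
print(label, round(prob, 3))
```

Only the winning genre is reported, so a text that mixes several forms still receives exactly one label.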

### Genres

Category labels are in Polish; English glosses are given in parentheses where the meaning is unambiguous.

- artykuł (article)
- esej (essay)
- felieton (column)
- kult
- nota (note)
- opowiadanie (short story)
- proza (prose)
- recenzja (review)
- wiersz (poem)
- wpis blogowy (blog post)
- wspomnienie (reminiscence)
- wywiad (interview)
- zgon

---

## Base Model

`allegro/herbert-base-cased`

Architecture: `BertForSequenceClassification`

---

## Training Data

The model was trained on curated bibliographic records produced in everyday bibliographic practice within the iPBL project.

- Raw samples: 17,731
- Final samples (after keeping only genres with at least 100 samples): 17,486

Data split:

- 70% training
- 10% validation
- 20% test

The dataset reflects real-world class imbalance typical of web-native literary discourse.
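
The frequency filter and the 70/10/20 split described above could look roughly like the following. This is a minimal sketch on toy data, not the project's actual preprocessing: the helper names, the `rare_label` class, and the seed are hypothetical, and the plain shuffled (non-stratified) split is an assumption, since the README does not say how the split was drawn.

```python
import random
from collections import Counter

def filter_rare_labels(records, min_count=100):
    """Keep only records whose label occurs at least `min_count` times."""
    counts = Counter(label for _, label in records)
    return [(text, label) for text, label in records if counts[label] >= min_count]

def split_70_10_20(records, seed=42):
    """Shuffle, then split into 70% train, 10% validation, 20% test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 20%
    return train, val, test

# Toy data: 150 "wiersz" records plus 5 records of a rare, made-up label.
records = [(f"tekst {i}", "wiersz") for i in range(150)]
records += [(f"rzadki {i}", "rare_label") for i in range(5)]

kept = filter_rare_labels(records, min_count=100)
train, val, test = split_70_10_20(kept)
print(len(kept), len(train), len(val), len(test))
```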

---

## Performance (Test Set)

- Accuracy: **85.13%**
- Weighted F1-score: **0.85**
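
For readers unfamiliar with the two metrics, here is a small sketch of how accuracy and a weighted F1-score are commonly computed. It assumes the usual support-weighted definition (each class's F1 weighted by its number of true examples); the README does not state which implementation produced the reported numbers, and the labels below are toy data, not model predictions.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true-label count."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[label] * n for label, n in support.items()) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only.
y_true = ["wiersz", "wiersz", "wiersz", "esej"]
y_pred = ["wiersz", "wiersz", "esej", "esej"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Because the weights are class frequencies, frequent genres dominate the weighted score, so per-genre F1 is a useful complement on an imbalanced dataset.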

### High-performing genres

| Genre    | F1-score |
|----------|----------|
| wiersz   | 0.94     |
| wywiad   | 0.94     |
| recenzja | 0.92     |
| artykuł  | 0.85     |

Lower performance is observed for structurally hybrid and low-resource genres (e.g., esej, nota).

---

## How to Use

### Standard Transformers Usage

```python
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="darekpe79/Literary_Genre_Classification",
    tokenizer="darekpe79/Literary_Genre_Classification",
)

# Polish input text ("Sample text of a literary article...").
text = "Przykładowy tekst artykułu literackiego..."

# Returns a list with one dict holding the predicted "label" and its "score".
print(classifier(text))
```