Commit 5c8786b (verified, parent 6161ab8) by noepsl: Update README.md
pipeline_tag: text-classification
---
# Book Genre Classification with BERT

> [!NOTE]
> This model was primarily trained and published for pedagogical purposes. It was not extensively engineered, nor optimized for performance.
> The notebook used to train this model can be found [here](https://github.com/d-noe/NLP_DH_PSL_Fall2025/blob/main/code/3_supervised/Tutorial_3_BGC.ipynb).

This model is an adapted version of [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased), fine-tuned on samples from the [Despina/project_gutenberg](https://huggingface.co/datasets/Despina/project_gutenberg) dataset.
The model is trained on five-sentence-long textual excerpts to determine the genre of the book they were extracted from, among:
- 0: adventure stories
- 1: children's stories
- 2: detective and mystery stories
- 3: science fiction

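The integer ids above map one-to-one to genre names. A minimal decoding helper might look as follows; the `ID2LABEL` dict is written out here for illustration, and the `id2label` mapping shipped in the model's `config.json` remains authoritative.

```python
# Mapping between the classifier's integer labels and genre names,
# as listed in this model card.
ID2LABEL = {
    0: "adventure stories",
    1: "children's stories",
    2: "detective and mystery stories",
    3: "science fiction",
}

def decode_prediction(label_id: int) -> str:
    """Turn a predicted class id into a human-readable genre name."""
    return ID2LABEL[label_id]
```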
## Training Details

The data used for this experiment is a **corpus of five-sentence-long chunks extracted from fiction novels** sourced from [Project Gutenberg](https://www.gutenberg.org/). It is a filtered version of the dataset introduced in [Christou & Tsoumakas (2025)](https://aclanthology.org/2025.latechclfl-1.13/) (the original dataset is available at [Despina/project_gutenberg](https://huggingface.co/datasets/Despina/project_gutenberg)).

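Producing such chunks can be sketched as below. This is a simplified illustration using naive punctuation-based sentence segmentation; the linked preprocessing script is authoritative for how the actual corpus was built.

```python
import re

def five_sentence_chunks(text: str) -> list[str]:
    # Naive sentence segmentation on ., ! or ? followed by whitespace;
    # the actual preprocessing script may segment differently.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Group consecutive sentences into non-overlapping chunks of five,
    # dropping a trailing incomplete chunk.
    return [
        " ".join(sentences[i : i + 5])
        for i in range(0, len(sentences) - len(sentences) % 5, 5)
    ]
```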
Each excerpt is (exclusively) **associated with one of four genres**: `adventure stories`, `children's stories`, `detective and mystery stories`, and `science fiction`.

Note that the samples taken from the [original dataset](https://huggingface.co/datasets/Despina/project_gutenberg) were purposefully filtered to be **balanced** across the four genres mentioned above. Moreover, the sampling script tried, as much as possible, to impose parity between the (binary) inferred genders of the authors; the resulting dataset nonetheless still leans heavily towards 'male'-written books for 'adventure stories' (88.5%) and 'science fiction' (92.7%).
The script used to filter and sample the data can be found, and reused, [here](https://github.com/d-noe/NLP_DH_PSL_Fall2025/blob/main/data/preprocessing/prepro-books_chunks.py).

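A balanced per-genre subsample can be drawn along these lines. This is a hypothetical sketch, not the linked script; it assumes each genre has at least `n_per_genre` chunks available.

```python
import random
from collections import defaultdict

def balanced_sample(samples, n_per_genre, seed=0):
    """samples: iterable of (text, genre) pairs.
    Returns n_per_genre examples per genre, drawn uniformly at random."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for text, genre in samples:
        by_genre[genre].append((text, genre))
    balanced = []
    for genre, items in by_genre.items():
        balanced.extend(rng.sample(items, n_per_genre))
    return balanced
```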
The filtered dataset used to train the model contains ~19.5K samples, further split into train (15,606), validation (1,975) and test (1,954) sets.
Note that the splits were made at the book level in order to avoid data leakage between training and evaluation sets (i.e. excerpts from one book cannot appear in two different splits).

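A book-level split, assigning whole books rather than individual chunks to each set, can be sketched as follows. This is a hypothetical helper; the exact ratios and seeding used for this dataset live in the linked preprocessing script.

```python
import random

def book_level_split(chunks, train_frac=0.8, val_frac=0.1, seed=0):
    """chunks: list of (book_id, text) pairs.
    Splits by book so that no book contributes to two different splits."""
    books = sorted({book_id for book_id, _ in chunks})
    random.Random(seed).shuffle(books)
    n_train = int(train_frac * len(books))
    n_val = int(val_frac * len(books))
    train_books = set(books[:n_train])
    val_books = set(books[n_train : n_train + n_val])
    split = {"train": [], "validation": [], "test": []}
    for book_id, text in chunks:
        if book_id in train_books:
            split["train"].append(text)
        elif book_id in val_books:
            split["validation"].append(text)
        else:
            split["test"].append(text)
    return split
```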
## Performance

The performance of the fine-tuned model on the test set is as follows:

| Class | Precision | Recall | F1-Score | Support |
|-------------------------------|-----------|--------|----------|---------|
| Adventure Stories | 0.58 | 0.47 | 0.52 | 444 |
| Children's Stories | 0.68 | 0.80 | 0.74 | 490 |
| Detective and Mystery Stories | 0.59 | 0.70 | 0.64 | 440 |
| Science Fiction | 0.83 | 0.71 | 0.76 | 580 |
| **Accuracy** | | | 0.68 | 1954 |
| **Macro Avg** | 0.67 | 0.67 | 0.66 | 1954 |
| **Weighted Avg** | 0.68 | 0.68 | 0.67 | 1954 |

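As a sanity check, the macro and weighted F1 averages can be recomputed from the per-class rows (figures reproduced from the table above; the weighted average weights each class by its support):

```python
# Per-class F1 scores and supports, copied from the table above.
f1 = {"adventure": 0.52, "children": 0.74, "detective": 0.64, "scifi": 0.76}
support = {"adventure": 444, "children": 490, "detective": 440, "scifi": 580}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(f"macro F1 ~ {macro_f1:.3f}, weighted F1 ~ {weighted_f1:.3f}")
```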

## Computational Resources

The model was trained for 4 epochs on an NVIDIA RTX6000 GPU, taking about 6.5 minutes.