pipeline_tag: text-classification
---
# Book Genre Classification with BERT
> [!NOTE]
> This model was primarily trained and published for pedagogical purposes. It was not extensively engineered or optimized for performance.
> The notebook used to train this model can be found [here](https://github.com/d-noe/NLP_DH_PSL_Fall2025/blob/main/code/3_supervised/Tutorial_3_BGC.ipynb).
This model is an adapted version of [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased), fine-tuned on samples from the [Despina/project_gutenberg](https://huggingface.co/datasets/Despina/project_gutenberg) dataset.
The model is trained on five-sentence-long textual excerpts to determine the genre of the book they were extracted from, among four classes:
- 0: adventure stories
- 1: children's stories
- 2: detective and mystery stories
- 3: science fiction
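
The class-id-to-genre mapping above is the kind of `id2label` dictionary a Transformers classifier stores in its config. A minimal sketch of decoding predictions with it (the exact label strings in the released config are an assumption, not confirmed by this card):

```python
# Hypothetical id-to-genre mapping, mirroring the list above.
# The exact strings stored in the model's config are an assumption.
id2label = {
    0: "adventure stories",
    1: "children's stories",
    2: "detective and mystery stories",
    3: "science fiction",
}
label2id = {label: idx for idx, label in id2label.items()}

def decode_prediction(class_id: int) -> str:
    """Map a predicted class id back to its genre name."""
    return id2label[class_id]
```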
## Training Details
The data used for this experiment is a **corpus of five-sentence-long chunks extracted from fiction novels** sourced from [Project Gutenberg](https://www.gutenberg.org/). It is a filtered version of the dataset introduced in [Christou & Tsoumakas (2025)](https://aclanthology.org/2025.latechclfl-1.13/) (you can find the original dataset at [Despina/project_gutenberg](https://huggingface.co/datasets/Despina/project_gutenberg)).
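
As a rough illustration of how such five-sentence excerpts can be produced, here is a naive sketch with a simplistic regex sentence splitter. This is not the authors' preprocessing script, which handles the real corpus:

```python
import re

def chunk_into_excerpts(text: str, sentences_per_chunk: int = 5) -> list[str]:
    """Naively split a text into sentences, then group them into
    fixed-size excerpts. A real pipeline would use a proper sentence
    tokenizer; the regex split here is only illustrative."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```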
Each excerpt is (exclusively) **associated with one of four genres**: `adventure stories`, `children's stories`, `detective and mystery stories`, and `science fiction`.
Note that the samples taken from the [original dataset](https://huggingface.co/datasets/Despina/project_gutenberg) were purposefully filtered to be **balanced** across the four genres mentioned above. Moreover, the sampling script tried, as far as possible, to impose parity in the (binary) inferred gender of the authors; the resulting dataset nevertheless still leans heavily towards 'male'-written books for 'adventure stories' (88.5%) and 'science fiction' (92.7%).
You can find and reuse the script used to filter and sample the data [here](https://github.com/d-noe/NLP_DH_PSL_Fall2025/blob/main/data/preprocessing/prepro-books_chunks.py).
The filtered dataset used to train the model contains ~19.5K samples, further split into train (15,606), validation (1,975), and test (1,954) sets.
Note that the splits were made at the book level to avoid data leakage between the training and evaluation sets (i.e., excerpts from a given book never appear in two different splits).
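
A book-level split of this kind can be sketched as follows. This is a hypothetical helper, not the authors' actual splitting code: whole books are shuffled and assigned to exactly one split, so all excerpts from a book land together.

```python
import random

def split_by_book(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Assign whole books to train/validation/test so that no book's
    excerpts appear in two splits. `samples` is a list of
    (book_id, excerpt) pairs."""
    book_ids = sorted({book_id for book_id, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(book_ids)
    n_train = int(train_frac * len(book_ids))
    n_val = int(val_frac * len(book_ids))
    # Each book id maps to exactly one split.
    split_of = {}
    for i, book_id in enumerate(book_ids):
        if i < n_train:
            split_of[book_id] = "train"
        elif i < n_train + n_val:
            split_of[book_id] = "validation"
        else:
            split_of[book_id] = "test"
    splits = {"train": [], "validation": [], "test": []}
    for book_id, excerpt in samples:
        splits[split_of[book_id]].append((book_id, excerpt))
    return splits
```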
## Performance
The performance of the fine-tuned model on the test set is as follows:
| Class | Precision | Recall | F1-Score | Support |
|-------------------------------|-----------|--------|----------|---------|
| Adventure Stories | 0.58 | 0.47 | 0.52 | 444 |
| Children's Stories | 0.68 | 0.80 | 0.74 | 490 |
| Detective and Mystery Stories | 0.59 | 0.70 | 0.64 | 440 |
| Science Fiction | 0.83 | 0.71 | 0.76 | 580 |
| **Accuracy** | | | 0.68 | 1954 |
| **Macro Avg** | 0.67 | 0.67 | 0.66 | 1954 |
| **Weighted Avg** | 0.68 | 0.68 | 0.67 | 1954 |
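
For reference, the macro and weighted averages in the table follow directly from the per-class scores and supports; a quick check for precision:

```python
# Per-class precision and support, taken from the table above.
precision = [0.58, 0.68, 0.59, 0.83]
support = [444, 490, 440, 580]

# Macro average: unweighted mean over the four classes.
macro_precision = sum(precision) / len(precision)

# Weighted average: mean weighted by each class's support.
weighted_precision = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(round(macro_precision, 2), round(weighted_precision, 2))  # 0.67 0.68
```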
## Computational Resources
The model was trained for 4 epochs on an NVIDIA RTX 6000 GPU, taking about 6.5 minutes.