---
license: mit
datasets:
- SaladTechnologies/fiction-ner-750m
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
---

# FABLE - Fiction Adapted BERT for Literary Entities

FABLE (Fiction Adapted BERT for Literary Entities) is a named-entity recognition (NER) model. It is based on the DeBERTa v3 architecture and has been fine-tuned on the [Fiction-NER-750M dataset](https://huggingface.co/datasets/SaladTechnologies/fiction-ner-750m) of literary texts to recognize entities such as characters, locations, and other relevant terms in fiction.

## Model Details

### Model Description

FABLE is a transformer-based model designed for named-entity recognition (NER) tasks in literary texts. It has been fine-tuned on a large dataset of fiction to accurately identify and classify entities such as characters, locations, and other relevant terms.

Entity labels use the [BIO tagging format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)): the first token of an entity is prefixed with `B-`, and tokens that continue that entity are prefixed with `I-`. For example, the tokens `Arthur, Funkleton` would be tagged `B-CHA, I-CHA`, indicating that both tokens belong to the same Character entity.
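To make the BIO scheme concrete, a tag sequence can be folded back into entity spans. The `bio_to_entities` helper below is a minimal illustrative sketch, not part of this repository:

```python
def bio_to_entities(tokens, tags):
    """Group a BIO-tagged token sequence into (label, text) entity spans."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # O tag (or a stray I-): close any open entity.
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        entities.append((current_label, " ".join(current_tokens)))
    return entities

print(bio_to_entities(
    ["Arthur", "Funkleton", "went", "to", "Brighton"],
    ["B-CHA", "I-CHA", "O", "O", "B-LOC"],
))
# [('CHA', 'Arthur Funkleton'), ('LOC', 'Brighton')]
```

The full label set used by this model is listed below.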
- `O` - Outside / Not a Named Entity
- `CHA` - Character
- `LOC` - Location
- `FAC` - Facility
- `OBJ` - Important Object
- `EVT` - Event
- `ORG` - Organization
- `MISC` - Other Named Entity

### Model Specifications

- **Developed by:** Shawn Rushefsky - [🤗](https://huggingface.co/shawnrushefsky) | [github](https://github.com/shawnrushefsky)
- **Funded by:** [Salad Technologies](https://salad.com)
- **Model type:** NER / Token Classification
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/deberta-v3-base

## Uses

This model is intended for the analysis of literary texts, such as novels and short stories, to identify and classify named entities.

## Bias, Risks, and Limitations

The training data comes from a diverse set of English-language narrative fiction spanning hundreds of years of authorship, and may include subject matter and phrasing that some readers will find offensive. Because much of the material comes from Project Gutenberg, white men writing before the civil rights movement are vastly disproportionately represented as authors. Additionally, contemporary commercial fiction is all but excluded due to licensing restrictions.

### Recommendations

Use at your own risk. This model is provided as-is, without warranty of any kind.
## How to Get Started with the Model

```python
from transformers import pipeline

pipe = pipeline("token-classification", model="SaladTechnologies/fable-base")
pipe("Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'")
```

Example output:

```python
[{'entity': 'B-CHA', 'score': np.float32(0.91116154), 'index': 1, 'word': '▁Alice', 'start': 0, 'end': 5},
 {'entity': 'B-FAC', 'score': np.float32(0.40558067), 'index': 15, 'word': '▁bank', 'start': 69, 'end': 74},
 {'entity': 'B-OBJ', 'score': np.float32(0.5218266), 'index': 33, 'word': '▁book', 'start': 142, 'end': 147},
 {'entity': 'B-OBJ', 'score': np.float32(0.5387561), 'index': 57, 'word': '▁book', 'start': 244, 'end': 249},
 {'entity': 'B-CHA', 'score': np.float32(0.91744995), 'index': 61, 'word': '▁Alice', 'start': 259, 'end': 265}]
```

## Training Details

### Training Data

The model was trained on the [Fiction-NER-750M dataset](https://huggingface.co/datasets/SaladTechnologies/fiction-ner-750m), which consists of 750 million tokens of annotated literary text from a variety of sources, including Project Gutenberg and other permissively licensed texts.

### Training Procedure

The model was trained for 1 epoch on 12 million examples, with a validation set of 1.2 million examples, using Focal Loss to address class imbalance.

#### Training Hyperparameters

See [train.ipynb](https://huggingface.co/SaladTechnologies/fable-base/blob/main/train.ipynb) for the full training code.

## Evaluation

The model achieves an F1 score of approximately 0.752 on the validation set. However, spot checks of its predictions on unseen texts suggest that it outperforms this metric, which may be depressed by inconsistencies in the training data annotations.
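The pipeline returns one prediction per sub-word token. To recover whole entity mentions, you can pass `aggregation_strategy="simple"` to the `pipeline` call (a standard `transformers` option), or merge contiguous `B-`/`I-` predictions yourself using their character offsets. The `merge_predictions` helper below is an illustrative sketch of the latter, not part of this repository:

```python
def merge_predictions(text, preds):
    """Merge contiguous B-/I- token predictions into whole entity mentions."""
    entities = []
    for p in preds:
        label = p["entity"][2:]  # strip the B-/I- prefix
        if p["entity"].startswith("I-") and entities and entities[-1]["label"] == label:
            # Continuation token: extend the previous entity's character span.
            entities[-1]["end"] = p["end"]
        else:
            entities.append({"label": label, "start": p["start"], "end": p["end"]})
    for e in entities:
        # Recover the surface text from the original string.
        e["text"] = text[e["start"]:e["end"]].strip()
    return entities

# Hypothetical predictions in the same shape as the pipeline output above.
sample = "Arthur Funkleton walked into Brighton."
preds = [
    {"entity": "B-CHA", "start": 0, "end": 6},
    {"entity": "I-CHA", "start": 6, "end": 16},
    {"entity": "B-LOC", "start": 29, "end": 37},
]
print(merge_predictions(sample, preds))
```

This simple merge ignores scores and gaps between tokens; for production use, the built-in aggregation strategies in `transformers` handle more edge cases.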
## Environmental Impact

Carbon emissions estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 8x A100
- **Hours used:** 24 GPU hours
- **Cloud Provider:** [Salad Technologies](https://salad.com)
- **Carbon Emitted:** 2.22 kg CO2eq