---
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: token-classification
---

# FABLE - Fiction Adapted BERT for Literary Entities

FABLE (Fiction Adapted BERT for Literary Entities) is a named-entity recognition (NER) model based on the DeBERTa v3 architecture. It has been fine-tuned on the [Fiction-NER-750M dataset](https://huggingface.co/datasets/SaladTechnologies/fiction-ner-750m) of literary texts to recognize entities such as characters, locations, and other relevant terms in fiction.

## Model Details

### Model Description

FABLE is a transformer-based model designed for named-entity recognition (NER) tasks in literary texts. It has been fine-tuned on a large dataset of fiction to accurately identify and classify entities such as characters, locations, and other relevant terms.

Entity labels are in [BIO tagging format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), meaning the beginning of an entity is prefixed with `B-`, and tokens which are a continuation of that entity are prefixed with `I-`.

For example, the tokens `Arthur, Funkleton` would be tagged `B-CHA, I-CHA`, indicating that both tokens belong to the same Character entity.

- `O` - Outside / Not a Named Entity
- `CHA` - Character
- `LOC` - Location
- `FAC` - Facility
- `OBJ` - Important Object
- `EVT` - Event
- `ORG` - Organization
- `MISC` - Other Named Entity

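The per-token BIO tags can be merged back into whole entity spans in post-processing. Here is a minimal sketch of that grouping logic (the tokens and tags are illustrative; in practice the `transformers` pipeline's `aggregation_strategy` option can perform this grouping for you):

```python
# Minimal sketch of merging BIO-tagged tokens into entity spans.
def merge_bio(tokens, tags):
    """Group a B- tag and its following I- tags of the same type into (type, text) spans."""
    spans = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            spans.append([tag[2:], [token]])
        elif tag.startswith("I-") and spans and spans[-1][0] == tag[2:]:
            spans[-1][1].append(token)
        # "O" tokens (and stray I- tags with no matching B-) are skipped
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Arthur", "Funkleton", "visited", "London"]
tags = ["B-CHA", "I-CHA", "O", "B-LOC"]
print(merge_bio(tokens, tags))  # [('CHA', 'Arthur Funkleton'), ('LOC', 'London')]
```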
- **Developed by:** Shawn Rushefsky - [🤗](https://huggingface.co/shawnrushefsky) | [github](https://github.com/shawnrushefsky)
- **Funded by:** [Salad Technologies](https://salad.com)
- **Model type:** NER / Token Classification
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/deberta-v3-base

## Uses

This model is intended to be used in the analysis of literary texts, such as novels and short stories, to identify and classify named entities.

## Bias, Risks, and Limitations

The training data comes from a diverse set of English-language narrative fiction spanning hundreds of years of authorship, and may include subject matter and phrasing that some readers will find offensive.
Much of the material from Project Gutenberg is old enough that white male authors from before the civil rights movement are vastly disproportionately represented.
Additionally, contemporary commercial fiction is all but excluded due to licensing restrictions.

### Recommendations

Use at your own risk. This model is provided as-is, without warranty of any kind.

## How to Get Started with the Model

```python
from transformers import pipeline

pipe = pipeline("token-classification", model="SaladTechnologies/fable-base")
pipe("Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'")
```

Example output:

```python
[{'entity': 'B-CHA',
  'score': np.float32(0.91116154),
  'index': 1,
  'word': '▁Alice',
  'start': 0,
  'end': 5},
 {'entity': 'B-FAC',
  'score': np.float32(0.40558067),
  'index': 15,
  'word': '▁bank',
  'start': 69,
  'end': 74},
 {'entity': 'B-OBJ',
  'score': np.float32(0.5218266),
  'index': 33,
  'word': '▁book',
  'start': 142,
  'end': 147},
 {'entity': 'B-OBJ',
  'score': np.float32(0.5387561),
  'index': 57,
  'word': '▁book',
  'start': 244,
  'end': 249},
 {'entity': 'B-CHA',
  'score': np.float32(0.91744995),
  'index': 61,
  'word': '▁Alice',
  'start': 259,
  'end': 265}]
```

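Confidence varies widely between entity types (0.91 for the `Alice` tags versus 0.41 for `bank` in the example output), so a simple score filter over the pipeline output can help downstream use. A sketch over data mirroring the example output (the 0.5 threshold is an arbitrary choice, not a tuned recommendation):

```python
# Predictions mirroring the example output above, with scores as plain floats.
predictions = [
    {"entity": "B-CHA", "score": 0.91116154, "word": "▁Alice"},
    {"entity": "B-FAC", "score": 0.40558067, "word": "▁bank"},
    {"entity": "B-OBJ", "score": 0.5218266, "word": "▁book"},
    {"entity": "B-OBJ", "score": 0.5387561, "word": "▁book"},
    {"entity": "B-CHA", "score": 0.91744995, "word": "▁Alice"},
]

def confident(preds, threshold=0.5):
    """Keep only predictions at or above a confidence threshold."""
    return [p for p in preds if p["score"] >= threshold]

print([p["word"] for p in confident(predictions)])
# ['▁Alice', '▁book', '▁book', '▁Alice']
```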

## Training Details

### Training Data

The model was trained on the [Fiction-NER-750M dataset](https://huggingface.co/datasets/SaladTechnologies/fiction-ner-750m), which consists of 750 million tokens of annotated literary text from a variety of sources, including Project Gutenberg and other permissively licensed texts.

### Training Procedure

The model was trained on 12 million examples, with a validation set of 1.2 million examples, for 1 epoch, using Focal Loss to address class imbalance.

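Focal loss down-weights easy, confidently classified examples so training focuses on hard ones, which is what makes it useful when one class (here the `O` tag) dominates. A minimal per-example sketch (the α/γ values are the common defaults from Lin et al., 2017, not necessarily those used to train FABLE; see the training notebook for the actual hyperparameters):

```python
import math

# Per-example focal loss (Lin et al., 2017):
#   FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t)
# where p_t is the model's predicted probability for the true class.
def focal_loss(p_t, alpha=0.25, gamma=2.0):
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, confident example is weighted far below a hard one:
print(focal_loss(0.9) < focal_loss(0.1))  # True
```

With `gamma=0` (and `alpha=1`) the expression reduces to plain cross-entropy; larger `gamma` suppresses easy examples more aggressively.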
#### Training Hyperparameters

See [train.ipynb](https://huggingface.co/SaladTechnologies/fable-base/blob/main/train.ipynb) for the full training code.

## Evaluation

The model achieves an F1 score of approximately 0.752 on the validation set. However, spot checks of the model's predictions on unseen texts suggest that it performs better than this number indicates; the metric may be depressed by inconsistencies in the training data annotations.

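For reference, F1 is the harmonic mean of precision and recall over predicted tags; a quick sketch with illustrative counts (not the model's actual confusion counts):

```python
# F1 as the harmonic mean of precision and recall.
# tp/fp/fn counts below are purely illustrative.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(tp=752, fp=248, fn=248), 3))  # 0.752
```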

## Environmental Impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 8x A100
- **Hours used:** 24 GPU hours
- **Cloud Provider:** [Salad Technologies](https://salad.com)
- **Carbon Emitted:** 2.22 kg CO2eq