lmoncla committed Commit bd3007d · verified · 1 Parent(s): 1108839

Update README.md

Files changed (1): README.md (+180 −2)
README.md CHANGED
```diff
@@ -17,7 +17,7 @@ widget:
 
 <!-- Provide a quick summary of what the model is/does. -->
 
-This model is designed to identify and classify Named Entity Recognition.
+This model is designed to identify and classify named entities (such as Spatial, Person, and MISC), nominal entities, spatial relations, and other relevant information such as geographic coordinates within French encyclopedic entries.
 It has been trained on the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).
 Dataset: [https://huggingface.co/datasets/GEODE/GeoEDdA](https://huggingface.co/datasets/GEODE/GeoEdDA)
 
@@ -26,7 +26,7 @@ Dataset: [https://huggingface.co/datasets/GEODE/GeoEDdA](https://huggingface.co/
 
 <!-- Provide a list of tags detected by the model. -->
 
-The NER detected by this model are:
+The tagset is as follows:
 - **NC-Spatial**: a common noun that identifies a spatial entity (nominal spatial entity), including natural features, e.g. `ville`, `la rivière`, `royaume`.
 - **NP-Spatial**: a proper noun identifying the name of a place (spatial named entity), e.g. `France`, `Paris`, `la Chine`.
 - **Relation**: a spatial relation, e.g. `dans`, `sur`, `à 10 lieues de`.
```

The remaining additions (+180 lines) are the new sections shown in full below.

## Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Ludovic Moncla](https://ludovicmoncla.github.io) and Hédi Zeghidi, in the framework of the [GEODE](https://geode-project.github.io) project
- **Model type:** CamemBERT token classification
- **Repository:** [https://github.com/GEODE-project/ner-bert](https://github.com/GEODE-project/ner-bert)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0
- **Dataset:** [https://zenodo.org/records/10530177](https://zenodo.org/records/10530177)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model was trained entirely on French encyclopedic entries and will likely not perform well on text in other languages or other corpora.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline

# Run on GPU if one is available, otherwise on CPU.
device = 0 if torch.cuda.is_available() else -1

pipe = pipeline(
    "token-classification",
    model="GEODE/bert-base-french-cased-edda-ner",
    aggregation_strategy="simple",
    device=device,
)

content = "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44."

print(pipe(content))

# Output
[{'entity_group': 'Head', 'score': 0.9622438, 'word': 'ALBI', 'start': 2, 'end': 6},
 {'entity_group': 'Domain_mark', 'score': 0.9617155, 'word': 'Géog.', 'start': 9, 'end': 14},
 {'entity_group': 'NC_Spatial', 'score': 0.9631812, 'word': 'ville', 'start': 16, 'end': 21},
 {'entity_group': 'NP_Spatial', 'score': 0.969053, 'word': 'France', 'start': 25, 'end': 31},
 {'entity_group': 'NC_Spatial', 'score': 0.96325177, 'word': 'capitale', 'start': 33, 'end': 41},
 {'entity_group': 'NP_Spatial', 'score': 0.9679477, 'word': "l'Albigeois", 'start': 45, 'end': 56},
 {'entity_group': 'Relation', 'score': 0.9517819, 'word': 'dans', 'start': 58, 'end': 62},
 {'entity_group': 'NP_Spatial', 'score': 0.9682904, 'word': 'le haut Languedoc', 'start': 63, 'end': 80},
 {'entity_group': 'Relation', 'score': 0.9356177, 'word': 'sur', 'start': 92, 'end': 95},
 {'entity_group': 'NP_Spatial', 'score': 0.9690639, 'word': 'le Tarn', 'start': 96, 'end': 103},
 {'entity_group': 'Latlong', 'score': 0.97551537, 'word': 'Long. 19. 49. lat. 43. 55. 44', 'start': 105, 'end': 134}]
```

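With `aggregation_strategy="simple"`, the pipeline returns a plain list of dicts, so downstream filtering needs no extra tooling. A minimal post-processing sketch (the output above is hard-coded with rounded scores so it runs without downloading the model; `group_by_tag` and the 0.5 threshold are illustrative, not part of the model's API):

```python
# Pipeline output for the ALBI entry, hard-coded (scores rounded) from the
# example above so this sketch runs without the model.
entities = [
    {"entity_group": "Head", "score": 0.962, "word": "ALBI"},
    {"entity_group": "Domain_mark", "score": 0.962, "word": "Géog."},
    {"entity_group": "NC_Spatial", "score": 0.963, "word": "ville"},
    {"entity_group": "NP_Spatial", "score": 0.969, "word": "France"},
    {"entity_group": "NC_Spatial", "score": 0.963, "word": "capitale"},
    {"entity_group": "NP_Spatial", "score": 0.968, "word": "l'Albigeois"},
    {"entity_group": "Relation", "score": 0.952, "word": "dans"},
    {"entity_group": "NP_Spatial", "score": 0.968, "word": "le haut Languedoc"},
    {"entity_group": "Relation", "score": 0.936, "word": "sur"},
    {"entity_group": "NP_Spatial", "score": 0.969, "word": "le Tarn"},
    {"entity_group": "Latlong", "score": 0.976, "word": "Long. 19. 49. lat. 43. 55. 44"},
]

def group_by_tag(ents, min_score=0.5):
    """Bucket predicted words by entity group, dropping low-confidence spans."""
    grouped = {}
    for ent in ents:
        if ent["score"] >= min_score:
            grouped.setdefault(ent["entity_group"], []).append(ent["word"])
    return grouped

grouped = group_by_tag(entities)
print(grouped["NP_Spatial"])  # ['France', "l'Albigeois", 'le haut Languedoc', 'le Tarn']
print(grouped["Relation"])    # ['dans', 'sur']
```

The same pattern extracts, for instance, all spatial named entities of an entry for gazetteer matching.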
## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was trained on a set of 2,200 paragraphs randomly selected from 2,001 entries of the Encyclopédie.
All paragraphs are written in French and are distributed as follows among the Encyclopédie's knowledge domains:

| Knowledge domain | Paragraphs |
|---|:---:|
| Géographie | 1096 |
| Histoire | 259 |
| Droit Jurisprudence | 113 |
| Physique | 92 |
| Métiers | 92 |
| Médecine | 88 |
| Philosophie | 69 |
| Histoire naturelle | 65 |
| Belles-lettres | 65 |
| Militaire | 62 |
| Commerce | 48 |
| Beaux-arts | 44 |
| Agriculture | 36 |
| Chasse | 31 |
| Religion | 23 |
| Musique | 17 |

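As a sanity check, the domain counts above sum to the 2,200 sampled paragraphs, with geography making up roughly half of the corpus (values re-typed here for illustration):

```python
# Paragraph counts per knowledge domain, re-typed from the table above.
domains = {
    "Géographie": 1096, "Histoire": 259, "Droit Jurisprudence": 113,
    "Physique": 92, "Métiers": 92, "Médecine": 88, "Philosophie": 69,
    "Histoire naturelle": 65, "Belles-lettres": 65, "Militaire": 62,
    "Commerce": 48, "Beaux-arts": 44, "Agriculture": 36, "Chasse": 31,
    "Religion": 23, "Musique": 17,
}

total = sum(domains.values())
print(total)                                    # 2200
print(round(domains["Géographie"] / total, 3))  # 0.498
```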
The spans/entities were labelled by the project team; pre-labelling with early versions of the model was used to speed up the annotation process.
A train/validation/test split was used.
The validation and test sets each contain 200 paragraphs: 100 classified as 'Géographie' and 100 from other knowledge domains.
The splits have the following breakdown of tokens and spans/entities:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Paragraphs | 1,800 | 200 | 200 |
| Tokens | 132,398 | 14,959 | 13,881 |
| NC-Spatial | 3,252 | 358 | 355 |
| NP-Spatial | 4,707 | 464 | 519 |
| Relation | 2,093 | 219 | 226 |
| Latlong | 553 | 66 | 72 |
| NC-Person | 1,378 | 132 | 133 |
| NP-Person | 1,599 | 170 | 150 |
| NP-Misc | 948 | 108 | 96 |
| Head | 1,261 | 142 | 153 |
| Domain-Mark | 1,069 | 122 | 133 |

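The span counts above make the class imbalance easy to quantify; a small sketch (values re-typed for illustration) showing that spatial labels dominate the training split:

```python
# Span counts per split (train, validation, test), re-typed from the table above.
spans = {
    "NC-Spatial": (3252, 358, 355),
    "NP-Spatial": (4707, 464, 519),
    "Relation": (2093, 219, 226),
    "Latlong": (553, 66, 72),
    "NC-Person": (1378, 132, 133),
    "NP-Person": (1599, 170, 150),
    "NP-Misc": (948, 108, 96),
    "Head": (1261, 142, 153),
    "Domain-Mark": (1069, 122, 133),
}

train_total = sum(v[0] for v in spans.values())
print(train_total)  # 16860 annotated spans in the training set
most_common = max(spans, key=lambda k: spans[k][0])
print(most_common)  # NP-Spatial
```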
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

For full training details and results, please see the GitHub repository: [https://github.com/GEODE-project/ner-bert](https://github.com/GEODE-project/ner-bert)

### Evaluation

* Overall model performance (test set)

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| Overall | 90.1 | 93.7 | 91.9 |

* Model performance by entity (test set)

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| NC-Spatial | 91.6 | 95.3 | 93.4 |
| NP-Spatial | 95.9 | 95.5 | 95.7 |
| Relation | 89.4 | 94.7 | 91.9 |
| Latlong | 98.1 | 96.8 | 97.4 |
| NC-Person | 67.5 | 84.0 | 74.9 |
| NP-Person | 87.4 | 89.2 | 88.3 |
| NP-Misc | 72.4 | 76.6 | 74.4 |
| Head | 97.6 | 97.2 | 97.4 |
| Domain-mark | 99.2 | 100.0 | 99.6 |

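The reported F-scores are the harmonic mean of precision and recall; a quick check against two rows of the tables above:

```python
# F-score as the harmonic mean of precision and recall.
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_score(90.1, 93.7), 1))  # 91.9 (overall)
print(round(f_score(67.5, 84.0), 1))  # 74.9 (NC-Person)
```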
  ## Acknowledgement