lmoncla committed Commit bd3007d · verified · 1 Parent(s): 1108839

Update README.md

Files changed (1): README.md (+180 −2)
README.md CHANGED
```diff
@@ -17,7 +17,7 @@ widget:
 
 <!-- Provide a quick summary of what the model is/does. -->
 
-This model is designed to identify and classify Named Entity Recognition.
+This model is designed to identify and classify named entities (such as Spatial, Person, and MISC), nominal entities, spatial relations, and other relevant information such as geographic coordinates within French encyclopedic entries.
 It has been trained on the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).
 Dataset: [https://huggingface.co/datasets/GEODE/GeoEDdA](https://huggingface.co/datasets/GEODE/GeoEdDA)
 
@@ -26,7 +26,7 @@ Dataset: [https://huggingface.co/datasets/GEODE/GeoEDdA](https://huggingface.co/
 
 <!-- Provide a list of tags detected by the model. -->
 
-The NER detected by this model are:
+The tagset is as follows:
 - **NC-Spatial**: a common noun that identifies a spatial entity (nominal spatial entity), including natural features, e.g. `ville`, `la rivière`, `royaume`.
 - **NP-Spatial**: a proper noun identifying the name of a place (spatial named entity), e.g. `France`, `Paris`, `la Chine`.
 - **Relation**: a spatial relation, e.g. `dans`, `sur`, `à 10 lieues de`.
```

The remaining additions (+180 lines) are the new sections shown in full below.

## Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [Ludovic Moncla](https://ludovicmoncla.github.io) and Hédi Zeghidi, in the framework of the [GEODE](https://geode-project.github.io) project
- **Model type:** CamemBERT token classification
- **Repository:** [https://github.com/GEODE-project/ner-bert](https://github.com/GEODE-project/ner-bert)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0
- **Dataset:** [https://zenodo.org/records/10530177](https://zenodo.org/records/10530177)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model was trained entirely on French encyclopedic entries and will likely not perform well on text in other languages or other corpora.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline

# Run on GPU if one is available, otherwise on CPU.
device = 0 if torch.cuda.is_available() else -1

pipe = pipeline(
    "token-classification",
    model="GEODE/bert-base-french-cased-edda-ner",
    aggregation_strategy="simple",
    device=device,
)

content = "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44."

print(pipe(content))

# Output
[{'entity_group': 'Head', 'score': 0.9622438, 'word': 'ALBI', 'start': 2, 'end': 6},
 {'entity_group': 'Domain_mark', 'score': 0.9617155, 'word': 'Géog.', 'start': 9, 'end': 14},
 {'entity_group': 'NC_Spatial', 'score': 0.9631812, 'word': 'ville', 'start': 16, 'end': 21},
 {'entity_group': 'NP_Spatial', 'score': 0.969053, 'word': 'France', 'start': 25, 'end': 31},
 {'entity_group': 'NC_Spatial', 'score': 0.96325177, 'word': 'capitale', 'start': 33, 'end': 41},
 {'entity_group': 'NP_Spatial', 'score': 0.9679477, 'word': "l'Albigeois", 'start': 45, 'end': 56},
 {'entity_group': 'Relation', 'score': 0.9517819, 'word': 'dans', 'start': 58, 'end': 62},
 {'entity_group': 'NP_Spatial', 'score': 0.9682904, 'word': 'le haut Languedoc', 'start': 63, 'end': 80},
 {'entity_group': 'Relation', 'score': 0.9356177, 'word': 'sur', 'start': 92, 'end': 95},
 {'entity_group': 'NP_Spatial', 'score': 0.9690639, 'word': 'le Tarn', 'start': 96, 'end': 103},
 {'entity_group': 'Latlong', 'score': 0.97551537, 'word': 'Long. 19. 49. lat. 43. 55. 44', 'start': 105, 'end': 134}]
```

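With `aggregation_strategy="simple"`, the pipeline returns a plain list of dicts, so downstream filtering needs no extra tooling. A minimal post-processing sketch (the output above is hard-coded with rounded scores so it runs without downloading the model; `group_by_tag` and the 0.5 threshold are illustrative, not part of the model's API):

```python
# Pipeline output for the ALBI entry, hard-coded (scores rounded) from the
# example above so this sketch runs without the model.
entities = [
    {"entity_group": "Head", "score": 0.962, "word": "ALBI"},
    {"entity_group": "Domain_mark", "score": 0.962, "word": "Géog."},
    {"entity_group": "NC_Spatial", "score": 0.963, "word": "ville"},
    {"entity_group": "NP_Spatial", "score": 0.969, "word": "France"},
    {"entity_group": "NC_Spatial", "score": 0.963, "word": "capitale"},
    {"entity_group": "NP_Spatial", "score": 0.968, "word": "l'Albigeois"},
    {"entity_group": "Relation", "score": 0.952, "word": "dans"},
    {"entity_group": "NP_Spatial", "score": 0.968, "word": "le haut Languedoc"},
    {"entity_group": "Relation", "score": 0.936, "word": "sur"},
    {"entity_group": "NP_Spatial", "score": 0.969, "word": "le Tarn"},
    {"entity_group": "Latlong", "score": 0.976, "word": "Long. 19. 49. lat. 43. 55. 44"},
]

def group_by_tag(ents, min_score=0.5):
    """Bucket predicted words by entity group, dropping low-confidence spans."""
    grouped = {}
    for ent in ents:
        if ent["score"] >= min_score:
            grouped.setdefault(ent["entity_group"], []).append(ent["word"])
    return grouped

grouped = group_by_tag(entities)
print(grouped["NP_Spatial"])  # ['France', "l'Albigeois", 'le haut Languedoc', 'le Tarn']
print(grouped["Relation"])    # ['dans', 'sur']
```

The same pattern extracts, for instance, all spatial named entities of an entry for gazetteer matching.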
## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was trained on a set of 2,200 paragraphs randomly selected from 2,001 entries of the Encyclopédie.
All paragraphs are written in French and are distributed as follows among the Encyclopédie's knowledge domains:

| Knowledge domain | Paragraphs |
|---|:---:|
| Géographie | 1096 |
| Histoire | 259 |
| Droit Jurisprudence | 113 |
| Physique | 92 |
| Métiers | 92 |
| Médecine | 88 |
| Philosophie | 69 |
| Histoire naturelle | 65 |
| Belles-lettres | 65 |
| Militaire | 62 |
| Commerce | 48 |
| Beaux-arts | 44 |
| Agriculture | 36 |
| Chasse | 31 |
| Religion | 23 |
| Musique | 17 |

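As a sanity check, the domain counts above sum to the 2,200 sampled paragraphs, with geography making up roughly half of the corpus (values re-typed here for illustration):

```python
# Paragraph counts per knowledge domain, re-typed from the table above.
domains = {
    "Géographie": 1096, "Histoire": 259, "Droit Jurisprudence": 113,
    "Physique": 92, "Métiers": 92, "Médecine": 88, "Philosophie": 69,
    "Histoire naturelle": 65, "Belles-lettres": 65, "Militaire": 62,
    "Commerce": 48, "Beaux-arts": 44, "Agriculture": 36, "Chasse": 31,
    "Religion": 23, "Musique": 17,
}

total = sum(domains.values())
print(total)                                    # 2200
print(round(domains["Géographie"] / total, 3))  # 0.498
```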
The spans/entities were labelled by the project team; pre-labelling with early versions of the model was used to speed up the annotation process.
A train/validation/test split was used.
The validation and test sets each contain 200 paragraphs: 100 classified as 'Géographie' and 100 from other knowledge domains.
The splits have the following breakdown of tokens and spans/entities:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Paragraphs | 1,800 | 200 | 200 |
| Tokens | 132,398 | 14,959 | 13,881 |
| NC-Spatial | 3,252 | 358 | 355 |
| NP-Spatial | 4,707 | 464 | 519 |
| Relation | 2,093 | 219 | 226 |
| Latlong | 553 | 66 | 72 |
| NC-Person | 1,378 | 132 | 133 |
| NP-Person | 1,599 | 170 | 150 |
| NP-Misc | 948 | 108 | 96 |
| Head | 1,261 | 142 | 153 |
| Domain-Mark | 1,069 | 122 | 133 |

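The span counts above make the class imbalance easy to quantify; a small sketch (values re-typed for illustration) showing that spatial labels dominate the training split:

```python
# Span counts per split (train, validation, test), re-typed from the table above.
spans = {
    "NC-Spatial": (3252, 358, 355),
    "NP-Spatial": (4707, 464, 519),
    "Relation": (2093, 219, 226),
    "Latlong": (553, 66, 72),
    "NC-Person": (1378, 132, 133),
    "NP-Person": (1599, 170, 150),
    "NP-Misc": (948, 108, 96),
    "Head": (1261, 142, 153),
    "Domain-Mark": (1069, 122, 133),
}

train_total = sum(v[0] for v in spans.values())
print(train_total)  # 16860 annotated spans in the training set
most_common = max(spans, key=lambda k: spans[k][0])
print(most_common)  # NP-Spatial
```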
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

For full training details and results, please see the GitHub repository: [https://github.com/GEODE-project/ner-bert](https://github.com/GEODE-project/ner-bert)

### Evaluation

* Overall model performance (test set)

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| Overall | 90.1 | 93.7 | 91.9 |

* Model performance by entity (test set)

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| NC-Spatial | 91.6 | 95.3 | 93.4 |
| NP-Spatial | 95.9 | 95.5 | 95.7 |
| Relation | 89.4 | 94.7 | 91.9 |
| Latlong | 98.1 | 96.8 | 97.4 |
| NC-Person | 67.5 | 84.0 | 74.9 |
| NP-Person | 87.4 | 89.2 | 88.3 |
| NP-Misc | 72.4 | 76.6 | 74.4 |
| Head | 97.6 | 97.2 | 97.4 |
| Domain-mark | 99.2 | 100.0 | 99.6 |

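The reported F-scores are the harmonic mean of precision and recall; a quick check against two rows of the tables above:

```python
# F-score as the harmonic mean of precision and recall.
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_score(90.1, 93.7), 1))  # 91.9 (overall)
print(round(f_score(67.5, 84.0), 1))  # 74.9 (NC-Person)
```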
  ## Acknowledgement