---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
datasets:
- GEODE/GeoEDdA-TopoRel
---

# bert-base-multilingual-cased-classification-ner

This model classifies place names (spatial named entities) recognized in geographic encyclopedia articles into ten categories such as City, Country, Region or River.
It is a fine-tuned version of the `bert-base-multilingual-cased` model.
It was trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).

## Model Description

- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0

## Class labels

The model assigns one of the following ten labels to each place name:
- **City**
- **Country**
- **Human-made**
- **Island**
- **Lake**
- **Mountain**
- **Other**
- **Region**
- **River**
- **Sea**

## Dataset

The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation and test sets, with the following number of entries per class:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| City | 2,657 | 276 | 277 |
| Country | 1,544 | 239 | 169 |
| Human-made | 104 | 7 | 7 |
| Island | 554 | 81 | 109 |
| Lake | 69 | 15 | 11 |
| Mountain | 232 | 76 | 70 |
| Other | 235 | 47 | 39 |
| Region | 2,706 | 424 | 440 |
| River | 944 | 128 | 125 |
| Sea | 196 | 37 | 57 |

## Evaluation

* Overall weighted-average model performance

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| Weighted average | 0.84 | 0.84 | 0.84 |

* Per-class model performance (test set)

| | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| City | 0.82 | 0.88 | 0.85 | 277 |
| Country | 0.80 | 0.91 | 0.85 | 169 |
| Human-made | 0.50 | 0.71 | 0.59 | 7 |
| Island | 0.79 | 0.76 | 0.78 | 109 |
| Lake | 1.00 | 0.64 | 0.78 | 11 |
| Mountain | 0.81 | 0.73 | 0.77 | 70 |
| Other | 0.68 | 0.49 | 0.57 | 39 |
| Region | 0.89 | 0.85 | 0.87 | 440 |
| River | 0.87 | 0.90 | 0.88 | 125 |
| Sea | 0.96 | 0.93 | 0.95 | 57 |

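
To make the relation between the two tables concrete, the following sketch recomputes a support-weighted average from the per-class test-set scores above. It assumes the overall figures are plain support-weighted means of the per-class values; the dictionary and the `weighted_average` helper are illustrative names, not part of the model's code. Because the per-class scores are already rounded to two decimals, the result can only match the reported 0.84 approximately.

```python
# Per-class test-set scores copied from the table above:
# class -> (precision, recall, f_score, support)
per_class = {
    "City": (0.82, 0.88, 0.85, 277),
    "Country": (0.80, 0.91, 0.85, 169),
    "Human-made": (0.50, 0.71, 0.59, 7),
    "Island": (0.79, 0.76, 0.78, 109),
    "Lake": (1.00, 0.64, 0.78, 11),
    "Mountain": (0.81, 0.73, 0.77, 70),
    "Other": (0.68, 0.49, 0.57, 39),
    "Region": (0.89, 0.85, 0.87, 440),
    "River": (0.87, 0.90, 0.88, 125),
    "Sea": (0.96, 0.93, 0.95, 57),
}

# Total number of test examples across all classes
total_support = sum(row[3] for row in per_class.values())


def weighted_average(metric_index):
    """Average one metric over classes, weighting each class by its support."""
    weighted_sum = sum(row[metric_index] * row[3] for row in per_class.values())
    return weighted_sum / total_support


precision, recall, f_score = (weighted_average(i) for i in range(3))
print(f"support={total_support}")
print(f"precision={precision:.3f} recall={recall:.3f} f_score={f_score:.3f}")
```

With this weighting, frequent classes such as Region (440 test examples) dominate the overall score, while rare classes such as Human-made (7) and Lake (11) barely affect it.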

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline

# Pick the best available device: Apple MPS, then CUDA, then CPU
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Step 1: detect spans (including place names) in the text
ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
# Step 2: classify each detected place name
placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)


def get_context(text, span, ngram_context_size=5):
    """Build the classifier input: the place name plus a window of surrounding words."""
    word = span["word"]
    start = span["start"]
    end = span["end"]
    label = span["entity_group"]

    # Extract up to `ngram_context_size` whitespace-separated words on each side
    previous_text = text[:start].strip()
    next_text = text[end:].strip()
    previous_words = previous_text.split()[-ngram_context_size:]
    next_words = next_text.split()[:ngram_context_size]

    # Build context string
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, label


content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

spans = ner(content)
for span in spans:
    # Only spans tagged as spatial named entities are classified
    if span["entity_group"] == "NP_Spatial":
        word, context, label = get_context(content, span, ngram_context_size=5)
        print(f"Place name: {word}")

        prediction = placename_classifier(context)
        print(f"Predicted label: {prediction}")

# Output
# Place name: Wintchester
# Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
# Place name: Angleterre
# Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
# Place name: Hampshire
# Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
# Place name: Itching
# Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
# Place name: Salisbury
# Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
# Place name: Londres
# Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]
```
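
The classifier does not receive the bare place name: it receives a context string built from the words around it. The sketch below shows that format without downloading any model. It uses a simplified copy of the context-building logic (`build_context` is an illustrative name) and a hand-built span dictionary that is assumed to mirror the `word`, `start`, `end` and `entity_group` fields returned by the token-classification pipeline.

```python
def build_context(text, span, ngram_context_size=5):
    """Return '[word]: <left words> word <right words>' for one detected span."""
    word = span["word"]
    # Keep at most `ngram_context_size` whitespace-separated words on each side
    previous_words = text[:span["start"]].split()[-ngram_context_size:]
    next_words = text[span["end"]:].split()[:ngram_context_size]
    return f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"


text = "ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching"

# Hand-built span (assumption: same fields as the NER pipeline output)
start = text.index("Hampshire")
span = {
    "word": "Hampshire",
    "start": start,
    "end": start + len("Hampshire"),
    "entity_group": "NP_Spatial",
}

context = build_context(text, span, ngram_context_size=3)
print(context)
# [Hampshire]: d'Angleterre, capitale du Hampshire , sur le
```

Because the window is split on whitespace, punctuation attached to the place name in the source text (here the comma after "Hampshire") ends up as a separate token in the context string.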

## Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or from other corpora.
Scores for rare classes such as Human-made and Lake rest on very few test examples (7 and 11 respectively) and should be read with caution.

## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.