---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
widget:
- text: >-
    MAEATAE, (Géogr. anc.) anciens peuples de l'île de la grande Bretagne ; ils
    étoient auprès du mur qui coupoit l'île en deux parties.
datasets:
- GEODE/GeoEDdA-TopoRel
---

# bert-base-multilingual-cased-geography-entry-classification

This model classifies geographic encyclopedia articles into one of three classes: Place, Person, or Other.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It was trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)*, edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).

## Model Description

- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0

## Class labels

The tagset is as follows:
- **Place**: encyclopedia entry describing the name of a place (such as a city, a river, a country, etc.)
- **Person**: encyclopedia entry describing the name of a people or community
- **Other**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts, cross-references to other entries, etc.)

## Dataset

The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation and test sets, with the following distribution of entries among classes:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Place | 1,800 | 225 | 225 |
| Person | 200 | 25 | 25 |
| Other | 200 | 25 | 25 |
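
The split table shows a strong class imbalance: Place accounts for roughly 82% of the training entries. The sketch below recomputes the per-class shares from the table and derives inverse-frequency class weights. The names `SPLIT_COUNTS`, `class_shares` and `balanced_class_weights` are illustrative, the third class is keyed as `Other` to match the tagset, and the model card does not state whether any class weighting was used during fine-tuning, so the weights are only an assumption about how one might compensate for the imbalance.

```python
# Entry counts copied from the split table (third row keyed as "Other").
SPLIT_COUNTS = {
    "train": {"Place": 1800, "Person": 200, "Other": 200},
    "validation": {"Place": 225, "Person": 25, "Other": 25},
    "test": {"Place": 225, "Person": 25, "Other": 25},
}


def class_shares(counts):
    """Return each class's fraction of the entries in one split."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    Hypothetical helper: the model card does not say whether
    class weighting was applied when fine-tuning the model.
    """
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


train_shares = class_shares(SPLIT_COUNTS["train"])
train_weights = balanced_class_weights(SPLIT_COUNTS["train"])

for label in SPLIT_COUNTS["train"]:
    print(f"{label}: share={train_shares[label]:.3f}, weight={train_weights[label]:.3f}")
```

The three splits have the same class proportions, so the same shares apply to the validation and test sets.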

## Evaluation

* Overall weighted-average model performance

| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| | 0.980 | 0.978 | 0.979 |

* Per-class model performance (Test set)

| | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| Place | 0.99 | 0.98 | 0.99 | 225 |
| Person | 1.00 | 0.96 | 0.98 | 25 |
| Other | 0.83 | 0.96 | 0.89 | 25 |
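
To make the relation between the two tables explicit, the sketch below recomputes support-weighted and macro averages from the per-class rows. The per-class values are rounded to two decimals, so the weighted results (about 0.976, 0.976 and 0.980) only approximate the reported overall figures of 0.980, 0.978 and 0.979. The function names are illustrative and not taken from the authors' evaluation code.

```python
# Per-class test-set metrics copied from the table:
# (precision, recall, f_score, support)
PER_CLASS = {
    "Place": (0.99, 0.98, 0.99, 225),
    "Person": (1.00, 0.96, 0.98, 25),
    "Other": (0.83, 0.96, 0.89, 25),
}
METRIC_NAMES = ("precision", "recall", "f_score")


def weighted_average(per_class):
    """Average each metric, weighting every class by its support."""
    total = sum(row[3] for row in per_class.values())
    return {
        name: sum(row[i] * row[3] for row in per_class.values()) / total
        for i, name in enumerate(METRIC_NAMES)
    }


def macro_average(per_class):
    """Average each metric with equal weight per class."""
    n = len(per_class)
    return {
        name: sum(row[i] for row in per_class.values()) / n
        for i, name in enumerate(METRIC_NAMES)
    }


weighted = weighted_average(PER_CLASS)
macro = macro_average(PER_CLASS)

for name in METRIC_NAMES:
    print(f"{name}: weighted={weighted[name]:.3f}, macro={macro[name]:.3f}")
```

Because Place supplies 225 of the 275 test entries, the weighted averages stay close to the Place scores, while the macro averages give more weight to the lower precision on Other.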

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Pick the best available device: Apple MPS first, then CUDA, then CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

model_id = "GEODE/bert-base-multilingual-cased-geography-entry-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# truncation=True shortens entries that exceed the model's maximum input length.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "MAEATAE, (Géogr. anc.) anciens peuples de l'île de la grande Bretagne ; ils étoient auprès du mur qui coupoit l'île en deux parties. Cambden ne doute point que ce soit le Nortumberland.",
    "APPONDURE, s. f. terme de riviere ; mot dont on se sert dans la composition d'un train ; c'est une portion de perche employée pour fortifier le chantier lorsqu'il est trop menu."
]

# Classify each entry and print the predicted label with its score.
for sample in samples:
    print(pipe(sample))

# Output:
# [{'label': 'Place', 'score': 0.9984742999076843}]
# [{'label': 'Person', 'score': 0.9927592277526855}]
# [{'label': 'Other', 'score': 0.9885557293891907}]
```
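
Each pipeline call in the example returns a list of `{'label', 'score'}` dictionaries for one input text. When predictions feed a downstream process, it can help to keep only confident ones. The sketch below is one possible post-processing step that needs no model download: `confident_label` and the `min_score` threshold of 0.9 are hypothetical choices, not part of the released code, and the last example result is invented to show the fallback.

```python
from typing import Dict, List, Optional


def confident_label(result: List[Dict], min_score: float = 0.9) -> Optional[str]:
    """Return the top-scoring label if its score reaches min_score, else None.

    `result` is the list of {'label', 'score'} dicts returned by the
    pipeline for a single input text.
    """
    best = max(result, key=lambda prediction: prediction["score"])
    return best["label"] if best["score"] >= min_score else None


# The first three results are copied from the example output;
# the fourth is a made-up low-confidence prediction.
example_results = [
    [{"label": "Place", "score": 0.9984742999076843}],
    [{"label": "Person", "score": 0.9927592277526855}],
    [{"label": "Other", "score": 0.9885557293891907}],
    [{"label": "Other", "score": 0.55}],
]

labels = [confident_label(result) for result in example_results]
print(labels)  # ['Place', 'Person', 'Other', None]
```

Entries that come back as `None` can then be set aside for manual review instead of being assigned a class automatically.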

## Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely perform poorly on text in other languages or from other corpora.

## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.