---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
datasets:
- GEODE/GeoEDdA-TopoRel
---
# bert-base-multilingual-cased-classification-relation
<!-- Provide a quick summary of what the model is/does. -->
This model classifies spatial relations recognized in geographic encyclopedia articles.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).
## Model Description
<!-- Provide a longer summary of what this model is. -->
- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0
## Class labels
The tagset is as follows:
- **Adjacency**
- **Crosses**
- **Distance-Orientation**
- **Inclusion**
- **Movement**
- **Other**
## Dataset
The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation and test sets, with the following distribution of entries among classes:
| Class | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Adjacency | 498 | 59 | 75 |
| Crosses | 397 | 50 | 29 |
| Distance-Orientation | 1,065 | 163 | 115 |
| Inclusion | 1,319 | 131 | 156 |
| Movement | 184 | 15 | 35 |
| Other | 195 | 30 | 42 |
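As a quick check on the table above, the sketch below sums each split and reports the share of each class in the training set. The counts are copied by hand from the table, and the variable names are illustrative only.

```python
# Per-class entry counts copied from the table above: (train, validation, test)
counts = {
    "Adjacency": (498, 59, 75),
    "Crosses": (397, 50, 29),
    "Distance-Orientation": (1065, 163, 115),
    "Inclusion": (1319, 131, 156),
    "Movement": (184, 15, 35),
    "Other": (195, 30, 42),
}
split_names = ("train", "validation", "test")

# Sum each column to get the size of every split
split_totals = {
    name: sum(row[i] for row in counts.values())
    for i, name in enumerate(split_names)
}
total = sum(split_totals.values())

for name, size in split_totals.items():
    print(f"{name}: {size} entries ({size / total:.1%})")

# Share of each class in the training split, to make the imbalance visible
train_share = {label: row[0] / split_totals["train"] for label, row in counts.items()}
for label, share in sorted(train_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {share:.1%}")
```

The splits come out at roughly 80/10/10 (3,658 / 448 / 452 entries). Inclusion and Distance-Orientation together make up about 65% of the training entries, while Movement and Other each account for about 5%.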
## Evaluation
* Overall weighted-average model performance
| | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
| Weighted average | 0.92 | 0.92 | 0.92 |
* Per-class model performance (test set)
| Class | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| Adjacency | 0.85 | 0.84 | 0.85 | 75 |
| Crosses | 0.78 | 0.86 | 0.82 | 29 |
| Distance-Orientation | 0.93 | 0.99 | 0.96 | 115 |
| Inclusion | 0.97 | 0.98 | 0.97 | 156 |
| Movement | 0.89 | 0.69 | 0.77 | 35 |
| Other | 0.95 | 0.88 | 0.91 | 42 |
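The weighted averages can be recovered from the per-class table by weighting each class by its support. The sketch below does this with the values copied by hand from the table; since those inputs are already rounded to two decimals, the result is only an approximation of the exact scores.

```python
# Per-class test-set scores copied from the table above:
# label -> (precision, recall, f_score, support)
per_class = {
    "Adjacency": (0.85, 0.84, 0.85, 75),
    "Crosses": (0.78, 0.86, 0.82, 29),
    "Distance-Orientation": (0.93, 0.99, 0.96, 115),
    "Inclusion": (0.97, 0.98, 0.97, 156),
    "Movement": (0.89, 0.69, 0.77, 35),
    "Other": (0.95, 0.88, 0.91, 42),
}

total_support = sum(row[3] for row in per_class.values())


def weighted_average(metric_index):
    """Average one metric over classes, weighting each class by its support."""
    return sum(row[metric_index] * row[3] for row in per_class.values()) / total_support


weighted = {
    "precision": weighted_average(0),
    "recall": weighted_average(1),
    "f_score": weighted_average(2),
}
for name, value in weighted.items():
    print(f"{name}: {value:.2f}")
# precision: 0.92
# recall: 0.92
# f_score: 0.92
```

Because the weights follow the support, the large Inclusion and Distance-Orientation classes dominate the average; the weaker Movement and Crosses scores have little effect on it.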
## How to Get Started with the Model
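Use the code below to get started with the model. It first detects spans with the `GEODE/camembert-base-edda-span-classification` token-classification model, then classifies the context of each `Relation` span with this model.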
```python
import torch
from transformers import pipeline

# Pick the best available device: Apple MPS, then CUDA, then CPU
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

# Token classifier used to detect spans, and the relation classifier described in this card
ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
relation_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-relation", truncation=True, device=device)


def get_context(text, span, ngram_context_size=5):
    word = span["word"]
    start = span["start"]
    end = span["end"]
    label = span["entity_group"]

    # Extract the words surrounding the span
    previous_text = text[:start].strip()
    next_text = text[end:].strip()
    previous_words = previous_text.split()[-ngram_context_size:]
    next_words = next_text.split()[:ngram_context_size]

    # Build the context string: "[span]: left words span right words"
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, label


content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

spans = ner(content)
for span in spans:
    # Only spans tagged as 'Relation' are passed to the relation classifier
    if span['entity_group'] == 'Relation':
        word, context, label = get_context(content, span, ngram_context_size=5)
        print(f"Relation: {word}")
        prediction = relation_classifier(context)
        print(f"Predicted label: {prediction}")

# Output
# Relation: sur le bord de
# Predicted label: [{'label': 'Crosses', 'score': 0.9778845906257629}]
# Relation: à dix-huit milles au sud-est de
# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9959626793861389}]
# Relation: à soixante sud-ouest de
# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9963018894195557}]
```
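To see what the classifier actually receives, the sketch below runs the `get_context` helper on its own, without loading any model. The span dictionary is built by hand to mimic one item of the token-classification output (the text is a shortened version of the example above, and the offsets are computed with `str.index` rather than predicted), so it is an illustration only.

```python
def get_context(text, span, ngram_context_size=5):
    """Same logic as the helper above, repeated so this block runs on its own."""
    word, start, end = span["word"], span["start"], span["end"]
    previous_words = text[:start].strip().split()[-ngram_context_size:]
    next_words = text[end:].strip().split()[:ngram_context_size]
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, span["entity_group"]


text = "WINCHESTER, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury."
relation = "sur le bord de"
start = text.index(relation)

# Hand-built span (hypothetical): in real use this comes from the NER pipeline
span = {"entity_group": "Relation", "word": relation, "start": start, "end": start + len(relation)}

word, context, label = get_context(text, span, ngram_context_size=3)
print(context)
# [sur le bord de]: capitale du Hampshire, sur le bord de l'Itching, à dix-huit
```

The relation span appears twice in the input: once in brackets as a prefix, and once in place between its left and right context windows. `ngram_context_size` controls how many whitespace-separated words are kept on each side.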
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.
## Acknowledgement
The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.