---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
widget:
- text: >-
    MAEATAE, (Géogr. anc.) anciens peuples de l'île de la grande Bretagne ; ils
    étoient auprès du mur qui coupoit l'île en deux parties.
datasets:
- GEODE/GeoEDdA-TopoRel
---



# bert-base-multilingual-cased-geography-entry-classification


<!-- Provide a quick summary of what the model is/does. -->

This model is designed to classify geographic encyclopedia articles into Place, Person, or Other.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).




## Model Description

<!-- Provide a longer summary of what this model is. -->

- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0


## Class labels


The tagset is as follows:
- **Place**: encyclopedia entry describing a place (such as a city, a river, or a country)
- **Person**: encyclopedia entry describing a people or community
- **Other**: encyclopedia entry describing any other type of entity (such as abstract geographic concepts or cross-references to other entries)


## Dataset


The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation, and test sets, with the following distribution of entries among classes:

|   | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Place | 1,800 | 225 | 225 |
| Person | 200 | 25 | 25 |
| Other | 200 | 25 | 25 |
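
The class balance and split ratios implied by the table above can be checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table, the variable names are hypothetical, and the third class is named after the model's `Other` label.

```python
# Entry counts per class, copied from the distribution table above.
# The third class is named after the model's "Other" label.
counts = {
    "train": {"Place": 1800, "Person": 200, "Other": 200},
    "validation": {"Place": 225, "Person": 25, "Other": 25},
    "test": {"Place": 225, "Person": 25, "Other": 25},
}

# Total number of entries in each split and overall.
split_totals = {name: sum(classes.values()) for name, classes in counts.items()}
grand_total = sum(split_totals.values())

for name, classes in counts.items():
    total = split_totals[name]
    # Share of each class within the split.
    shares = ", ".join(f"{label}: {n / total:.1%}" for label, n in classes.items())
    print(f"{name:<10} {total:>5} entries ({total / grand_total:.0%} of all) | {shares}")
```

Each split has the same class proportions (roughly 82% Place, 9% Person, 9% Other), and the three splits follow an 80/10/10 ratio, so Place entries dominate all of them.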


## Evaluation


* Overall weighted-average model performance


|   | Precision | Recall | F-score |
|---|:---:|:---:|:---:|
|    | 0.980   | 0.978   | 0.979 | 



* Per-class model performance (test set)

|   | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| Place    |  0.99  |  0.98  |  0.99 | 225 |
| Person   |  1.00  |  0.96  |  0.98 | 25 |
| Other     |  0.83  |  0.96  |  0.89 | 25 |
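
As a rough sketch of how per-class scores combine into an overall figure, the snippet below computes support-weighted and macro averages from the per-class table. It assumes the overall figures are support-weighted averages over the same test set. Because the per-class values are rounded to two decimals, the results only approximate the reported three-decimal scores. The function and variable names are hypothetical.

```python
# Per-class test-set scores, copied from the table above (rounded to two decimals).
per_class = {
    "Place": {"precision": 0.99, "recall": 0.98, "f1": 0.99, "support": 225},
    "Person": {"precision": 1.00, "recall": 0.96, "f1": 0.98, "support": 25},
    "Other": {"precision": 0.83, "recall": 0.96, "f1": 0.89, "support": 25},
}

total_support = sum(row["support"] for row in per_class.values())


def weighted_average(metric):
    """Average a metric over classes, weighting each class by its support."""
    return sum(row[metric] * row["support"] for row in per_class.values()) / total_support


def macro_average(metric):
    """Average a metric over classes, giving every class the same weight."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)


metrics = ("precision", "recall", "f1")
weighted = {m: weighted_average(m) for m in metrics}
macro = {m: macro_average(m) for m in metrics}

for m in metrics:
    print(f"{m:<9} weighted={weighted[m]:.3f} macro={macro[m]:.3f}")
```

The weighted averages stay close to the Place scores because Place accounts for 225 of the 275 test entries. The macro averages are lower because the weaker precision on Other counts as much as the other two classes.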





## How to Get Started with the Model

Use the code below to get started with the model.


```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Pick the best available device: Apple MPS first, then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_id = "GEODE/bert-base-multilingual-cased-geography-entry-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# truncation=True shortens entries that exceed the model's maximum input length.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "MAEATAE, (Géogr. anc.) anciens peuples de l'île de la grande Bretagne ; ils étoient auprès du mur qui coupoit l'île en deux parties. Cambden ne doute point que ce soit le Nortumberland.",
    "APPONDURE, s. f. terme de riviere ; mot dont on se sert dans la composition d'un train ; c'est une portion  de perche employée pour fortifier le chantier lorsqu'il est trop menu."
]

# Classify each entry and print the predicted label with its score.
for sample in samples:
    print(pipe(sample))

# Output:
# [{'label': 'Place', 'score': 0.9984742999076843}]
# [{'label': 'Person', 'score': 0.9927592277526855}]
# [{'label': 'Other', 'score': 0.9885557293891907}]
```


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora. 



## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.