---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
widget:
- text: >-
    * ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.
datasets:
- GEODE/GeoEDdA-TopoRel
---



# bert-base-multilingual-cased-place-entry-classification


<!-- Provide a quick summary of what the model is/does. -->

This model is designed to classify geographic encyclopedia articles describing places.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, par une société de gens de lettres* (1751-1772) edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).




## Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/)
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0


## Class labels


The tagset is as follows (with examples from the dataset):
- **City**: villes, bourgs, villages, etc.
- **Island**: îles, presqu'îles, etc.
- **Region**: régions, contrées, provinces, cercles, etc.
- **River**: rivières, fleuves, etc.
- **Mountain**: montagnes, vallées, etc.
- **Country**: pays, royaumes, etc.
- **Sea**: mer, golphe, baie, etc.
- **Other**: promontoires, caps, rivages, déserts, etc.
- **Human-made**: ports, châteaux, forteresses, abbayes, etc.
- **Lake**: lacs, étangs, marais, etc.
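For downstream use, the mapping between class indices and these labels is stored in the checkpoint's `config.json` (`id2label`/`label2id`). A minimal illustration of such a mapping — the ordering below is a hypothetical example, not taken from the actual config:

```python
# Hypothetical id2label mapping for the ten classes above;
# the authoritative ordering is defined in the checkpoint's config.json.
id2label = {
    0: "City", 1: "Island", 2: "Region", 3: "River", 4: "Mountain",
    5: "Country", 6: "Sea", 7: "Other", 8: "Human-made", 9: "Lake",
}
label2id = {label: idx for idx, label in id2label.items()}
print(label2id["Sea"])  # → 6
```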


## Dataset

The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation, and test sets, with the following distribution of entries among classes:

|   | Train | Validation | Test|
|---|:---:|:---:|:---:|
| City                  | 921 | 33 | 40 | 
| Island                | 216 | 20 | 27 |
| Region                | 138 | 40 |  28 |
| River                 | 133 | 20 |  28 |
| Mountain              | 63 | 29 |  22 |
| Human-made            | 38 | 10 |  9 |
| Other                 | 27 | 12 |  12 |
| Sea                   | 26 | 13 |  12 |
| Lake                  | 22 | 9 |  9 |
| Country               | 16 | 14 |  13 |
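The split sizes implied by the table can be tallied directly:

```python
# Per-class entry counts copied from the table above: (train, validation, test)
counts = {
    "City": (921, 33, 40), "Island": (216, 20, 27), "Region": (138, 40, 28),
    "River": (133, 20, 28), "Mountain": (63, 29, 22), "Human-made": (38, 10, 9),
    "Other": (27, 12, 12), "Sea": (26, 13, 12), "Lake": (22, 9, 9),
    "Country": (16, 14, 13),
}

# Sum each column of the table to get the split sizes
train, val, test = (sum(c[i] for c in counts.values()) for i in range(3))
print(train, val, test)  # → 1600 200 200
```

Note the strong class imbalance in the training set: City alone accounts for over half of the 1,600 training entries.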



## Evaluation


* Overall macro-averaged model performance

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.95 | 0.92 | 0.93 |


* Overall weighted-average model performance

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.94 | 0.94 | 0.94 |


* Per-class model performance (test set)

|   | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
|              City   |    0.91   |   1.00   |   0.95   |     40|
|            Island   |    0.96   |   0.96   |   0.96   |     27|
|             River   |    0.97   |   1.00   |   0.98   |     28|
|            Region   |    0.86   |   0.89   |   0.88   |     28|
|          Mountain   |    1.00   |   0.95   |   0.98   |     22|
|           Country   |    1.00   |   0.85   |   0.92   |     13|
|               Sea   |    1.00   |   0.92   |   0.96   |     12|
|             Other   |    0.90   |   0.75   |   0.82   |     12|
|        Human-made   |    0.90   |   1.00   |   0.95   |      9|
|              Lake   |    1.00   |   0.89   |   0.94   |      9|
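As a sanity check, the macro and weighted averages reported above can be recomputed from the per-class F-scores and supports:

```python
# Per-class F-scores and supports taken from the test-set table above
f_scores = {
    "City": (0.95, 40), "Island": (0.96, 27), "River": (0.98, 28),
    "Region": (0.88, 28), "Mountain": (0.98, 22), "Country": (0.92, 13),
    "Sea": (0.96, 12), "Other": (0.82, 12), "Human-made": (0.95, 9),
    "Lake": (0.94, 9),
}

# Macro average: unweighted mean over classes
macro_f = sum(f for f, _ in f_scores.values()) / len(f_scores)

# Weighted average: mean weighted by per-class support
total = sum(n for _, n in f_scores.values())
weighted_f = sum(f * n for f, n in f_scores.values()) / total

print(round(macro_f, 2), round(weighted_f, 2))  # → 0.93 0.94
```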






## How to Get Started with the Model

Use the code below to get started with the model.


```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Select the best available device (Apple Silicon MPS, CUDA GPU, or CPU)
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

tokenizer = AutoTokenizer.from_pretrained("GEODE/bert-base-multilingual-cased-place-entry-classification")
model = AutoModelForSequenceClassification.from_pretrained("GEODE/bert-base-multilingual-cased-place-entry-classification")

# Truncation keeps long entries within BERT's 512-token input limit
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "* ARCALU (Principauté d') petit état des Tartares-Monguls, sur la riviere d'Hoamko, où commence  la grande muraille de la Chine, sous le 122e degré de longitude & le 42e de latitude septentrionale."
]

for sample in samples:
    print(pipe(sample))

# Output:
# [{'label': 'City', 'score': 0.9969543218612671}]
# [{'label': 'Region', 'score': 0.9811353087425232}]
```


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model was trained entirely on French encyclopaedic entries classified as Geography (place descriptions) and will likely not perform well on text in other languages or from other corpora.



## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.