lmoncla committed on
Commit
5c31071
·
verified ·
1 Parent(s): 5c18d1e

Update README.md

Files changed (1)
  1. README.md +174 -3
README.md CHANGED
@@ -1,3 +1,174 @@
1
- ---
2
- license: cc-by-nc-4.0
3
- ---
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - fr
5
+ base_model:
6
+ - google-bert/bert-base-multilingual-cased
7
+ pipeline_tag: text-classification
8
+ datasets:
9
+ - GEODE/GeoEDdA-TopoRel
10
+ ---
11
+
12
+
13
+
14
+ # bert-base-multilingual-cased-geography-entry-classification
15
+
16
+
17
+ <!-- Provide a quick summary of what the model is/does. -->
18
+
19
+ This model classifies place names (spatial named entities) recognized in geographic encyclopedia articles into categories such as City, Country or River.
20
+ It is a fine-tuned version of the bert-base-multilingual-cased model.
21
+ It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).
22
+
23
+
24
+
25
+
26
+ ## Model Description
27
+
28
+ <!-- Provide a longer summary of what this model is. -->
29
+
30
+ - **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
31
+ - **Model type:** Text classification
32
+ - **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
33
+ - **Language(s) (NLP):** French
34
+ - **License:** cc-by-nc-4.0
35
+
36
+
37
+ ## Class labels
38
+
39
+
40
+ The tagset is as follows:
41
+ - **City**:
42
+ - **Country**:
43
+ - **Human-made**:
44
+ - **Island**:
45
+ - **Lake**:
46
+ - **Mountain**:
47
+ - **Other**:
48
+ - **Region**:
49
+ - **River**:
50
+ - **Sea**:
51
+
52
+ ## Dataset
53
+
54
+
55
+ The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
56
+ The dataset is split into train, validation and test sets, with the following distribution of entries among classes:
57
+
58
+ | | Train | Validation | Test |
59
+ |---|:---:|:---:|:---:|
61
+ | City | 2,657 | 276 | 277 |
62
+ | Country | 1,544 | 239 | 169 |
63
+ | Human-made | 104 | 7 | 7 |
64
+ | Island | 554 | 81 | 109 |
65
+ | Lake | 69 | 15 | 11 |
66
+ | Mountain | 232 | 76 | 70 |
67
+ | Other | 235 | 47 | 39 |
68
+ | Region | 2,706 | 424 | 440 |
69
+ | River | 128 | 944 | 125 |
70
+ | Sea | 196 | 37 | 57 |
71
+
72
+
73
+ ## Evaluation
74
+
75
+
76
+ * Overall weighted-average model performance
77
+
78
+
79
+ | | Precision | Recall | F-score |
80
+ |---|:---:|:---:|:---:|
81
+ | | 0.84 | 0.84 | 0.84 |
82
+
83
+
84
+
85
+ * Per-class model performance (test set)
86
+
87
+ | | Precision | Recall | F-score | Support |
88
+ |---|:---:|:---:|:---:|:---:|
89
+ | City | 0.82 | 0.88 | 0.85 | 277
90
+ | Country | 0.80 | 0.91 | 0.85 | 169
91
+ | Human-made | 0.50 | 0.71 | 0.59 | 7
92
+ | Island | 0.79 | 0.76 | 0.78 | 109
93
+ | Lake | 1.00 | 0.64 | 0.78 | 11
94
+ | Mountain | 0.81 | 0.73 | 0.77 | 70
95
+ | Other | 0.68 | 0.49 | 0.57 | 39
96
+ | Region | 0.89 | 0.85 | 0.87 | 440
97
+ | River | 0.87 | 0.90 | 0.88 | 125
98
+ | Sea | 0.96 | 0.93 | 0.95 | 57
99
+
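The overall figures can be related to the per-class rows with a short calculation. The sketch below assumes the overall scores are support-weighted averages over the test-set classes; `per_class` and `weighted_average` are illustrative names, not part of the model repository. Because the per-class inputs are already rounded to two decimals, the result only approximates the reported 0.84.

```python
# Per-class test-set scores copied from the table above:
# (precision, recall, f_score, support)
per_class = {
    "City": (0.82, 0.88, 0.85, 277),
    "Country": (0.80, 0.91, 0.85, 169),
    "Human-made": (0.50, 0.71, 0.59, 7),
    "Island": (0.79, 0.76, 0.78, 109),
    "Lake": (1.00, 0.64, 0.78, 11),
    "Mountain": (0.81, 0.73, 0.77, 70),
    "Other": (0.68, 0.49, 0.57, 39),
    "Region": (0.89, 0.85, 0.87, 440),
    "River": (0.87, 0.90, 0.88, 125),
    "Sea": (0.96, 0.93, 0.95, 57),
}


def weighted_average(rows, index):
    """Average the metric at position `index`, weighting each class by its support."""
    total_support = sum(row[3] for row in rows)
    return sum(row[index] * row[3] for row in rows) / total_support


rows = list(per_class.values())
total_support = sum(row[3] for row in rows)
precision = weighted_average(rows, 0)
recall = weighted_average(rows, 1)
f_score = weighted_average(rows, 2)

print(f"support={total_support}")
print(f"precision={precision:.3f} recall={recall:.3f} f_score={f_score:.3f}")
```

All three values fall between 0.84 and 0.85. Large classes such as Region and City dominate the average, so the weaker scores on small classes (Human-made, Other) barely move the overall figure.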
100
+
101
+
102
+
103
+
104
+
105
+ ## How to Get Started with the Model
106
+
107
+ Use the code below to get started with the model.
108
+
109
+
110
+ ```python
+ import torch
+ from transformers import pipeline
+
+ # Use Apple MPS if available, otherwise CUDA, otherwise CPU
+ device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))
+
+ # Step 1: token-classification model that detects entity spans (place names are tagged NP_Spatial)
+ ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
+ # Step 2: text-classification model that assigns a place type to each place name in context
+ placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)
+
+ def get_context(text, span, ngram_context_size=5):
+     word = span["word"]
+     start = span["start"]
+     end = span["end"]
+     label = span["entity_group"]
+
+     # Extract up to `ngram_context_size` words on each side of the span
+     previous_text = text[:start].strip()
+     next_text = text[end:].strip()
+     previous_words = previous_text.split()[-ngram_context_size:]
+     next_words = next_text.split()[:ngram_context_size]
+
+     # Build the classifier input: "[word]: left context word right context"
+     context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
+     return word, context, label
+
+ content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."
+
+ spans = ner(content)
+ for span in spans:
+     # Keep only spatial named entities (place names)
+     if span["entity_group"] == "NP_Spatial":
+         word, context, _ = get_context(content, span, ngram_context_size=5)
+         print(f"Place name: {word}")
+
+         prediction = placename_classifier(context)
+         print(f"Predicted label: {prediction}")
+
+ # Output:
+ # Place name: Wintchester
+ # Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
+ # Place name: Angleterre
+ # Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
+ # Place name: Hampshire
+ # Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
+ # Place name: l'Itching
+ # Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
+ # Place name: Salisbury
+ # Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
+ # Place name: Londres
+ # Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]
+ ```
161
+
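To see what the place-name classifier receives as input, the context-building step can be run on its own, without downloading any model. The sketch below reuses the logic of `get_context` on a hand-made span dictionary; the `span` values are constructed here for illustration and are not produced by the NER pipeline.

```python
def get_context(text, span, ngram_context_size=5):
    """Return the classifier input: the place name in brackets, then a word window around it."""
    word, start, end = span["word"], span["start"], span["end"]
    previous_words = text[:start].split()[-ngram_context_size:]
    next_words = text[end:].split()[:ngram_context_size]
    return f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"


text = "Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching"

# Hypothetical span, mimicking the keys returned by an aggregated token-classification pipeline
start = text.index("Hampshire")
span = {
    "word": "Hampshire",
    "start": start,
    "end": start + len("Hampshire"),
    "entity_group": "NP_Spatial",
}

context = get_context(text, span, ngram_context_size=3)
print(context)
# [Hampshire]: d'Angleterre, capitale du Hampshire , sur le
```

Splitting on whitespace keeps punctuation attached to the neighbouring word, and the comma that directly follows the span becomes a token of its own. A larger `ngram_context_size` gives the classifier more surrounding text at the cost of longer inputs.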
162
+
163
+ ## Bias, Risks, and Limitations
164
+
165
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
166
+
167
+ This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.
168
+
169
+
170
+
171
+ ## Acknowledgement
172
+
173
+ The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
174
+ Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.