+---
+license: cc-by-nc-4.0
+language:
+- fr
+base_model:
+- google-bert/bert-base-multilingual-cased
+pipeline_tag: text-classification
+datasets:
+- GEODE/GeoEDdA-TopoRel
+---
+# bert-base-multilingual-cased-classification-ner
+This model classifies place names (spatial named entities) recognized in geographic encyclopedia articles into ten classes, such as City, Country, Region or River.
+It is a fine-tuned version of the bert-base-multilingual-cased model.
+It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).
+## Model Description
+- **Authors:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
+- **Model type:** Text classification
+- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
+- **Language(s) (NLP):** French
+- **License:** cc-by-nc-4.0
+## Class labels
+The tagset is as follows:
+- **City**
+- **Country**
+- **Human-made**
+- **Island**
+- **Lake**
+- **Mountain**
+- **Other**
+- **Region**
+- **River**
+- **Sea**
+## Dataset
+The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
+The dataset is split into train, validation and test sets, with the following distribution of entries across classes:
+| Class | Train | Validation | Test |
+|---|:---:|:---:|:---:|
+| City | 2,657 | 276 | 277 |
+| Country | 1,544 | 239 | 169 |
+| Human-made | 104 | 7 | 7 |
+| Island | 554 | 81 | 109 |
+| Lake | 69 | 15 | 11 |
+| Mountain | 232 | 76 | 70 |
+| Other | 235 | 47 | 39 |
+| Region | 2,706 | 424 | 440 |
+| River | 128 | 944 | 125 |
+| Sea | 196 | 37 | 57 |
+## Evaluation
+* Overall weighted-average model performance
+|   | Precision | Recall | F-score |
+|---|:---:|:---:|:---:|
+| Weighted avg | 0.84 | 0.84 | 0.84 |
+* Per-class model performance (test set)
+| Class | Precision | Recall | F-score | Support |
+|---|:---:|:---:|:---:|:---:|
+| City | 0.82 | 0.88 | 0.85 | 277 |
+| Country | 0.80 | 0.91 | 0.85 | 169 |
+| Human-made | 0.50 | 0.71 | 0.59 | 7 |
+| Island | 0.79 | 0.76 | 0.78 | 109 |
+| Lake | 1.00 | 0.64 | 0.78 | 11 |
+| Mountain | 0.81 | 0.73 | 0.77 | 70 |
+| Other | 0.68 | 0.49 | 0.57 | 39 |
+| Region | 0.89 | 0.85 | 0.87 | 440 |
+| River | 0.87 | 0.90 | 0.88 | 125 |
+| Sea | 0.96 | 0.93 | 0.95 | 57 |
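As a sanity check, the weighted averages reported above can be approximated from the per-class table by weighting each class score by its support. The sketch below hard-codes the rounded test-set values from the table, so its results may differ from the reported 0.84 in the last digit. The names `per_class` and `weighted_average` are illustrative and not part of the model's code.

```python
# Per-class test-set scores copied from the table above:
# (precision, recall, f_score, support)
per_class = {
    "City": (0.82, 0.88, 0.85, 277),
    "Country": (0.80, 0.91, 0.85, 169),
    "Human-made": (0.50, 0.71, 0.59, 7),
    "Island": (0.79, 0.76, 0.78, 109),
    "Lake": (1.00, 0.64, 0.78, 11),
    "Mountain": (0.81, 0.73, 0.77, 70),
    "Other": (0.68, 0.49, 0.57, 39),
    "Region": (0.89, 0.85, 0.87, 440),
    "River": (0.87, 0.90, 0.88, 125),
    "Sea": (0.96, 0.93, 0.95, 57),
}


def weighted_average(rows, index):
    """Average the metric at position `index`, weighting each class by its support."""
    total_support = sum(row[3] for row in rows)
    return sum(row[index] * row[3] for row in rows) / total_support


rows = list(per_class.values())
precision = weighted_average(rows, 0)
recall = weighted_average(rows, 1)
f_score = weighted_average(rows, 2)
print(f"precision={precision:.3f} recall={recall:.3f} f_score={f_score:.3f}")
```

Because large classes such as Region and City dominate the weighting, the low scores of small classes such as Human-made barely move the overall figure.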
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+import torch
+from transformers import pipeline
+# Pick the best available device: Apple MPS first, then CUDA, then CPU
+device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))
+# NER pipeline that detects spans, then the classifier that labels each place name
+ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
+placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)
+def get_context(text, span, ngram_context_size=5):
+    """Build the classifier input: the place name plus a window of surrounding words."""
+    word = span["word"]
+    start = span["start"]
+    end = span["end"]
+    label = span["entity_group"]
+    # Extract the text before and after the span
+    previous_text = text[:start].strip()
+    next_text = text[end:].strip()
+    # Keep at most ngram_context_size whitespace-separated words on each side
+    previous_words = previous_text.split()[-ngram_context_size:]
+    next_words = next_text.split()[:ngram_context_size]
+    # Build the context string: "[name]: left-words name right-words"
+    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
+    return word, context, label
+content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."
+spans = ner(content)
+for span in spans:
+    # Only spatial named entities (place names) are classified
+    if span["entity_group"] == "NP_Spatial":
+        word, context, _ = get_context(content, span, ngram_context_size=5)
+        print(f"Place name: {word}")
+        prediction = placename_classifier(context)
+        print(f"Predicted label: {prediction}")
+# Output:
+# Place name: Wintchester
+# Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
+# Place name: Angleterre
+# Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
+# Place name: Hampshire
+# Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
+# Place name: l'Itching
+# Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
+# Place name: Salisbury
+# Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
+# Place name: Londres
+# Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]
+```
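The classifier does not see the whole article: each place name is wrapped in a short context window before classification. The sketch below reproduces that windowing step on its own, with no model download, so the input format can be inspected. The span dictionary is hand-built for illustration and only mimics the `word`, `start`, `end` and `entity_group` keys used in the snippet above. `build_context` is a hypothetical stand-alone copy of `get_context`.

```python
def build_context(text, span, ngram_context_size=5):
    """Return "[name]: left-words name right-words" for one detected span."""
    word = span["word"]
    # Whitespace-separated words immediately before and after the span
    previous_words = text[:span["start"]].split()[-ngram_context_size:]
    next_words = text[span["end"]:].split()[:ngram_context_size]
    return f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"


text = "ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching"
start = text.index("Hampshire")
# Hand-built span (assumption): same keys as an aggregated NER pipeline result
span = {
    "word": "Hampshire",
    "start": start,
    "end": start + len("Hampshire"),
    "entity_group": "NP_Spatial",
}

context = build_context(text, span, ngram_context_size=2)
print(context)  # prints: [Hampshire]: capitale du Hampshire , sur
```

Since the window is built by splitting on whitespace, punctuation that directly follows the place name ends up as a separate token on the right-hand side.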
+## Bias, Risks, and Limitations
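+This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or from other corpora. Performance also varies by class: on the test set, Human-made and Other reach F-scores of 0.59 and 0.57, against 0.77 or higher for every other class.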
+## Acknowledgement
+The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
+Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.