---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
widget:
- text: >-
    * ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut
    Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.
datasets:
- GEODE/GeoEDdA-TopoRel
---

# bert-base-multilingual-cased-place-entry-classification

This model is designed to classify geographic encyclopedia articles describing places.
It is a fine-tuned version of the bert-base-multilingual-cased model.
It has been trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).

## Model Description

- **Developed by:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/)
- **Model type:** Text classification
- **Repository:** [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0

## Class labels

The tagset is as follows (with examples from the dataset):

- **City**: villes, bourgs, villages, etc.
- **Island**: îles, presqu'îles, etc.
- **Region**: régions, contrées, provinces, cercles, etc.
- **River**: rivières, fleuves, etc.
- **Mountain**: montagnes, vallées, etc.
- **Country**: pays, royaumes, etc.
- **Sea**: mer, golphe, baie, etc.
- **Other**: promontoires, caps, rivages, déserts, etc.
- **Human-made**: ports, châteaux, forteresses, abbayes, etc.
- **Lake**: lacs, étangs, marais, etc.

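
When the classifier is run over many entries, it can help to tally the predictions against this tagset. The following is a minimal sketch, assuming each prediction is a dict with `label` and `score` keys (the shape returned by the `transformers` text-classification pipeline) and assuming the model's label strings match the tagset names above. `PLACE_LABELS`, `tally_labels` and the example predictions are hypothetical and only serve as illustration.

```python
# The ten classes of the tagset listed above. Assumption: these strings
# match the labels returned by the model.
PLACE_LABELS = (
    "City", "Island", "Region", "River", "Mountain",
    "Country", "Sea", "Other", "Human-made", "Lake",
)


def tally_labels(predictions, min_score=0.0):
    """Count predictions per class, skipping those scored below min_score."""
    counts = {label: 0 for label in PLACE_LABELS}
    for pred in predictions:
        label = pred["label"]
        if label not in counts:
            raise ValueError(f"unexpected label: {label!r}")
        if pred["score"] >= min_score:
            counts[label] += 1
    return counts


# Hypothetical predictions, shaped like pipeline output (not real model scores).
example = [
    {"label": "City", "score": 0.99},
    {"label": "Region", "score": 0.98},
    {"label": "City", "score": 0.40},
]

counts = tally_labels(example, min_score=0.5)
print({label: n for label, n in counts.items() if n > 0})
```

Raising on an unknown label is a deliberate choice here: it surfaces a mismatch between the assumed tagset and the model's actual labels instead of silently dropping predictions.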
## Dataset

The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
The dataset is split into train, validation and test sets, with the following distribution of entries among classes:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| City | 921 | 33 | 40 |
| Island | 216 | 20 | 27 |
| Region | 138 | 40 | 28 |
| River | 133 | 20 | 28 |
| Mountain | 63 | 29 | 22 |
| Human-made | 38 | 10 | 9 |
| Other | 27 | 12 | 12 |
| Sea | 26 | 13 | 12 |
| Lake | 22 | 9 | 9 |
| Country | 16 | 14 | 13 |

## Evaluation

* Overall macro-average model performance (test set)

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.95 | 0.92 | 0.93 |

* Overall weighted-average model performance (test set)

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.94 | 0.94 | 0.94 |

* Per-class model performance (test set)

| | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| City | 0.91 | 1.00 | 0.95 | 40 |
| Island | 0.96 | 0.96 | 0.96 | 27 |
| River | 0.97 | 1.00 | 0.98 | 28 |
| Region | 0.86 | 0.89 | 0.88 | 28 |
| Mountain | 1.00 | 0.95 | 0.98 | 22 |
| Country | 1.00 | 0.85 | 0.92 | 13 |
| Sea | 1.00 | 0.92 | 0.96 | 12 |
| Other | 0.90 | 0.75 | 0.82 | 12 |
| Human-made | 0.90 | 1.00 | 0.95 | 9 |
| Lake | 1.00 | 0.89 | 0.94 | 9 |

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "GEODE/bert-base-multilingual-cased-place-entry-classification"

# Pick the first available device: Apple MPS, then CUDA, then CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
# truncation=True cuts entries that exceed the model's maximum input length.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "* ARCALU (Principauté d') petit état des Tartares-Monguls, sur la riviere d'Hoamko, où commence la grande muraille de la Chine, sous le 122e degré de longitude & le 42e de latitude septentrionale."
]

for sample in samples:
    print(pipe(sample))

# Output
# [{'label': 'City', 'score': 0.9969543218612671}]
# [{'label': 'Region', 'score': 0.9811353087425232}]
```

## Bias, Risks, and Limitations

This model was trained entirely on French encyclopaedic entries classified as Geography (and place) and will likely not perform well on text in other languages or from other corpora.

## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).

Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.
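
As a complement to the Evaluation section, the macro and weighted averages reported there can be recomputed from the per-class test-set table. The sketch below copies the per-class scores from that table; `PER_CLASS`, `macro_average` and `weighted_average` are hypothetical names introduced for illustration. Because the inputs are already rounded to two decimals, the result is only an approximate check, not the authors' evaluation code.

```python
# Per-class test-set scores copied from the Evaluation table:
# class -> (precision, recall, f_score, support)
PER_CLASS = {
    "City": (0.91, 1.00, 0.95, 40),
    "Island": (0.96, 0.96, 0.96, 27),
    "River": (0.97, 1.00, 0.98, 28),
    "Region": (0.86, 0.89, 0.88, 28),
    "Mountain": (1.00, 0.95, 0.98, 22),
    "Country": (1.00, 0.85, 0.92, 13),
    "Sea": (1.00, 0.92, 0.96, 12),
    "Other": (0.90, 0.75, 0.82, 12),
    "Human-made": (0.90, 1.00, 0.95, 9),
    "Lake": (1.00, 0.89, 0.94, 9),
}


def macro_average(rows):
    """Unweighted mean of each metric: every class counts equally."""
    n = len(rows)
    return [sum(row[i] for row in rows) / n for i in range(3)]


def weighted_average(rows):
    """Mean of each metric weighted by class support (number of test entries)."""
    total = sum(row[3] for row in rows)
    return [sum(row[i] * row[3] for row in rows) / total for i in range(3)]


rows = list(PER_CLASS.values())
print("macro:   ", [round(v, 2) for v in macro_average(rows)])
print("weighted:", [round(v, 2) for v in weighted_average(rows)])
# macro:    [0.95, 0.92, 0.93]
# weighted: [0.94, 0.94, 0.94]
```

The macro average treats the small classes (for example Lake and Human-made, with 9 test entries each) the same as City (40 entries), while the weighted average is dominated by the larger classes.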