---
license: cc-by-nc-4.0
language:
- fr
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
widget:
- text: >-
    PEGOE, (Géog. anc.) 1°. ville de l'Achaie, dans la Mégaride ; 2°. ville de
    l'Hellespont, selon Ortelius ; 3°. ville de l'île de Cypre ou de la Cyrénie,
    selon Etienne le géographe.
---

# bert-base-multilingual-cased-single-multiple-place-classification

This model is designed to classify geographic encyclopedia articles describing places. It is a fine-tuned version of the bert-base-multilingual-cased model. It has been trained on a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres (1751-1772)* edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).

## Model Description

- **Developed by:** Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/)
- **Model type:** Text classification
- **Repository:**
- **Language(s) (NLP):** French
- **License:** cc-by-nc-4.0

## Class labels

The tagset is as follows:

- **Single**: only one place is described
- **Multiple**: several places are described (a single name with multiple locations)

## Dataset

The model was trained using a set of 8658 entries classified as 'Place' (using this model: https://huggingface.co/GEODE/bert-base-multilingual-cased-geography-entry-classification) among entries classified as 'Geography' (using this model: https://huggingface.co/GEODE/bert-base-multilingual-cased-edda-domain-classification).
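The two-stage selection described above (keep 'Geography' entries, then keep 'Place' entries among them) can be sketched as a small cascade. This is only an illustration, not the authors' actual preprocessing code: the function name `select_place_entries` is hypothetical, the label strings `"Geography"` and `"Place"` are taken from the wording of this card and may differ from the strings the upstream models really emit, and the two toy classifiers are stand-ins so the sketch runs without downloading anything.

```python
from typing import Callable, Dict, List

# A classifier is any callable that maps a list of texts to one dict per text
# containing at least a "label" key (the shape returned by a transformers
# text-classification pipeline when it is called on a list of strings).
Classifier = Callable[[List[str]], List[Dict[str, object]]]


def select_place_entries(
    entries: List[str],
    domain_clf: Classifier,
    entry_clf: Classifier,
    domain_label: str = "Geography",  # assumed label string
    place_label: str = "Place",       # assumed label string
) -> List[str]:
    """Keep entries labelled as geography, then keep those labelled as places."""
    if not entries:
        return []
    geography = [
        text for text, pred in zip(entries, domain_clf(entries))
        if pred["label"] == domain_label
    ]
    if not geography:
        return []
    return [
        text for text, pred in zip(geography, entry_clf(geography))
        if pred["label"] == place_label
    ]


# Toy stand-ins for the two upstream models (hypothetical keyword rules).
def toy_domain_clf(texts: List[str]) -> List[Dict[str, object]]:
    return [{"label": "Geography" if "Géog" in t else "Other"} for t in texts]


def toy_entry_clf(texts: List[str]) -> List[Dict[str, object]]:
    return [{"label": "Place" if "ville" in t else "Other"} for t in texts]


# Toy inputs, not taken from the training data.
toy_entries = [
    "ALBI, (Géog.) ville de France.",
    "ABEILLE, (Hist. nat.) insecte.",
    "RHIN, (Géog.) fleuve.",
]
selected = select_place_entries(toy_entries, toy_domain_clf, toy_entry_clf)
print(selected)
```

To run the cascade with the real models, the two stand-ins would be replaced by `transformers` text-classification pipelines loaded from the two model pages linked above, after checking which label strings those models return.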
The entries are distributed across the splits and classes as follows:

| | Train | Validation | Test |
|---|:---:|:---:|:---:|
| Single | 5760 | 1235 | 1234 |
| Multiple | 300 | 64 | 65 |

## Evaluation

* Overall macro-average model performance

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.92 | 0.92 | 0.92 |

* Overall weighted-average model performance

| Precision | Recall | F-score |
|:---:|:---:|:---:|
| 0.98 | 0.98 | 0.98 |

* Per-class model performance (Test set)

| | Precision | Recall | F-score | Support |
|---|:---:|:---:|:---:|:---:|
| Multiple | 0.85 | 0.85 | 0.85 | 65 |
| Single | 0.99 | 0.99 | 0.99 | 1234 |

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Pick the best available device: Apple MPS, then CUDA, then CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

tokenizer = AutoTokenizer.from_pretrained("GEODE/bert-base-multilingual-cased-single-multiple-place-classification")
model = AutoModelForSequenceClassification.from_pretrained("GEODE/bert-base-multilingual-cased-single-multiple-place-classification")

# truncation=True cuts entries that exceed the model's maximum input length.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, truncation=True, device=device)

# One entry describing a single place, and one listing several places under the same name.
samples = [
    "* ALBI, (Géog.) ville de France, capitale de l'Albigeois, dans le haut Languedoc : elle est sur le Tarn. Long. 19. 49. lat. 43. 55. 44.",
    "PEGOE, (Géog. anc.) 1°. ville de l'Achaie, dans la Mégaride ; 2°. ville de l'Hellespont, selon Ortelius ; 3°. ville de l'île de Cypre ou de la Cyrénie, selon Etienne le géographe. "
]

for sample in samples:
    print(pipe(sample))
```

## Bias, Risks, and Limitations

This model was trained exclusively on French encyclopedic entries classified as Geography (and, within that domain, as Place). It will likely not perform well on text in other languages or from other corpora.
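When reading the evaluation figures, it helps to see how the two overall averages relate to the per-class rows. The sketch below recomputes the macro and weighted F-scores from the rounded per-class values in the test-set table. The helper names are illustrative, and because the inputs are already rounded to two decimals, the results can only be expected to agree with the reported averages up to rounding.

```python
# Per-class test-set rows copied from the Evaluation table:
# label -> (precision, recall, F-score, support)
per_class = {
    "Multiple": (0.85, 0.85, 0.85, 65),
    "Single": (0.99, 0.99, 0.99, 1234),
}

F_SCORE = 2   # index of the F-score in each row
SUPPORT = 3   # index of the support (number of test entries) in each row


def macro_average(rows, index):
    """Unweighted mean over classes: each class counts equally."""
    values = [row[index] for row in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, index):
    """Mean over classes weighted by support: large classes dominate."""
    total = sum(row[SUPPORT] for row in rows.values())
    return sum(row[index] * row[SUPPORT] for row in rows.values()) / total


macro_f = macro_average(per_class, F_SCORE)
weighted_f = weighted_average(per_class, F_SCORE)

print(f"macro F-score:    {macro_f:.2f}")
print(f"weighted F-score: {weighted_f:.2f}")
```

Because the Single class accounts for 1234 of the 1299 test entries, the weighted average mostly reflects performance on Single. The macro average and the Multiple row are the more informative figures for the minority class.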
## Acknowledgement

The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.