	---
	license: cc-by-nc-4.0
	language:
	- fr
	base_model:
	- google-bert/bert-base-multilingual-cased
	pipeline_tag: text-classification
	datasets:
	- GEODE/GeoEDdA-TopoRel
	---



	# bert-base-multilingual-cased-classification-ner



	This model classifies place names (named entities) recognized in geographic encyclopedia articles.
	It is a fine-tuned version of the bert-base-multilingual-cased model.
	It was trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, par une société de gens de lettres* (1751-1772), edited by Diderot and d'Alembert and provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu).




	## Model Description


	- Authors: Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
	- Model type: Text classification
	- Repository: [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
	- Language(s) (NLP): French
	- License: cc-by-nc-4.0


	## Class labels


	The tagset is as follows:
	- City
	- Country
	- Human-made
	- Island
	- Lake
	- Mountain
	- Other
	- Region
	- River
	- Sea

	## Dataset


	The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
	The dataset is split into train, validation and test sets, with the following distribution of entries among classes:

	| Class | Train | Validation | Test |
	|---|:---:|:---:|:---:|
	| City | 2,657 | 276 | 277 |
	| Country | 1,544 | 239 | 169 |
	| Human-made | 104 | 7 | 7 |
	| Island | 554 | 81 | 109 |
	| Lake | 69 | 15 | 11 |
	| Mountain | 232 | 76 | 70 |
	| Other | 235 | 47 | 39 |
	| Region | 2,706 | 424 | 440 |
	| River | 128 | 944 | 125 |
	| Sea | 196 | 37 | 57 |
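The class counts are strongly imbalanced. As a rough illustration, the sketch below takes the Train column exactly as listed in the table above and prints each class's share of the training split. The variable names are illustrative and not part of the project's code.

```python
# Training-split counts copied as listed in the table above.
train_counts = {
    "City": 2657, "Country": 1544, "Human-made": 104, "Island": 554,
    "Lake": 69, "Mountain": 232, "Other": 235, "Region": 2706,
    "River": 128, "Sea": 196,
}

total = sum(train_counts.values())
shares = {label: count / total for label, count in train_counts.items()}

# Print classes from most to least frequent with their share of the split.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<11}{train_counts[label]:>6}  {share:6.1%}")

# Ratio between the most and least frequent class.
largest = max(train_counts, key=train_counts.get)
smallest = min(train_counts, key=train_counts.get)
imbalance_ratio = train_counts[largest] / train_counts[smallest]
print(f"{largest} / {smallest} = {imbalance_ratio:.1f}")
```

With these counts, Region and City together make up about 64% of the listed training entries, which is worth keeping in mind when reading the per-class scores of rare classes such as Lake or Human-made.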


	## Evaluation


	* Overall weighted-average model performance


	| | Precision | Recall | F-score |
	|---|:---:|:---:|:---:|
	| Weighted average | 0.84 | 0.84 | 0.84 |



	* Per-class model performance (test set)

	| Class | Precision | Recall | F-score | Support |
	|---|:---:|:---:|:---:|:---:|
	| City | 0.82 | 0.88 | 0.85 | 277 |
	| Country | 0.80 | 0.91 | 0.85 | 169 |
	| Human-made | 0.50 | 0.71 | 0.59 | 7 |
	| Island | 0.79 | 0.76 | 0.78 | 109 |
	| Lake | 1.00 | 0.64 | 0.78 | 11 |
	| Mountain | 0.81 | 0.73 | 0.77 | 70 |
	| Other | 0.68 | 0.49 | 0.57 | 39 |
	| Region | 0.89 | 0.85 | 0.87 | 440 |
	| River | 0.87 | 0.90 | 0.88 | 125 |
	| Sea | 0.96 | 0.93 | 0.95 | 57 |
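As a sanity check on how the two evaluation tables relate, the sketch below recomputes support-weighted averages from the per-class rows. It assumes the overall figures are the usual support-weighted averages over the test set, which the model card does not state explicitly. Because the per-class inputs are already rounded to two decimals, the results can only be expected to match to within about 0.01.

```python
# (precision, recall, f_score, support) per class, copied from the test-set table.
per_class = {
    "City": (0.82, 0.88, 0.85, 277),
    "Country": (0.80, 0.91, 0.85, 169),
    "Human-made": (0.50, 0.71, 0.59, 7),
    "Island": (0.79, 0.76, 0.78, 109),
    "Lake": (1.00, 0.64, 0.78, 11),
    "Mountain": (0.81, 0.73, 0.77, 70),
    "Other": (0.68, 0.49, 0.57, 39),
    "Region": (0.89, 0.85, 0.87, 440),
    "River": (0.87, 0.90, 0.88, 125),
    "Sea": (0.96, 0.93, 0.95, 57),
}

total_support = sum(row[3] for row in per_class.values())


def weighted_average(metric_index):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=F-score)."""
    weighted_sum = sum(row[metric_index] * row[3] for row in per_class.values())
    return weighted_sum / total_support


weighted_precision = weighted_average(0)
weighted_recall = weighted_average(1)
weighted_f_score = weighted_average(2)

# Unweighted (macro) mean: every class counts equally, whatever its size.
macro_f_score = sum(row[2] for row in per_class.values()) / len(per_class)

print(f"weighted P={weighted_precision:.3f} R={weighted_recall:.3f} F={weighted_f_score:.3f}")
print(f"macro F={macro_f_score:.3f}")
```

The macro F-score comes out lower (about 0.79) than the weighted one because the lowest-scoring classes, Other and Human-made, are also small, yet they count as much as Region or City in an unweighted mean.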






	## How to Get Started with the Model

	Use the code below to get started with the model.


	```python
	import torch
	from transformers import pipeline

	# Pick the best available device: Apple MPS, then CUDA, then CPU.
	device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

	# The first pipeline detects entity spans; the second classifies each place name.
	ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
	placename_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-ner", truncation=True, device=device)

	def get_context(text, span, ngram_context_size=5):
	    """Return the span text, a context window of surrounding words, and the span label."""
	    word = span["word"]
	    start = span["start"]
	    end = span["end"]
	    label = span["entity_group"]

	    # Extract up to ngram_context_size words on each side of the span
	    previous_text = text[:start].strip()
	    next_text = text[end:].strip()
	    previous_words = previous_text.split()[-ngram_context_size:]
	    next_words = next_text.split()[:ngram_context_size]

	    # Build the context string in the form "[word]: left-context word right-context"
	    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
	    return word, context, label

	content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

	spans = ner(content)
	for span in spans:
	    # Only spatial named entities (place names) are passed to the classifier
	    if span['entity_group'] == 'NP_Spatial':
	        word, context, label = get_context(content, span, ngram_context_size=5)
	        print(f"Place name: {word}")

	        label = placename_classifier(context)
	        print(f"Predicted label: {label}")


	# Output
	# Place name: Wintchester
	# Predicted label: [{'label': 'City', 'score': 0.9968810081481934}]
	# Place name: Angleterre
	# Predicted label: [{'label': 'Country', 'score': 0.9953059554100037}]
	# Place name: Hampshire
	# Predicted label: [{'label': 'Region', 'score': 0.9967537522315979}]
	# Place name: Itching
	# Predicted label: [{'label': 'River', 'score': 0.9929990768432617}]
	# Place name: Salisbury
	# Predicted label: [{'label': 'City', 'score': 0.9969013929367065}]
	# Place name: Londres
	# Predicted label: [{'label': 'City', 'score': 0.9969471096992493}]

	```
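To see what the classifier actually receives without downloading either model, the sketch below applies a condensed copy of the `get_context` helper from the example above to a hand-built span. The span dictionary is hypothetical: it only mimics the `word`, `start`, `end` and `entity_group` keys that the example reads from the NER pipeline output, with character offsets computed via `str.find`.

```python
def get_context(text, span, ngram_context_size=5):
    """Condensed copy of the windowing logic from the example above."""
    word = span["word"]
    previous_words = text[:span["start"]].split()[-ngram_context_size:]
    next_words = text[span["end"]:].split()[:ngram_context_size]
    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
    return word, context, span["entity_group"]


content = "ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching"

# Hypothetical span, shaped like one item of the NER pipeline output.
start = content.find("Hampshire")
span = {
    "word": "Hampshire",
    "start": start,
    "end": start + len("Hampshire"),
    "entity_group": "NP_Spatial",
}

word, context, label = get_context(content, span, ngram_context_size=3)
print(context)
# prints: [Hampshire]: d'Angleterre, capitale du Hampshire , sur le
```

Punctuation directly after the span becomes its own whitespace-delimited token, so the comma following "Hampshire" takes up one slot of the right-hand window.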


	## Bias, Risks, and Limitations


	This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.



	## Acknowledgement

	The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
	Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.