	---
	license: cc-by-nc-4.0
	language:
	- fr
	base_model:
	- google-bert/bert-base-multilingual-cased
	pipeline_tag: text-classification
	datasets:
	- GEODE/GeoEDdA-TopoRel
	---



	# bert-base-multilingual-cased-classification-relation



	This model classifies spatial relations extracted from geographic encyclopedia articles.
	It is a fine-tuned version of the bert-base-multilingual-cased model.
	It was trained on [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel), a manually annotated subset of the French *Encyclopédie ou dictionnaire raisonné des sciences des arts et des métiers par une société de gens de lettres* (1751-1772), edited by Diderot and d'Alembert (provided by the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu)).




	## Model Description


	- Authors: Bin Yang, [Ludovic Moncla](https://ludovicmoncla.github.io), [Fabien Duchateau](https://perso.liris.cnrs.fr/fabien.duchateau/) and [Frédérique Laforest](https://perso.liris.cnrs.fr/flaforest/) in the framework of the [ECoDA](https://liris.cnrs.fr/projet-institutionnel/fil-2025-projet-ecoda) and [GEODE](https://geode-project.github.io) projects
	- Model type: Text classification
	- Repository: [https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg](https://gitlab.liris.cnrs.fr/ecoda/encyclopedia2geokg)
	- Language(s) (NLP): French
	- License: cc-by-nc-4.0


	## Class labels


	The tagset is as follows:
	- Adjacency
	- Crosses
	- Distance-Orientation
	- Inclusion
	- Movement
	- Other


	## Dataset


	The model was trained using the [GeoEDdA-TopoRel](https://huggingface.co/datasets/GEODE/GeoEDdA-TopoRel) dataset.
	The dataset is split into train, validation and test sets, with the following number of entries per class:

	| Class | Train | Validation | Test |
	|---|:---:|:---:|:---:|
	| Adjacency | 498 | 59 | 75 |
	| Crosses | 397 | 50 | 29 |
	| Distance-Orientation | 1,065 | 163 | 115 |
	| Inclusion | 1,319 | 131 | 156 |
	| Movement | 184 | 15 | 35 |
	| Other | 195 | 30 | 42 |
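The split sizes and class imbalance implied by the table can be checked with a few lines of Python. This is a minimal sketch that only reuses the counts listed above; the names `counts` and `split_totals` are illustrative and are not part of the dataset or of any library API.

```python
# Per-class entry counts copied from the table above: (train, validation, test).
counts = {
    "Adjacency": (498, 59, 75),
    "Crosses": (397, 50, 29),
    "Distance-Orientation": (1065, 163, 115),
    "Inclusion": (1319, 131, 156),
    "Movement": (184, 15, 35),
    "Other": (195, 30, 42),
}


def split_totals(counts):
    """Sum the per-class counts column by column."""
    train, validation, test = (sum(col) for col in zip(*counts.values()))
    return {"train": train, "validation": validation, "test": test}


totals = split_totals(counts)
grand_total = sum(totals.values())

# Size of each split and its share of the whole dataset.
for split, n in totals.items():
    print(f"{split}: {n} entries ({n / grand_total:.1%})")

# Share of each class in the training set, largest first.
train_share = {label: c[0] / totals["train"] for label, c in counts.items()}
for label, share in sorted(train_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {share:.1%}")
```

With these counts, the splits contain 3,658, 448 and 452 entries (roughly 80/10/10), and Inclusion is about seven times more frequent than Movement in the training set, which is worth keeping in mind when reading the per-class scores.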


	## Evaluation


	* Overall model performance (weighted average)


	| | Precision | Recall | F-score |
	|---|:---:|:---:|:---:|
	| Weighted average | 0.92 | 0.92 | 0.92 |



	* Per-class model performance (test set)

	| Class | Precision | Recall | F-score | Support |
	|---|:---:|:---:|:---:|:---:|
	| Adjacency | 0.85 | 0.84 | 0.85 | 75 |
	| Crosses | 0.78 | 0.86 | 0.82 | 29 |
	| Distance-Orientation | 0.93 | 0.99 | 0.96 | 115 |
	| Inclusion | 0.97 | 0.98 | 0.97 | 156 |
	| Movement | 0.89 | 0.69 | 0.77 | 35 |
	| Other | 0.95 | 0.88 | 0.91 | 42 |
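A weighted average weights each class score by its support, so the overall figures can be recomputed from the per-class rows. The sketch below assumes the overall figures were computed on the test set (the card does not state this explicitly) and works from the rounded values above, so it can only reproduce the result to two decimals. The names `per_class` and `average` are illustrative.

```python
# Per-class test-set scores copied from the table above:
# (precision, recall, f_score, support).
per_class = {
    "Adjacency": (0.85, 0.84, 0.85, 75),
    "Crosses": (0.78, 0.86, 0.82, 29),
    "Distance-Orientation": (0.93, 0.99, 0.96, 115),
    "Inclusion": (0.97, 0.98, 0.97, 156),
    "Movement": (0.89, 0.69, 0.77, 35),
    "Other": (0.95, 0.88, 0.91, 42),
}


def average(per_class, metric_index, weighted=True):
    """Average one metric over classes, optionally weighted by support."""
    rows = list(per_class.values())
    if weighted:
        total_support = sum(row[3] for row in rows)
        return sum(row[metric_index] * row[3] for row in rows) / total_support
    # Macro average: every class counts equally, whatever its size.
    return sum(row[metric_index] for row in rows) / len(rows)


for name, idx in (("precision", 0), ("recall", 1), ("f-score", 2)):
    weighted_avg = average(per_class, idx)
    macro_avg = average(per_class, idx, weighted=False)
    print(f"{name}: weighted={weighted_avg:.2f}, macro={macro_avg:.2f}")
```

Under that assumption, the weighted precision, recall and F-score all round to 0.92, matching the overall table. The unweighted (macro) F-score is lower, about 0.88, because the small Crosses and Movement classes score below the large Inclusion and Distance-Orientation classes.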





	## How to Get Started with the Model

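	Use the code below to get started with the model. It first detects relation spans with the `GEODE/camembert-base-edda-span-classification` token-classification model, then classifies each span, together with a few words of context, with this model.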


	```python
	import torch
	from transformers import pipeline

	# Pick the best available device: Apple MPS, then CUDA, then CPU
	device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

	# Span detection model, then the relation classifier described in this card
	ner = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)
	relation_classifier = pipeline("text-classification", model="GEODE/bert-base-multilingual-cased-classification-relation", truncation=True, device=device)

	def get_context(text, span, ngram_context_size=5):
	    """Return the span text, its context window and its NER label."""
	    word = span["word"]
	    start = span["start"]
	    end = span["end"]
	    label = span["entity_group"]

	    # Extract up to `ngram_context_size` words on each side of the span
	    previous_text = text[:start].strip()
	    next_text = text[end:].strip()
	    previous_words = previous_text.split()[-ngram_context_size:]
	    next_words = next_text.split()[:ngram_context_size]

	    # Build the context string passed to the classifier
	    context = f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"
	    return word, context, label

	content = "WINCHESTER, (Géog. mod.) ou plutôt Wintchester, ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury, & à soixante sud-ouest de Londres. Long. 16. 20. latit. 51. 3."

	spans = ner(content)
	for span in spans:
	    # Only spans tagged as relations are passed to the classifier
	    if span["entity_group"] == "Relation":
	        word, context, label = get_context(content, span, ngram_context_size=5)
	        print(f"Relation: {word}")

	        prediction = relation_classifier(context)
	        print(f"Predicted label: {prediction}")

	# Output
	# Relation: sur le bord de
	# Predicted label: [{'label': 'Crosses', 'score': 0.9778845906257629}]
	# Relation: à dix-huit milles au sud-est de
	# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9959626793861389}]
	# Relation: à soixante sud-ouest de
	# Predicted label: [{'label': 'Distance-Orientation', 'score': 0.9963018894195557}]
	```
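The classifier input is not the bare relation span but a short context window built by `get_context`. To inspect that input format without downloading any model, the sketch below rebuilds the same window for a hand-written span. The `span` dictionary only mimics the `word`, `start` and `end` keys used above; the helper name `build_context` and the shortened sentence are illustrative.

```python
def build_context(text, word, start, end, ngram_context_size=5):
    """Format a span as "[span]: left-context span right-context"."""
    # Keep at most `ngram_context_size` whitespace-separated words on each side.
    previous_words = text[:start].split()[-ngram_context_size:]
    next_words = text[end:].split()[:ngram_context_size]
    return f"[{word}]: {' '.join(previous_words)} {word} {' '.join(next_words)}"


text = (
    "WINCHESTER, ville d'Angleterre, capitale du Hampshire, "
    "sur le bord de l'Itching, à dix-huit milles au sud-est de Salisbury."
)

# Hand-written span standing in for one NER prediction
# (assumption: `start` and `end` are character offsets into `text`).
word = "sur le bord de"
start = text.index(word)
span = {"word": word, "start": start, "end": start + len(word)}

context = build_context(text, span["word"], span["start"], span["end"])
print(context)
# [sur le bord de]: ville d'Angleterre, capitale du Hampshire, sur le bord de l'Itching, à dix-huit milles au
```

The bracketed prefix repeats the span so the classifier can tell which relation in the window is the one to label; a larger `ngram_context_size` gives more context at the cost of longer inputs.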


	## Bias, Risks, and Limitations


	This model was trained entirely on French encyclopaedic entries classified as Geography and will likely not perform well on text in other languages or other corpora.



	## Acknowledgement

	The authors are grateful to the [ASLAN project](https://aslan.universite-lyon.fr) (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).
	Data courtesy of the [ARTFL Encyclopédie Project](https://artfl-project.uchicago.edu), University of Chicago.