---
license: other
inference:
  parameters:
    guidance_scale: 1
library_name: transformers
---

# Fine-Tuning mDeBERTa for Named Entity Recognition (NER)

## 📌 Model Overview

This repository contains a fine-tuned version of `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` for **Named Entity Recognition (NER)** using the `mnaguib/WikiNER` dataset in multiple languages.
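
For quick experiments, the checkpoint can also be driven through the `pipeline` API. This is a minimal sketch; the `aggregation_strategy` option assumes the uploaded config carries an `id2label` mapping:

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges subwords
# into whole-word entities (requires id2label in the model config).
ner = pipeline(
    "token-classification",
    model="jordigonzm/mdeberta-v3-base-multilingual-ner",
    aggregation_strategy="simple",
)
print(ner("The Mona Lisa is located in the Louvre Museum, in Paris."))
```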

## 🚀 Features

- **Built on mDeBERTa**: a powerful multilingual model for text understanding.
- **Fine-tuned for NER**: detects persons (`PER`), locations (`LOC`), organizations (`ORG`), and miscellaneous entities (`MISC`).

## 📖 Training Details

- **Base model**: `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli`
- **Dataset**: `mnaguib/WikiNER`
- **Languages**: English (`en`), Spanish (`es`), ...
- **Epochs**: `2`
- **Optimizer**: AdamW
- **Loss function**: CrossEntropyLoss
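
The card pins only the settings above. As an illustration, here is a minimal fine-tuning sketch with the `Trainer` API; the WikiNER column names (`tokens`, `ner_tags`), the `"en"` config name, the `"train"` split, the learning rate, and the batch size are assumptions, not the exact recipe. The label list comes from the id-to-label mapping used later in this card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

base = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
tokenizer = AutoTokenizer.from_pretrained(base)
dataset = load_dataset("mnaguib/WikiNER", "en")  # config name assumed

# Label list taken from the id-to-label mapping used later in this card
label_list = ["O", "LOC", "PER", "MISC", "ORG"]
model = AutoModelForTokenClassification.from_pretrained(
    base,
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    ignore_mismatched_sizes=True,  # the NLI head is replaced by a fresh NER head
)

def tokenize_and_align(batch):
    # Tokenize pre-split words; copy each word's tag to its first subtoken
    # and mask the rest with -100 so CrossEntropyLoss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev_wid, row = None, []
        for wid in enc.word_ids(batch_index=i):
            if wid is None or wid == prev_wid:
                row.append(-100)
            else:
                row.append(tags[wid])
            prev_wid = wid
        all_labels.append(row)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(
    output_dir="mdeberta-ner",
    num_train_epochs=2,              # from the card
    learning_rate=2e-5,              # assumed
    per_device_train_batch_size=16,  # assumed
)

# Trainer defaults to AdamW; the model computes cross-entropy internally.
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
).train()
```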

## Inference

To use the model for inference:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model_path = "jordigonzm/mdeberta-v3-base-multilingual-ner"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

# NER prediction function: returns (subword token, predicted label id) pairs
def predict_ner(text):
    tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**tokens)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    tokens_decoded = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
    return list(zip(tokens_decoded, predictions))

# Example
text = "The Mona Lisa is located in the Louvre Museum, in Paris."
result = predict_ner(text)
print(result)
```
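
This function returns raw `(subword, label_id)` pairs, so multi-piece words appear split and the ids are not yet human-readable label names. The post-processing section below merges the pieces back into words and maps the ids to labels.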

## Post-Processing Function

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from rich.console import Console
from rich.table import Table

# Load the model and tokenizer
model_path = "jordigonzm/mdeberta-v3-base-multilingual-ner"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

# NER label mapping ("O" marks tokens outside any entity)
id_to_label = {0: "O", 1: "LOC", 2: "PER", 3: "MISC", 4: "ORG"}

# Post-processing function to merge subtokens into whole words
def postprocess_ner(decoded_tokens, predictions):
    ner_results = []
    current_word = ""
    current_label = None

    for token, label in zip(decoded_tokens, predictions):
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue  # ignore special tokens

        if token.startswith("▁"):  # SentencePiece marks a new word
            if current_word:
                ner_results.append((current_word, id_to_label.get(current_label, "O")))
            current_word = token[1:]  # strip the '▁' prefix
            current_label = label
        else:  # subtoken: append to the current word
            current_word += token

    if current_word:  # add the last word
        ner_results.append((current_word, id_to_label.get(current_label, "O")))

    return ner_results

# NER prediction function
def predict_ner(text):
    tokens = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokens)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    decoded_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
    return postprocess_ner(decoded_tokens, predictions)

# Display results in a table, skipping non-entity words
def display_ner_results(results):
    console = Console()
    table = Table(title="Entity Classification", show_lines=True)
    table.add_column("Token", justify="left", style="cyan")
    table.add_column("Entity", justify="center", style="magenta")

    for token, entity in results:
        if entity != "O":
            table.add_row(token, str(entity))

    console.print(table)

# Example
text = "The Mona Lisa is located in the Louvre Museum, in Paris."
result = predict_ner(text)
display_ner_results(result)
```
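
The merge step relies on mDeBERTa's SentencePiece tokenizer, which prefixes the first piece of every word with `▁` (U+2581); a merged word simply inherits the label predicted for its first subtoken.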

## Model Usage

You can load the model directly from Hugging Face:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned checkpoint from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("jordigonzm/mdeberta-v3-base-multilingual-ner")
tokenizer = AutoTokenizer.from_pretrained("jordigonzm/mdeberta-v3-base-multilingual-ner")
```
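
If a GPU is available, inference is faster with the model moved onto it and gradient tracking disabled:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()  # eval mode disables dropout

inputs = tokenizer("The Louvre is in Paris.", return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradient tracking
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```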

---