---
license: other
inference:
  parameters:
    guidance_scale: 1
library_name: transformers
---
# Fine-Tuning mDeBERTa for Named Entity Recognition (NER)

## 📌 Model Overview

This repository contains a fine-tuned version of `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` for **Named Entity Recognition (NER)** using the `mnaguib/WikiNER` dataset in multiple languages.

## 🚀 Features

- **Pretrained on mDeBERTa**: A powerful multilingual model for text understanding.
- **Fine-tuned for NER**: Detects persons (`PER`), locations (`LOC`), organizations (`ORG`), and miscellaneous entities (`MISC`).

## 📖 Training Details

- **Base model**: `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli`
- **Dataset**: `mnaguib/WikiNER`
- **Languages**: English (`en`), Spanish (`es`), ...
- **Epochs**: `2`
- **Optimizer**: AdamW
- **Loss function**: CrossEntropyLoss
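
Because mDeBERTa's SentencePiece tokenizer splits words into subwords, the dataset's word-level NER tags must be realigned to tokens before fine-tuning. The helper below is a minimal sketch of that step, not code from this repository; the function name and the `-100` masking convention follow common `transformers` practice (label the first subword of each word and mask the rest so `CrossEntropyLoss` ignores them):

```python
def align_labels_with_tokens(word_ids, word_labels):
    """Map word-level NER label ids onto subword tokens.

    word_ids: one entry per subword token, as returned by
              tokenizer(..., is_split_into_words=True).word_ids();
              None marks special tokens.
    word_labels: one label id per original word.
    Returns one label id per subword token; -100 marks positions
    (special tokens and non-first subwords) the loss should ignore.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # [CLS], [SEP], [PAD]
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(-100)                  # later subwords are masked
        previous_word = word_id
    return aligned

# Example: two words, the second split into three subwords
word_ids = [None, 0, 1, 1, 1, None]              # [CLS] w0 w1a w1b w1c [SEP]
labels = align_labels_with_tokens(word_ids, [1, 1])
print(labels)  # [-100, 1, 1, -100, -100, -100]
```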

## Inference

To use the model for inference:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model_path = "jordigonzm/mdeberta-v3-base-multilingual-ner"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

# NER Prediction Function
def predict_ner(text):
    tokens = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**tokens)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    tokens_decoded = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
    return list(zip(tokens_decoded, predictions))

# Example
text = "The Mona Lisa is located in the Louvre Museum, in Paris."
result = predict_ner(text)
print(result)
```

## Post-Processing Function

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from rich.console import Console
from rich.table import Table

# Load the model and tokenizer
model_path = "jordigonzm/mdeberta-v3-base-multilingual-ner"
model = AutoModelForTokenClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

# NER label mapping ("O" = outside any entity)
id_to_label = {0: "O", 1: "LOC", 2: "PER", 3: "MISC", 4: "ORG"}

# Post-processing function to merge subtokens into whole words
def postprocess_ner(decoded_tokens, predictions):
    ner_results = []
    current_word = ""
    current_label = None

    for token, label in zip(decoded_tokens, predictions):
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue  # Ignore special tokens

        if token.startswith("▁"):  # SentencePiece marks the start of a new word
            if current_word:
                ner_results.append((current_word, id_to_label.get(current_label, "O")))
            current_word = token[1:]  # Remove the '▁' prefix
            current_label = label
        else:  # Subtoken: append to the current word
            current_word += token

    if current_word:  # Add the last word
        ner_results.append((current_word, id_to_label.get(current_label, "O")))

    return ner_results

# NER Prediction Function
def predict_ner(text):
    tokens = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only
        outputs = model(**tokens)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    decoded_tokens = tokenizer.convert_ids_to_tokens(tokens["input_ids"].squeeze().tolist())
    return postprocess_ner(decoded_tokens, predictions)

# Display results in a table, skipping non-entity words
def display_ner_results(results):
    console = Console()
    table = Table(title="Entity Classification", show_lines=True)

    table.add_column("Token", justify="left", style="cyan")
    table.add_column("Entity", justify="center", style="magenta")

    for token, entity in results:
        if entity != "O":
            table.add_row(token, entity)

    console.print(table)

# Example
text = "The Mona Lisa is located in the Louvre Museum, in Paris."
result = predict_ner(text)
display_ner_results(result)
```
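
The word-level pairs returned by `postprocess_ner` can optionally be grouped into multi-word entity spans (e.g. "Louvre Museum" as a single `LOC`). The sketch below is an illustrative addition, not part of the model's code; it merges consecutive words that share the same non-`"O"` label:

```python
def group_entities(ner_results):
    """Merge consecutive (word, label) pairs with the same non-"O" label
    into (entity_text, label) spans."""
    spans = []
    current_words, current_label = [], None
    for word, label in ner_results + [("", "O")]:  # sentinel flushes the last span
        if label == current_label and label != "O":
            current_words.append(word)             # continue the current entity
        else:
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words = [word] if label != "O" else []
            current_label = label if label != "O" else None
    return spans

# Example with word-level output of the shape postprocess_ner produces
words = [("The", "O"), ("Louvre", "LOC"), ("Museum", "LOC"), ("in", "O"), ("Paris", "LOC")]
print(group_entities(words))  # [('Louvre Museum', 'LOC'), ('Paris', 'LOC')]
```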

## Model Usage

You can load the model directly from Hugging Face:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("jordigonzm/mdeberta-v3-base-multilingual-ner")
tokenizer = AutoTokenizer.from_pretrained("jordigonzm/mdeberta-v3-base-multilingual-ner")
```

---