---
language: en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- conll2003
- modernbert
datasets:
- lhoestq/conll2003
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---

# Model Card for ModernBERT-large fine-tuned on CoNLL-2003 (NER)

A **Named Entity Recognition** model based on `answerdotai/ModernBERT-large`, fine-tuned on the English CoNLL-2003 dataset. It identifies and classifies entities into four types: **Person**, **Organization**, **Location**, and **Miscellaneous**.

## Model Details

- **Base model:** [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
- **Task:** Token classification (NER)
- **Dataset:** [lhoestq/conll2003](https://huggingface.co/datasets/lhoestq/conll2003) (CoNLL-2003 English)
- **Number of labels:** 9 (BIO format)
  - O (0)
  - B-PER (1), I-PER (2)
  - B-ORG (3), I-ORG (4)
  - B-LOC (5), I-LOC (6)
  - B-MISC (7), I-MISC (8)
- **Training procedure:** Fine-tuning with Optuna hyperparameter search (20 trials)
- **Evaluation metric:** `seqeval` (overall precision, recall, F1, accuracy)

### Label Mapping

| Label ID | Entity Tag |
|----------|-------------|
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
| 7        | B-MISC      |
| 8        | I-MISC      |
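
As a minimal sketch, the table above can be written as the two dictionaries that Transformers token-classification configs conventionally carry (`id2label` and `label2id`). The variable names here are illustrative and not taken from the training script.

```python
# Label names in ID order, copied from the table above.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Illustrative names; the original training script may build these differently.
ID2LABEL = dict(enumerate(LABEL_NAMES))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[3], LABEL2ID["I-MISC"])  # B-ORG 8
```

Mappings like these are usually passed as `id2label=` and `label2id=` when loading a token-classification model, so that predictions are reported with tag names instead of integer IDs.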

## Training Procedure

### Hyperparameter Search

An **Optuna** study (20 trials) maximized validation F1 over the following search space:

- Learning rate: `[1e-5, 5e-4]` (log scale)
- Batch size per device: `[8, 16, 32]`
- Number of epochs: `[2, 6]`
- Weight decay: `[0.0, 0.1]`
- Warmup ratio: `[0.0, 0.2]`
- Gradient accumulation steps: `[1, 4]`
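
The search space above could be expressed as an Optuna-style function, sketched here under some assumptions: the function name `hp_space` is hypothetical, batch size is treated as a categorical choice, and epochs and gradient accumulation steps are treated as inclusive integer ranges. The dictionary keys are `TrainingArguments` field names; the actual training script may differ.

```python
def hp_space(trial):
    """Map the listed ranges onto Optuna `Trial.suggest_*` calls (a sketch)."""
    return {
        # Sampled on a log scale, as stated in the list above.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        # Assumption: one of three discrete values.
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        # Assumption: inclusive integer range 2..6.
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        # Assumption: inclusive integer range 1..4.
        "gradient_accumulation_steps": trial.suggest_int(
            "gradient_accumulation_steps", 1, 4
        ),
    }
```

A function of this shape is what `Trainer.hyperparameter_search` accepts through its `hp_space` argument, for example with `backend="optuna"`, `n_trials=20`, and `direction="maximize"`.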

Other fixed training arguments:
- Evaluation batch size: 8
- Max sequence length: 256
- Evaluation strategy: epoch
- Save strategy: epoch
- Best model selection based on validation F1
- Seed: 42

### Training Data

- **Training set:** CoNLL-2003 `train` split
- **Validation set:** CoNLL-2003 `validation` split (used for early stopping / best model selection)
- **Test set:** CoNLL-2003 `test` split (final evaluation)

### Tokenizer Alignment

During tokenization, each original word may be split into several subword tokens. The first subword keeps the word's original tag; continuation subwords of the same word receive the **inside label** (`I-`) of the same entity class, while subwords of `O` words stay `O`. For example, if “Microsoft” is split into `["Micro", "##soft"]` and the original tag is `B-ORG`, the first subword gets `B-ORG` and the second gets `I-ORG`. (The `##` prefix is WordPiece-style notation used here for illustration; the actual pieces depend on the ModernBERT tokenizer.) This is implemented in the `align_labels` function.
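
The source of `align_labels` is not shown in this card, so the following is only a sketch of the rule described above, under a hypothetical name and signature. It assumes `word_ids` comes from a fast tokenizer called with `is_split_into_words=True` (via `.word_ids()`), and that special tokens are masked with `-100` so the loss ignores them.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # assumption: special tokens are excluded from the loss


def align_labels_sketch(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label IDs to subword-level label IDs."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens have no source word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a word keeps the original tag.
            aligned.append(word_labels[word_id])
        else:
            label = word_labels[word_id]
            # In this card's mapping, B-X IDs are odd (1, 3, 5, 7)
            # and the matching I-X ID is one higher.
            if label % 2 == 1:
                label += 1
            aligned.append(label)
        previous_word = word_id
    return aligned


# Words: ["at", "Microsoft", "today"] tagged O, B-ORG, O; "Microsoft" spans two subwords.
print(align_labels_sketch([None, 0, 1, 1, 2, None], [0, 3, 0]))
# [-100, 0, 3, 4, 0, -100]
```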

## Evaluation Results

The model from the best Optuna trial achieved the following `seqeval` results on the **test** set:

- **Precision:** 0.87
- **Recall:** 0.91
- **F1:** 0.89
- **Accuracy:** 0.97
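
As a quick consistency check, overall F1 is the harmonic mean of overall precision and recall, so the reported figures can be compared against each other. The snippet uses only the rounded values listed above, so it reproduces the F1 only to two decimal places.

```python
precision, recall = 0.87, 0.91  # rounded values reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2f}")  # 0.89
```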


## How to Use

### Quick Pipeline

```python
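from transformers import pipeline

# aggregation_strategy="simple" merges subword predictions into whole-entity spans.
ner = pipeline(
    "token-classification",
    model="violetar/ner-model",
    aggregation_strategy="simple",
)
sentence = "John Smith works at Microsoft in New York."
results = ner(sentence)

# Each result is a dict with the entity text, its type, and a confidence score.
for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (score: {entity['score']:.2f})")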
```
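
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with keys such as `entity_group`, `word`, and `score`. A small post-processing helper, sketched below, can collect them by entity type. The helper name `group_entities`, its `min_score` parameter, and the sample list are hypothetical; the sample is hand-written to show the output shape and is not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Collect entity strings by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output shape (NOT real model output).
sample = [
    {"entity_group": "PER", "word": "John Smith", "score": 0.99},
    {"entity_group": "ORG", "word": "Microsoft", "score": 0.98},
    {"entity_group": "LOC", "word": "New York", "score": 0.40},
]
print(group_entities(sample, min_score=0.5))
# {'PER': ['John Smith'], 'ORG': ['Microsoft']}
```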