---
language: en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- conll2003
- modernbert
datasets:
- lhoestq/conll2003
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---
# Model Card for ModernBERT-large fine-tuned on CoNLL-2003 (NER)
A **Named Entity Recognition** model based on `answerdotai/ModernBERT-large`, fine-tuned on the English CoNLL-2003 dataset. It identifies and classifies entities into four types: **Person**, **Organization**, **Location**, and **Miscellaneous**.
## Model Details
- **Base model:** [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
- **Task:** Token classification (NER)
- **Dataset:** [lhoestq/conll2003](https://huggingface.co/datasets/lhoestq/conll2003) (CoNLL-2003 English)
- **Number of labels:** 9, in BIO format (see the label mapping table below)
- **Training procedure:** Fine-tuning with Optuna hyperparameter search (20 trials)
- **Evaluation metric:** `seqeval` (overall precision, recall, F1, accuracy)
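For reference, a seqeval-based `compute_metrics` function for this label set typically looks like the following. This is a minimal sketch assuming the `evaluate` library; the exact training script is not reproduced here.

```python
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop special tokens (label -100) before scoring
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```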
### Label Mapping
| Label ID | Entity Tag |
|----------|-------------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
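In code, this table corresponds to the `id2label` / `label2id` entries stored in the model config, along the lines of:

```python
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
label2id = {label: idx for idx, label in id2label.items()}
```

The mapping shipped with the checkpoint can be inspected via `AutoConfig.from_pretrained("violetar/ner-model").id2label`.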
## Training Procedure
### Hyperparameter Search
An **Optuna** study (20 trials) maximized validation F1 over the following search space (see the code sketch after these lists):
- Learning rate: `[1e-5, 5e-4]` (log scale)
- Batch size per device: `[8, 16, 32]`
- Number of epochs: `[2, 6]`
- Weight decay: `[0.0, 0.1]`
- Warmup ratio: `[0.0, 0.2]`
- Gradient accumulation steps: `[1, 4]`
Other fixed training arguments:
- Evaluation batch size: 8
- Max sequence length: 256
- Evaluation strategy: epoch
- Save strategy: epoch
- Best model selection based on validation F1
- Seed: 42
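A minimal sketch of how such a search can be wired together with `Trainer.hyperparameter_search` and the Optuna backend. The exact training script is not reproduced here: `tokenized` is assumed to be the dataset produced by the alignment step described below, and `compute_metrics` is the seqeval function sketched above.

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

def model_init():
    # A fresh model for every trial so runs do not share weights
    return AutoModelForTokenClassification.from_pretrained(
        "answerdotai/ModernBERT-large", num_labels=9
    )

def hp_space(trial):
    # Mirrors the search space listed above
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "gradient_accumulation_steps": trial.suggest_categorical(
            "gradient_accumulation_steps", [1, 4]
        ),
    }

args = TrainingArguments(
    output_dir="ner-model",
    eval_strategy="epoch",   # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    per_device_eval_batch_size=8,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    seed=42,
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized["train"],   # assumed: tokenized CoNLL-2003 splits
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,    # assumed: seqeval function from above
)

best_trial = trainer.hyperparameter_search(
    direction="maximize", backend="optuna", hp_space=hp_space, n_trials=20
)
```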
### Training Data
- **Training set:** CoNLL-2003 `train` split
- **Validation set:** CoNLL-2003 `validation` split (used for early stopping / best model selection)
- **Test set:** CoNLL-2003 `test` split (final evaluation)
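The splits can be loaded directly from the Hub (a quick sketch):

```python
from datasets import load_dataset

dataset = load_dataset("lhoestq/conll2003")
print(dataset)  # DatasetDict with train / validation / test splits
```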
### Tokenizer Alignment
During tokenization, the original words are split into subwords. Continuation subwords of the same word are assigned the **inside** (`I-`) label of the corresponding entity class, where applicable. For example, if “Microsoft” is split into two subwords and the original tag is `B-ORG`, the first subword is labeled `B-ORG` and the second `I-ORG`. This is implemented in the `align_labels` function.
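A minimal sketch of what such an alignment function can look like, assuming a fast tokenizer (so `word_ids()` is available), the label ids above, and the standard convention of masking special tokens with `-100`; the repository's actual `align_labels` may differ in details:

```python
def align_labels(examples, tokenizer, max_length=256):
    # Tokenize pre-split words; `is_split_into_words` keeps word boundaries recoverable
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=max_length,
        is_split_into_words=True,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        labels = []
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)  # special tokens: ignored by the loss
            elif word_id != previous_word:
                labels.append(word_labels[word_id])  # first subword keeps the word's tag
            else:
                label = word_labels[word_id]
                # Continuation subwords: map B-X (odd ids) to the matching I-X (id + 1)
                labels.append(label + 1 if label % 2 == 1 else label)
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized
```

It would typically be applied with `dataset.map(lambda batch: align_labels(batch, tokenizer), batched=True)`.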
## Evaluation Results
After hyperparameter search, the best trial achieved the following results on the **test** set:
- **Precision:** 0.87
- **Recall:** 0.91
- **F1:** 0.89
- **Accuracy:** 0.97
## How to Use
### Quick Pipeline
```python
from transformers import pipeline

# `aggregation_strategy="simple"` merges subword predictions into whole-entity spans
ner = pipeline("token-classification", model="violetar/ner-model", aggregation_strategy="simple")

sentence = "John Smith works at Microsoft in New York."
results = ner(sentence)

for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (score: {entity['score']:.2f})")
```
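### Manual Inference
For finer-grained control, e.g. inspecting per-token predictions instead of aggregated spans, the model can also be called directly (a short sketch):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("violetar/ner-model")
model = AutoModelForTokenClassification.from_pretrained("violetar/ner-model")

inputs = tokenizer("John Smith works at Microsoft in New York.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Per-token argmax, mapped back to the BIO label names
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```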