---
language: en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- conll2003
- modernbert
datasets:
- lhoestq/conll2003
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---

# Model Card for ModernBERT-large fine-tuned on CoNLL-2003 (NER)

A **Named Entity Recognition** model based on `answerdotai/ModernBERT-large`, fine-tuned on the English CoNLL-2003 dataset. It identifies and classifies entities into four types: **Person**, **Organization**, **Location**, and **Miscellaneous**.

## Model Details

- **Base model:** [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
- **Task:** Token classification (NER)
- **Dataset:** [lhoestq/conll2003](https://huggingface.co/datasets/lhoestq/conll2003) (CoNLL-2003 English)
- **Number of labels:** 9 (BIO format)
  - O (0)
  - B-PER (1), I-PER (2)
  - B-ORG (3), I-ORG (4)
  - B-LOC (5), I-LOC (6)
  - B-MISC (7), I-MISC (8)
- **Training procedure:** Fine-tuning with Optuna hyperparameter search (20 trials)
- **Evaluation metric:** `seqeval` (overall precision, recall, F1, accuracy)

### Label Mapping
| Label ID | Entity Tag |
|----------|------------|
| 0        | O          |
| 1        | B-PER      |
| 2        | I-PER      |
| 3        | B-ORG      |
| 4        | I-ORG      |
| 5        | B-LOC      |
| 6        | I-LOC      |
| 7        | B-MISC     |
| 8        | I-MISC     |

## Training Procedure

### Hyperparameter Search

An **Optuna** study (20 trials) maximized validation F1 over the following search space:

- Learning rate: `[1e-5, 5e-4]` (log scale)
- Batch size per device: `[8, 16, 32]`
- Number of epochs: `[2, 6]`
- Weight decay: `[0.0, 0.1]`
- Warmup ratio: `[0.0, 0.2]`
- Gradient accumulation steps: `[1, 4]`

Other fixed training arguments:

- Evaluation batch size: 8
- Max sequence length: 256
- Evaluation strategy: epoch
- Save strategy: epoch
- Best model selection based on validation F1
- Seed: 42
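
A search of this shape can be expressed as an Optuna-style `hp_space` for `Trainer.hyperparameter_search`. The sketch below mirrors the ranges listed above; the parameter names follow the `transformers.TrainingArguments` API, and the surrounding `Trainer` setup (model, datasets, metrics) is omitted and assumed:

```python
def optuna_hp_space(trial):
    """Search space matching the ranges listed above."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 4),
    }

# With a configured transformers.Trainer (setup omitted), the 20-trial study
# maximizing validation F1 would look roughly like:
# best = trainer.hyperparameter_search(
#     direction="maximize",
#     backend="optuna",
#     hp_space=optuna_hp_space,
#     n_trials=20,
#     compute_objective=lambda metrics: metrics["eval_f1"],
# )
```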

### Training Data

- **Training set:** CoNLL-2003 `train` split
- **Validation set:** CoNLL-2003 `validation` split (used for early stopping / best model selection)
- **Test set:** CoNLL-2003 `test` split (final evaluation)

### Tokenizer Alignment

During tokenization, each original word may be split into several subword tokens. The first subword keeps the word's original tag, and continuation subwords receive the **inside** (`I-`) label of the same entity class. For example, if "Microsoft" is split into two subword pieces and its word-level tag is `B-ORG`, the first piece is labeled `B-ORG` and the second `I-ORG`. This is implemented in the `align_labels` function.
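
A minimal, dependency-free sketch of this alignment logic, assuming the word-to-subword mapping comes from a fast tokenizer's `word_ids()` and that special tokens are masked with `-100` (the actual `align_labels` implementation may differ in detail):

```python
# Map B-* label ids to their I-* counterparts (per the label mapping above)
B_TO_I = {1: 2, 3: 4, 5: 6, 7: 8}

def align_labels(word_ids, word_labels):
    """Expand word-level BIO label ids onto subword tokens.

    word_ids:    per-subword word index (None for special tokens),
                 as returned by a fast tokenizer's word_ids().
    word_labels: one label id per original word.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                  # special tokens: ignored by the loss
        elif wid != previous:
            aligned.append(word_labels[wid])      # first subword keeps the word's tag
        else:
            lab = word_labels[wid]
            aligned.append(B_TO_I.get(lab, lab))  # continuation subwords get the I- tag
        previous = wid
    return aligned

# "Microsoft" (B-ORG = 3) split into two subwords: the second piece becomes I-ORG (4)
print(align_labels([None, 0, 0, None], [3]))  # [-100, 3, 4, -100]
```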

## Evaluation Results

After hyperparameter search, the best trial achieved the following results on the **test** set:

- **Precision:** 0.87
- **Recall:** 0.91
- **F1:** 0.89
- **Accuracy:** 0.97

## How to Use

### Quick Pipeline

```python
from transformers import pipeline

# Load the fine-tuned model; aggregation groups subword predictions into entity spans
ner = pipeline(
    "token-classification",
    model="violetar/ner-model",
    aggregation_strategy="simple",
)

sentence = "John Smith works at Microsoft in New York."
results = ner(sentence)

for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (score: {entity['score']:.2f})")
```
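
`aggregation_strategy="simple"` merges consecutive token predictions belonging to the same entity into a single span. A simplified, dependency-free illustration of the grouping idea (not the pipeline's exact algorithm, which also handles scores and character offsets):

```python
def group_entities(tags, words):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans, current_type, current_words = [], None, []
    for tag, word in zip(tags, words):
        if tag.startswith("B-"):
            if current_type:                       # close any open span
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)             # continue the open span
        else:
            if current_type:                       # "O" or mismatched I- closes the span
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type:                               # flush a span that ends the sentence
        spans.append((current_type, " ".join(current_words)))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
words = ["John", "Smith", "works", "at", "Microsoft", "in", "New", "York"]
print(group_entities(tags, words))
# [('PER', 'John Smith'), ('ORG', 'Microsoft'), ('LOC', 'New York')]
```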