---
language: en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- conll2003
- modernbert
datasets:
- lhoestq/conll2003
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---

# Model Card for ModernBERT-large fine-tuned on CoNLL-2003 (NER)

A **Named Entity Recognition** model based on `answerdotai/ModernBERT-large`, fine-tuned on the English CoNLL-2003 dataset. It identifies and classifies entities into four types: **Person**, **Organization**, **Location**, and **Miscellaneous**.

## Model Details

- **Base model:** [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
- **Task:** Token classification (NER)
- **Dataset:** [lhoestq/conll2003](https://huggingface.co/datasets/lhoestq/conll2003) (CoNLL-2003 English)
- **Number of labels:** 9 (BIO format)
  - O (0)
  - B-PER (1), I-PER (2)
  - B-ORG (3), I-ORG (4)
  - B-LOC (5), I-LOC (6)
  - B-MISC (7), I-MISC (8)
- **Training procedure:** Fine-tuning with Optuna hyperparameter search (20 trials)
- **Evaluation metric:** `seqeval` (overall precision, recall, F1, accuracy)

### Label Mapping

| Label ID | Entity Tag |
|----------|------------|
| 0        | O          |
| 1        | B-PER      |
| 2        | I-PER      |
| 3        | B-ORG      |
| 4        | I-ORG      |
| 5        | B-LOC      |
| 6        | I-LOC      |
| 7        | B-MISC     |
| 8        | I-MISC     |

## Training Procedure

### Hyperparameter Search

An **Optuna** study (20 trials) maximized validation F1 over the following search space (a matching `hp_space` sketch appears under "Reproduction Sketches" below):

- Learning rate: `[1e-5, 5e-4]` (log scale)
- Batch size per device: `[8, 16, 32]`
- Number of epochs: `[2, 6]`
- Weight decay: `[0.0, 0.1]`
- Warmup ratio: `[0.0, 0.2]`
- Gradient accumulation steps: `[1, 4]`

Other fixed training arguments:

- Evaluation batch size: 8
- Max sequence length: 256
- Evaluation strategy: epoch
- Save strategy: epoch
- Best model selection based on validation F1
- Seed: 42

### Training Data

- **Training set:** CoNLL-2003 `train` split
- **Validation set:** CoNLL-2003 `validation` split (used for early stopping / best-model selection)
- **Test set:** CoNLL-2003 `test` split (final evaluation)

### Tokenizer Alignment

During tokenization, each input word may be split into several subword pieces. Pieces that continue a word are assigned the **inside** (`I-`) label of the corresponding entity class, where applicable. For example, if "Microsoft" is split into the pieces `Micro` and `soft` and the word-level tag is `B-ORG`, the first piece keeps `B-ORG` and the second receives `I-ORG`. This is implemented in the `align_labels` function; a sketch of the described behavior appears under "Reproduction Sketches" below.

## Evaluation Results

After hyperparameter search, the best trial achieved the following results on the **test** set:

- **Precision:** 0.87
- **Recall:** 0.91
- **F1:** 0.89
- **Accuracy:** 0.97

## How to Use

### Quick Pipeline

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces back into word-level entity spans.
ner = pipeline(
    "token-classification",
    model="violetar/ner-model",
    aggregation_strategy="simple",
)

sentence = "John Smith works at Microsoft in New York."
results = ner(sentence)

for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (score: {entity['score']:.2f})")
```
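
### Manual Inference

If you need explicit control over tokenization and decoding, the model can also be loaded directly. This is a minimal sketch using the generic `transformers` token-classification API; nothing here beyond the model ID is specific to this card.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "violetar/ner-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

sentence = "John Smith works at Microsoft in New York."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label ID per subword piece; map IDs back to BIO tags.
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label_id in zip(tokens, predicted_ids):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"{token}: {label}")
```

Unlike the pipeline with `aggregation_strategy="simple"`, this prints per-piece tags; merging pieces back into word-level spans is left to the caller.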
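
## Reproduction Sketches

The training script itself is not part of this card; the snippets below are illustrative sketches, not the original code.

### Hyperparameter Search Space

Assuming the Optuna study was run through `Trainer.hyperparameter_search` with the `optuna` backend (an assumption; the card only states that Optuna was used), the search space listed above maps onto an `hp_space` function like this:

```python
# Hypothetical hp_space mirroring the ranges listed under
# "Hyperparameter Search"; written for transformers'
# Trainer.hyperparameter_search with backend="optuna".
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "gradient_accumulation_steps": trial.suggest_int(
            "gradient_accumulation_steps", 1, 4
        ),
    }

# best_trial = trainer.hyperparameter_search(
#     direction="maximize", backend="optuna", hp_space=hp_space, n_trials=20
# )
```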
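
### Label Alignment

The card's `align_labels` function is not reproduced here. The following is a minimal sketch of the behavior described under "Tokenizer Alignment", using the `word_ids()` mapping exposed by fast tokenizers; the signature is hypothetical.

```python
def align_labels(words, word_tags, tokenizer, label2id):
    """Sketch: propagate word-level BIO tags to subword pieces.

    The first piece of each word keeps the word's tag; continuation
    pieces of a B- tagged word get the matching I- tag. Special
    tokens get -100 so the loss ignores them.
    """
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous_word_id = [], None
    for word_id in encoding.word_ids():
        if word_id is None:                  # special tokens
            labels.append(-100)
        elif word_id != previous_word_id:    # first piece of a word
            labels.append(label2id[word_tags[word_id]])
        else:                                # continuation piece
            tag = word_tags[word_id]
            inside = "I-" + tag[2:] if tag.startswith("B-") else tag
            labels.append(label2id[inside])
        previous_word_id = word_id
    encoding["labels"] = labels
    return encoding
```

With the card's example, a word tagged `B-ORG` that is split into two pieces yields `B-ORG` for the first piece and `I-ORG` for the second.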