---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: token-classification
---

# RoBERTa-large fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset for Named Entity Recognition (NER). RoBERTa-large has 24 layers and ~355M parameters, compared with 12 layers and ~125M parameters for RoBERTa-base; the extra capacity is intended to help with the 18 fine-grained entity types annotated in OntoNotes.


## Performance
The following results were achieved on the OntoNotes 5.0 (v12) test set:

| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7769 | 0.7900 | 0.7834 | 1005 |
| DATE | 0.8211 | 0.8533 | 0.8369 | 1786 |
| EVENT | 0.5702 | 0.7647 | 0.6533 | 85 |
| FAC | 0.7123 | 0.6980 | 0.7051 | 149 |
| GPE | 0.9262 | 0.9470 | 0.9365 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5000 | 0.6364 | 0.5600 | 44 |
| LOC | 0.6597 | 0.7302 | 0.6932 | 215 |
| MONEY | 0.8730 | 0.9099 | 0.8910 | 355 |
| NORP | 0.9029 | 0.9485 | 0.9251 | 990 |
| ORDINAL | 0.6936 | 0.7874 | 0.7376 | 207 |
| ORG | 0.8870 | 0.9101 | 0.8984 | 2002 |
| PERCENT | 0.8703 | 0.9066 | 0.8881 | 407 |
| PERSON | 0.9250 | 0.9246 | 0.9248 | 2134 |
| PRODUCT | 0.7356 | 0.7111 | 0.7232 | 90 |
| QUANTITY | 0.6933 | 0.6797 | 0.6865 | 153 |
| TIME | 0.6211 | 0.6267 | 0.6239 | 225 |
| WORK_OF_ART | 0.6686 | 0.6923 | 0.6802 | 169 |
| **micro avg** | **0.8581** | **0.8831** | **0.8704** | **12584** |
| **macro avg** | **0.7548** | **0.7888** | **0.7701** | **12584** |
| **weighted avg** | **0.8596** | **0.8831** | **0.8710** | **12584** |

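
The macro and weighted F1 averages in the table can be cross-checked from the per-entity rows. The sketch below assumes the usual classification-report convention (F1 as the harmonic mean of precision and recall, macro as an unweighted mean, weighted as a support-weighted mean); the names `rows` and `f1_score` are illustrative and not taken from the evaluation code.

```python
# (precision, recall, support) per entity type, copied from the table above.
rows = {
    "CARDINAL": (0.7769, 0.7900, 1005), "DATE": (0.8211, 0.8533, 1786),
    "EVENT": (0.5702, 0.7647, 85), "FAC": (0.7123, 0.6980, 149),
    "GPE": (0.9262, 0.9470, 2546), "LANGUAGE": (0.7500, 0.6818, 22),
    "LAW": (0.5000, 0.6364, 44), "LOC": (0.6597, 0.7302, 215),
    "MONEY": (0.8730, 0.9099, 355), "NORP": (0.9029, 0.9485, 990),
    "ORDINAL": (0.6936, 0.7874, 207), "ORG": (0.8870, 0.9101, 2002),
    "PERCENT": (0.8703, 0.9066, 407), "PERSON": (0.9250, 0.9246, 2134),
    "PRODUCT": (0.7356, 0.7111, 90), "QUANTITY": (0.6933, 0.6797, 153),
    "TIME": (0.6211, 0.6267, 225), "WORK_OF_ART": (0.6686, 0.6923, 169),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}

# Macro average: every entity type counts equally, so rare types such as
# LAW (44 mentions) pull the score down as much as GPE (2546 mentions).
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each entity type counts in proportion to its support.
total_support = sum(support for _, _, support in rows.values())
weighted_f1 = sum(f1[name] * rows[name][2] for name in rows) / total_support

print(f"support={total_support} macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

The gap between the macro and weighted figures comes from the low-support types (EVENT, LAW, LANGUAGE), which score well below the frequent ones.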

## Training Details
The model was fine-tuned on 2× NVIDIA V100 GPUs with the following settings:
- **Architecture**: `RobertaForTokenClassification`
- **Tokenizer**: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- **Learning Rate**: 1e-5
- **Effective Batch Size**: 32 (4 per device × 2 GPUs × 4 gradient accumulation steps)
- **Epochs**: 5
- **Warmup Ratio**: 0.1
- **Mixed Precision**: FP16 enabled
- **Optimizer**: AdamW with `weight_decay=0.01`

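
The settings above map onto parameter names of `transformers.TrainingArguments`. This is a minimal sketch, assuming the run used the Hugging Face `Trainer` (suggested by the `training_args.bin` asset, but not stated outright); the `output_dir` value is hypothetical. It also shows how the effective batch size of 32 works out once both GPUs are counted.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
NUM_GPUS = 2  # 2x NVIDIA V100

training_kwargs = {
    "output_dir": "roberta-large-ontonotes5-ner",  # hypothetical path
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,  # applied by the Trainer's default AdamW optimizer
    "fp16": True,
}

# Samples contributing to each optimizer step across all devices.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * NUM_GPUS
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 32

# With transformers installed and a CUDA device available (fp16 needs one):
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```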

## Project Assets
- **GitHub Repository**: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Fine-tuned Large weights (~1.42 GB). |
| **Configuration** | `config.json` | 24-layer configuration and `id2label` map. |
| **Vocabulary** | `vocab.json` / `merges.txt` | BPE vocabulary and byte-level merge rules. |
| **Tokenizer** | `tokenizer.json` / `tokenizer_config.json` | Complete fast tokenizer setup. |
| **Special Tokens** | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
| **Training Args** | `training_args.bin` | Hyperparameters used during the training run. |

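
The ~1.42 GB size of `model.safetensors` is consistent with the ~355M parameter count if the weights are stored as 32-bit floats. That storage format is an assumption here, not something the card states; the arithmetic is only a rough check.

```python
# Rough size check for model.safetensors, assuming FP32 storage.
num_parameters = 355_000_000  # ~355M, the figure quoted for RoBERTa-large
bytes_per_parameter = 4       # assumption: weights saved as 32-bit floats

size_gb = num_parameters * bytes_per_parameter / 1e9  # decimal gigabytes
print(f"{size_gb:.2f} GB")  # 1.42 GB
```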

## Usage
```python
from transformers import pipeline

# Load the fine-tuned checkpoint; "simple" aggregation merges subword
# tokens into whole entity spans.
model_checkpoint = "learnrr/roberta-large-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)

text = "The United Nations is headquartered in New York City."
results = token_classifier(text)

# Each result is a dict describing one aggregated entity span.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
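
For downstream use it is often convenient to group the pipeline output by label. The helper below is a sketch: `group_by_label` and `min_score` are hypothetical names, and `sample_results` is a hand-written stand-in with made-up scores, not real model output. It relies only on the `entity_group`, `word`, and `score` keys that the pipeline returns when `aggregation_strategy` is set.

```python
from collections import defaultdict


def group_by_label(results, min_score=0.0):
    """Map each entity label to the entity strings found for it."""
    grouped = defaultdict(list)
    for entity in results:
        # Skip low-confidence spans; strip the leading space that
        # byte-level BPE tokenizers can leave on decoded words.
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Hand-written stand-in for pipeline output (scores are illustrative only).
sample_results = [
    {"entity_group": "ORG", "score": 0.99, "word": " United Nations", "start": 4, "end": 18},
    {"entity_group": "GPE", "score": 0.98, "word": " New York City", "start": 39, "end": 52},
]

print(group_by_label(sample_results, min_score=0.5))
```

Passing the real `results` list from the snippet above in place of `sample_results` works the same way.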