---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: token-classification
---

# RoBERTa-large fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. RoBERTa-large has 24 layers and ~355M parameters, giving it more capacity than the base architecture for Named Entity Recognition (NER).

## 📊 Performance

The following results were achieved on the OntoNotes 5.0 (v12) test set:

| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7769 | 0.7900 | 0.7834 | 1005 |
| DATE | 0.8211 | 0.8533 | 0.8369 | 1786 |
| EVENT | 0.5702 | 0.7647 | 0.6533 | 85 |
| FAC | 0.7123 | 0.6980 | 0.7051 | 149 |
| GPE | 0.9262 | 0.9470 | 0.9365 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5000 | 0.6364 | 0.5600 | 44 |
| LOC | 0.6597 | 0.7302 | 0.6932 | 215 |
| MONEY | 0.8730 | 0.9099 | 0.8910 | 355 |
| NORP | 0.9029 | 0.9485 | 0.9251 | 990 |
| ORDINAL | 0.6936 | 0.7874 | 0.7376 | 207 |
| ORG | 0.8870 | 0.9101 | 0.8984 | 2002 |
| PERCENT | 0.8703 | 0.9066 | 0.8881 | 407 |
| PERSON | 0.9250 | 0.9246 | 0.9248 | 2134 |
| PRODUCT | 0.7356 | 0.7111 | 0.7232 | 90 |
| QUANTITY | 0.6933 | 0.6797 | 0.6865 | 153 |
| TIME | 0.6211 | 0.6267 | 0.6239 | 225 |
| WORK_OF_ART | 0.6686 | 0.6923 | 0.6802 | 169 |
| **micro avg** | **0.8581** | **0.8831** | **0.8704** | **12584** |
| **macro avg** | **0.7548** | **0.7888** | **0.7701** | **12584** |
| **weighted avg** | **0.8596** | **0.8831** | **0.8710** | **12584** |

## 🛠 Training Details

The model was fine-tuned on 2× NVIDIA V100 GPUs with the following settings:

- **Architecture**: `RobertaForTokenClassification`
- **Tokenizer**: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- **Learning Rate**: 1e-5
- **Effective Batch Size**: 32 (4 per device × 2 GPUs × 4 gradient accumulation steps)
- **Epochs**: 5
- **Warmup Ratio**: 0.1
- **Mixed Precision**: FP16 enabled
- **Optimizer**: AdamW with `weight_decay=0.01`

## 📂 Project Assets

- **GitHub Repository**: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Fine-tuned Large weights (~1.42 GB). |
| **Configuration** | `config.json` | 24-layer configuration and `id2label` map. |
| **Vocabulary** | `vocab.json` / `merges.txt` | BPE vocabulary and byte-level merge rules. |
| **Tokenizer** | `tokenizer.json` / `tokenizer_config.json` | Complete fast tokenizer setup. |
| **Special Tokens** | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
| **Training Args** | `training_args.bin` | Hyperparameters used during the training run. |

## 🚀 Usage

```python
from transformers import pipeline

model_checkpoint = "learnrr/roberta-large-ontonotes5-ner"

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple"
)

text = "The United Nations is headquartered in New York City."
results = token_classifier(text)

# Each result is a dict holding the entity text, its label, and a confidence score
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
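
The effective batch size listed under Training Details can be checked with a little arithmetic. The sketch below assumes that the per-device batch of 4 is multiplied by both the 2 GPUs and the 4 gradient accumulation steps, which is the only combination of the stated numbers that yields 32. The dictionary keys mirror argument names of `transformers.TrainingArguments`, but whether the original run passed them in exactly this form is an assumption; the variable names are illustrative.

```python
# Values copied from the Training Details list; names are illustrative.
num_gpus = 2                      # "2xNVIDIA V100 GPUs"
per_device_batch_size = 4
gradient_accumulation_steps = 4

# Assumption: samples per optimizer step = per-device batch x devices x accumulation steps.
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps

# Keys follow transformers.TrainingArguments argument names (assumed mapping,
# not taken from the original training script).
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": per_device_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    "num_train_epochs": 5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "fp16": True,
}

print(effective_batch_size)  # 32
```

The exact values used in the run are stored in `training_args.bin`, which remains the authoritative record.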
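
The macro and weighted averages in the Performance table weight the per-entity scores differently, and that is why they differ by about ten points of F1. As a rough illustration, the sketch below recomputes both from the per-entity F1 and support columns of the table. The formulas are the standard definitions (unweighted mean versus support-weighted mean); the evaluation script itself is not shown here, so treat this as a consistency check and not as the original evaluation code.

```python
# (entity, f1, support) copied from the Performance table.
rows = [
    ("CARDINAL", 0.7834, 1005), ("DATE", 0.8369, 1786), ("EVENT", 0.6533, 85),
    ("FAC", 0.7051, 149), ("GPE", 0.9365, 2546), ("LANGUAGE", 0.7143, 22),
    ("LAW", 0.5600, 44), ("LOC", 0.6932, 215), ("MONEY", 0.8910, 355),
    ("NORP", 0.9251, 990), ("ORDINAL", 0.7376, 207), ("ORG", 0.8984, 2002),
    ("PERCENT", 0.8881, 407), ("PERSON", 0.9248, 2134), ("PRODUCT", 0.7232, 90),
    ("QUANTITY", 0.6865, 153), ("TIME", 0.6239, 225), ("WORK_OF_ART", 0.6802, 169),
]

total_support = sum(support for _, _, support in rows)

# Macro average: every entity type counts equally, so rare types such as LAW pull it down.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each entity type counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"support={total_support} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

Both values agree with the table to four decimal places, which suggests that the low-support types (EVENT, LAW, LANGUAGE) account for most of the gap between the macro and weighted scores.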