---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: token-classification
---

# RoBERTa-base fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset for named entity recognition. RoBERTa refines the BERT pretraining recipe: it uses dynamic masking, drops the next-sentence prediction objective, trains with larger batches, and tokenizes with byte-level BPE.

## 📊 Performance

The following results were achieved on the OntoNotes 5.0 (v12) test set:

| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7813 | 0.8070 | 0.7939 | 1005 |
| DATE | 0.8032 | 0.8729 | 0.8366 | 1786 |
| EVENT | 0.5566 | 0.6941 | 0.6178 | 85 |
| FAC | 0.7059 | 0.6443 | 0.6737 | 149 |
| GPE | 0.9283 | 0.9356 | 0.9319 | 2546 |
| LANGUAGE | 0.8667 | 0.5909 | 0.7027 | 22 |
| LAW | 0.4643 | 0.5909 | 0.5200 | 44 |
| LOC | 0.7354 | 0.7628 | 0.7489 | 215 |
| MONEY | 0.8414 | 0.8817 | 0.8611 | 355 |
| NORP | 0.9004 | 0.9404 | 0.9200 | 990 |
| ORDINAL | 0.6944 | 0.8454 | 0.7625 | 207 |
| ORG | 0.8653 | 0.8986 | 0.8816 | 2002 |
| PERCENT | 0.8605 | 0.9091 | 0.8841 | 407 |
| PERSON | 0.9083 | 0.9236 | 0.9159 | 2134 |
| PRODUCT | 0.6771 | 0.7222 | 0.6989 | 90 |
| QUANTITY | 0.7410 | 0.6732 | 0.7055 | 153 |
| TIME | 0.5670 | 0.6578 | 0.6091 | 225 |
| WORK_OF_ART | 0.6105 | 0.6864 | 0.6462 | 169 |
| **micro avg** | **0.8471** | **0.8822** | **0.8643** | **12584** |
| **macro avg** | **0.7504** | **0.7798** | **0.7617** | **12584** |
| **weighted avg** | **0.8497** | **0.8822** | **0.8652** | **12584** |

## 🛠 Training Details

- **Architecture**: `RobertaForTokenClassification`
- **Tokenizer**: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- **Epochs**: 5
- **Learning Rate**: 2e-5
- **Batch Size**: 16 per device (32 in total across 2x V100 GPUs)
- **Max Sequence Length**: 128
- **Weight Decay**: 0.01
- **Mixed Precision (FP16)**: Enabled

## 📂 Project Assets

- **GitHub Repository**: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Fine-tuned RoBERTa weights (~496 MB). |
| **Configuration** | `config.json` | Model architecture & `id2label` mappings. |
| **Vocabulary** | `vocab.json` / `merges.txt` | Byte-level BPE vocabulary and merge rules. |
| **Tokenizer** | `tokenizer.json` | Full fast tokenizer configuration. |
| **Special Tokens** | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
| **Training Args** | `training_args.bin` | Serialized training arguments (hyperparameters). |

## 🚀 Usage

```python
from transformers import pipeline

model_checkpoint = "learnrr/roberta-base-ontonotes5-ner"

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)

text = "Microsoft Corporation was founded by Bill Gates and Paul Allen."
results = token_classifier(text)

# Each result is a dict describing one aggregated entity span.
for entity in results:
    print(
        f"Entity: {entity['word']} | "
        f"Label: {entity['entity_group']} | "
        f"Score: {entity['score']:.4f}"
    )
```
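
The macro and weighted averages in the Performance table differ mainly because rare entity types such as LAW and LANGUAGE count as much as GPE or PERSON in the macro average. The sketch below recomputes both F1 averages from the per-entity rows. It assumes the usual definitions (an unweighted mean for macro, a support-weighted mean for weighted); the helper names `macro_f1` and `weighted_f1` are illustrative and do not come from the training repository.

```python
# (label, F1, support) copied from the per-entity rows of the Performance table.
ROWS = [
    ("CARDINAL", 0.7939, 1005),
    ("DATE", 0.8366, 1786),
    ("EVENT", 0.6178, 85),
    ("FAC", 0.6737, 149),
    ("GPE", 0.9319, 2546),
    ("LANGUAGE", 0.7027, 22),
    ("LAW", 0.5200, 44),
    ("LOC", 0.7489, 215),
    ("MONEY", 0.8611, 355),
    ("NORP", 0.9200, 990),
    ("ORDINAL", 0.7625, 207),
    ("ORG", 0.8816, 2002),
    ("PERCENT", 0.8841, 407),
    ("PERSON", 0.9159, 2134),
    ("PRODUCT", 0.6989, 90),
    ("QUANTITY", 0.7055, 153),
    ("TIME", 0.6091, 225),
    ("WORK_OF_ART", 0.6462, 169),
]


def macro_f1(rows):
    """Unweighted mean of per-entity F1: every entity type counts equally."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of per-entity F1, weighted by support (number of gold entities)."""
    total_support = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total_support


if __name__ == "__main__":
    print(f"total support: {sum(support for _, _, support in ROWS)}")
    print(f"macro F1:      {macro_f1(ROWS):.4f}")
    print(f"weighted F1:   {weighted_f1(ROWS):.4f}")
```

Both values should agree with the corresponding table rows up to rounding. The micro average cannot be recovered by averaging the F1 column, since it is computed from true-positive, false-positive, and false-negative counts pooled over all entity types.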