---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: token-classification
---

# RoBERTa-base fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for named entity recognition, trained on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. RoBERTa refines BERT's pretraining recipe: it uses dynamic masking, drops the next-sentence prediction objective, trains with larger batches, and tokenizes with byte-level BPE.

## Performance

The following results were obtained on the OntoNotes 5.0 (v12) test set:

| | **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** | |
| | :--- | :---: | :---: | :---: | :---: | |
| | CARDINAL | 0.7813 | 0.8070 | 0.7939 | 1005 | |
| | DATE | 0.8032 | 0.8729 | 0.8366 | 1786 | |
| | EVENT | 0.5566 | 0.6941 | 0.6178 | 85 | |
| | FAC | 0.7059 | 0.6443 | 0.6737 | 149 | |
| | GPE | 0.9283 | 0.9356 | 0.9319 | 2546 | |
| | LANGUAGE | 0.8667 | 0.5909 | 0.7027 | 22 | |
| | LAW | 0.4643 | 0.5909 | 0.5200 | 44 | |
| | LOC | 0.7354 | 0.7628 | 0.7489 | 215 | |
| | MONEY | 0.8414 | 0.8817 | 0.8611 | 355 | |
| | NORP | 0.9004 | 0.9404 | 0.9200 | 990 | |
| | ORDINAL | 0.6944 | 0.8454 | 0.7625 | 207 | |
| | ORG | 0.8653 | 0.8986 | 0.8816 | 2002 | |
| | PERCENT | 0.8605 | 0.9091 | 0.8841 | 407 | |
| | PERSON | 0.9083 | 0.9236 | 0.9159 | 2134 | |
| | PRODUCT | 0.6771 | 0.7222 | 0.6989 | 90 | |
| | QUANTITY | 0.7410 | 0.6732 | 0.7055 | 153 | |
| | TIME | 0.5670 | 0.6578 | 0.6091 | 225 | |
| | WORK_OF_ART | 0.6105 | 0.6864 | 0.6462 | 169 | |
| | **micro avg** | **0.8471** | **0.8822** | **0.8643** | **12584** | |
| | **macro avg** | **0.7504** | **0.7798** | **0.7617** | **12584** | |
| | **weighted avg** | **0.8497** | **0.8822** | **0.8652** | **12584** | |
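
The three average rows can be cross-checked against the per-entity rows. The snippet below is a minimal sketch, assuming the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over entity types, and the weighted average uses support as the weight. The card does not state which evaluation tool produced the report, and the variable names are illustrative.

```python
# Per-entity F1 and support values copied from the table, in row order
# (CARDINAL ... WORK_OF_ART).
per_entity_f1 = [
    0.7939, 0.8366, 0.6178, 0.6737, 0.9319, 0.7027,
    0.5200, 0.7489, 0.8611, 0.9200, 0.7625, 0.8816,
    0.8841, 0.9159, 0.6989, 0.7055, 0.6091, 0.6462,
]
per_entity_support = [
    1005, 1786, 85, 149, 2546, 22,
    44, 215, 355, 990, 207, 2002,
    407, 2134, 90, 153, 225, 169,
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Micro F1 from the reported micro-averaged precision and recall.
micro_f1 = f1_score(0.8471, 0.8822)

# Macro F1: every entity type counts equally, regardless of frequency.
macro_f1 = sum(per_entity_f1) / len(per_entity_f1)

# Weighted F1: each entity type is weighted by its number of gold spans.
total_support = sum(per_entity_support)
weighted_f1 = sum(
    f1 * n for f1, n in zip(per_entity_f1, per_entity_support)
) / total_support

print(f"micro={micro_f1:.4f} macro={macro_f1:.4f} "
      f"weighted={weighted_f1:.4f} support={total_support}")
```

The gap between the macro and micro averages comes from rare, harder types such as LAW, EVENT, and TIME, which pull the unweighted mean down while contributing little to the span-level totals.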

## Training Details

- **Architecture**: `RobertaForTokenClassification`
- **Tokenizer**: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- **Epochs**: 5
- **Learning Rate**: 2e-5
- **Batch Size**: 16 per device (32 total across 2x V100 GPUs)
- **Max Sequence Length**: 128
- **Weight Decay**: 0.01
- **Mixed Precision (FP16)**: Enabled
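
These hyperparameters map naturally onto the Hugging Face `Trainer` API, which the presence of `training_args.bin` suggests was used. The following is a sketch under that assumption, not the author's training script. The `hyperparams` dict, `NUM_GPUS`, `MAX_SEQ_LENGTH`, and `build_training_args` are illustrative names, and the `output_dir` default is a placeholder.

```python
# Values taken from the Training Details list; the keyword names follow
# transformers.TrainingArguments.
hyperparams = {
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
    "fp16": True,  # mixed precision; needs a CUDA-capable GPU at training time
}

MAX_SEQ_LENGTH = 128  # applied when tokenizing, not a TrainingArguments field
NUM_GPUS = 2          # 2x V100, per the card

# Effective batch size under data parallelism, assuming no gradient accumulation.
effective_batch_size = hyperparams["per_device_train_batch_size"] * NUM_GPUS


def build_training_args(output_dir="./roberta-ontonotes5-ner"):
    """Build TrainingArguments from the values above (output_dir is a placeholder)."""
    # Imported lazily so the dict can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **hyperparams)


print(effective_batch_size)
```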

## Project Assets

- **GitHub Repository**: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| | **Asset** | **File** | **Description** | |
| | :--- | :--- | :--- | |
| | **Model Weights** | `model.safetensors` | Fine-tuned RoBERTa weights (~496 MB). | |
| | **Configuration** | `config.json` | Model architecture & `id2label` mappings. | |
| | **Vocabulary** | `vocab.json` / `merges.txt` | Byte-level BPE vocabulary and merge rules. | |
| | **Tokenizer** | `tokenizer.json` | Full fast tokenizer configuration. | |
| | **Special Tokens** | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. | |
| | **Training Args** | `training_args.bin` | Detailed dump of the training hyperparameters. | |
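
To inspect the label set without loading the weights, `config.json` can be read directly. This is a small sketch, assuming the file follows the standard Transformers layout in which `id2label` is stored with string keys. `load_id2label` and `entity_types` are hypothetical helper names, and the BIO-style prefixes (`B-`, `I-`) are an assumption about this checkpoint's tagging scheme.

```python
import json


def load_id2label(config_path):
    """Read the id -> label mapping from a Transformers config.json file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # JSON object keys are always strings, so convert them back to integer ids.
    return {int(idx): label for idx, label in config["id2label"].items()}


def entity_types(id2label):
    """Strip B-/I- prefixes (if present) and return the sorted entity types."""
    types = set()
    for label in id2label.values():
        if label == "O":  # the "outside" tag is not an entity type
            continue
        prefix, sep, rest = label.partition("-")
        types.add(rest if sep and prefix in {"B", "I"} else label)
    return sorted(types)
```

After downloading the file, `entity_types(load_id2label("config.json"))` should list the entity types the model can emit, which can then be compared with the rows of the performance table.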

## Usage

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model_checkpoint = "learnrr/roberta-base-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

text = "Microsoft Corporation was founded by Bill Gates and Paul Allen."
results = token_classifier(text)

# Each result is a dict holding the span text, its label, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
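
Downstream code often needs the entities grouped by type rather than printed. The sketch below assumes `results` has the shape the pipeline returns when an aggregation strategy is set: a list of dicts with `entity_group`, `score`, `word`, `start`, and `end`. `group_entities`, the `min_score` threshold, and the sample values are illustrative and are not actual model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity strings by label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            # Span text may carry a leading space from byte-level BPE decoding.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Hand-written sample in the pipeline's output format (scores are made up).
sample_results = [
    {"entity_group": "ORG", "score": 0.99, "word": " Microsoft Corporation", "start": 0, "end": 21},
    {"entity_group": "PERSON", "score": 0.98, "word": " Bill Gates", "start": 37, "end": 47},
    {"entity_group": "PERSON", "score": 0.30, "word": " Paul", "start": 52, "end": 56},
]

print(group_entities(sample_results))
```

The `start` and `end` character offsets are left untouched here; they are the more reliable way to recover the exact surface form from the input text when the decoded `word` differs in whitespace.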