---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-base
pipeline_tag: token-classification
---
# RoBERTa-base fine-tuned on OntoNotes 5.0
This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base), trained on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset for named entity recognition. RoBERTa refines BERT's pretraining recipe: it uses dynamic masking, drops the next-sentence prediction objective, trains with larger batches, and tokenizes text with byte-level BPE.
## 📊 Performance
The following results were achieved on the OntoNotes 5.0 (v12) test set:
| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7813 | 0.8070 | 0.7939 | 1005 |
| DATE | 0.8032 | 0.8729 | 0.8366 | 1786 |
| EVENT | 0.5566 | 0.6941 | 0.6178 | 85 |
| FAC | 0.7059 | 0.6443 | 0.6737 | 149 |
| GPE | 0.9283 | 0.9356 | 0.9319 | 2546 |
| LANGUAGE | 0.8667 | 0.5909 | 0.7027 | 22 |
| LAW | 0.4643 | 0.5909 | 0.5200 | 44 |
| LOC | 0.7354 | 0.7628 | 0.7489 | 215 |
| MONEY | 0.8414 | 0.8817 | 0.8611 | 355 |
| NORP | 0.9004 | 0.9404 | 0.9200 | 990 |
| ORDINAL | 0.6944 | 0.8454 | 0.7625 | 207 |
| ORG | 0.8653 | 0.8986 | 0.8816 | 2002 |
| PERCENT | 0.8605 | 0.9091 | 0.8841 | 407 |
| PERSON | 0.9083 | 0.9236 | 0.9159 | 2134 |
| PRODUCT | 0.6771 | 0.7222 | 0.6989 | 90 |
| QUANTITY | 0.7410 | 0.6732 | 0.7055 | 153 |
| TIME | 0.5670 | 0.6578 | 0.6091 | 225 |
| WORK_OF_ART | 0.6105 | 0.6864 | 0.6462 | 169 |
| **micro avg** | **0.8471** | **0.8822** | **0.8643** | **12584** |
| **macro avg** | **0.7504** | **0.7798** | **0.7617** | **12584** |
| **weighted avg** | **0.8497** | **0.8822** | **0.8652** | **12584** |
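As a quick sanity check on the summary rows, the macro and weighted F1 averages can be recomputed from the per-entity F1 scores and support counts in the table. This is a minimal sketch using only the numbers listed above; it is not the evaluation script used to produce them, and the micro average is left out because it depends on raw true/false positive counts that the table does not include.

```python
# Per-entity (F1, support) pairs copied from the table above.
per_entity = {
    "CARDINAL": (0.7939, 1005),
    "DATE": (0.8366, 1786),
    "EVENT": (0.6178, 85),
    "FAC": (0.6737, 149),
    "GPE": (0.9319, 2546),
    "LANGUAGE": (0.7027, 22),
    "LAW": (0.5200, 44),
    "LOC": (0.7489, 215),
    "MONEY": (0.8611, 355),
    "NORP": (0.9200, 990),
    "ORDINAL": (0.7625, 207),
    "ORG": (0.8816, 2002),
    "PERCENT": (0.8841, 407),
    "PERSON": (0.9159, 2134),
    "PRODUCT": (0.6989, 90),
    "QUANTITY": (0.7055, 153),
    "TIME": (0.6091, 225),
    "WORK_OF_ART": (0.6462, 169),
}

total_support = sum(n for _, n in per_entity.values())

# Macro average: every entity type counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in per_entity.values()) / len(per_entity)

# Weighted average: each entity type is weighted by its support.
weighted_f1 = sum(f1 * n for f1, n in per_entity.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

The gap between the macro and weighted averages reflects the weaker scores on rare types such as LAW, EVENT, and TIME, which have little influence on the weighted figure.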
## 🛠 Training Details
- **Architecture**: `RobertaForTokenClassification`
- **Tokenizer**: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- **Epochs**: 5
- **Learning Rate**: 2e-5
- **Batch Size**: 16 per device (32 in total across 2× V100 GPUs)
- **Max Sequence Length**: 128
- **Weight Decay**: 0.01
- **Mixed Precision (FP16)**: Enabled
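The settings above could be collected roughly as follows. This is a sketch, not the original training script: the dictionary keys follow `transformers.TrainingArguments` parameter names, while `MAX_SEQ_LENGTH`, `NUM_GPUS`, and the `output_dir` value in the commented lines are illustrative names chosen here.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
training_kwargs = {
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
    "fp16": True,
}

# Illustrative constants (names are assumptions, not taken from the training code).
MAX_SEQ_LENGTH = 128  # applied at tokenization time, not a TrainingArguments field
NUM_GPUS = 2          # 2x V100

# Effective batch size per optimizer step across all devices.
effective_batch_size = training_kwargs["per_device_train_batch_size"] * NUM_GPUS
print(f"effective batch size: {effective_batch_size}")

# Assumed wiring (needs `transformers` and a CUDA device for fp16):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_kwargs)
```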
## 📂 Project Assets
- **GitHub Repository**: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Fine-tuned RoBERTa weights (~496 MB). |
| **Configuration** | `config.json` | Model architecture & `id2label` mappings. |
| **Vocabulary** | `vocab.json` / `merges.txt` | Byte-level BPE vocabulary and merge rules. |
| **Tokenizer** | `tokenizer.json` | Full fast tokenizer configuration. |
| **Special Tokens** | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
| **Training Args** | `training_args.bin` | Detailed dump of the training hyperparameters. |
## 🚀 Usage
```python
from transformers import pipeline
model_checkpoint = "learnrr/roberta-base-ontonotes5-ner"
# "simple" aggregation merges sub-word tokens into whole entity spans.
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)
text = "Microsoft Corporation was founded by Bill Gates and Paul Allen."
results = token_classifier(text)
# Each result is a dict holding the entity text, its label, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
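For downstream use it is often convenient to group the aggregated spans by label and drop low-confidence predictions. The helper below is a hypothetical sketch: `group_entities` and its `min_score` threshold are not part of this repository, and the `sample` records are hand-written placeholders in the pipeline's output format, not real model predictions.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group aggregated pipeline output by label, skipping spans below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            # Byte-level BPE can leave a leading space on the decoded span.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Placeholder records shaped like the pipeline output (scores are made up).
sample = [
    {"entity_group": "ORG", "word": " Example Corp", "score": 0.99, "start": 0, "end": 12},
    {"entity_group": "PERSON", "word": " Jane Doe", "score": 0.97, "start": 28, "end": 36},
    {"entity_group": "PERSON", "word": " John Roe", "score": 0.30, "start": 41, "end": 49},
]

print(group_entities(sample))
```

In practice, `results` from the snippet above would be passed in place of `sample`, with `min_score` tuned to the precision the application needs.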