| --- |
| license: mit |
| datasets: |
| - ontonotes/conll2012_ontonotesv5 |
| language: |
| - en |
| base_model: |
| - google-bert/bert-base-cased |
| pipeline_tag: token-classification |
| --- |
| # BERT-base-cased fine-tuned on OntoNotes 5.0 |
|
|
| This model is a fine-tuned version of [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and can identify 18 types of entities. |
|
|
| ## Performance |
| The model achieves the following results on the OntoNotes 5.0 test set: |
|
|
| | **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** | |
| | :--- | :---: | :---: | :---: | :---: | |
| | CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 | |
| | DATE | 0.7943 | 0.8628 | 0.8272 | 1786 | |
| | EVENT | 0.5000 | 0.6235 | 0.5550 | 85 | |
| | FAC | 0.6081 | 0.6040 | 0.6061 | 149 | |
| | GPE | 0.9243 | 0.9156 | 0.9199 | 2546 | |
| | LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 | |
| | LAW | 0.5200 | 0.5909 | 0.5532 | 44 | |
| | LOC | 0.6478 | 0.7442 | 0.6926 | 215 | |
| | MONEY | 0.8760 | 0.9155 | 0.8953 | 355 | |
| | NORP | 0.8956 | 0.9182 | 0.9067 | 990 | |
| | ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 | |
| | ORG | 0.8621 | 0.8991 | 0.8802 | 2002 | |
| | PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 | |
| | PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 | |
| | PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 | |
| | QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 | |
| | TIME | 0.5906 | 0.6667 | 0.6263 | 225 | |
| | WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 | |
| | **micro avg** | **0.8413** | **0.8710** | **0.8559** | **12584** | |
| | **macro avg** | **0.7297** | **0.7649** | **0.7460** | **12584** | |
| | **weighted avg** | **0.8440** | **0.8710** | **0.8570** | **12584** | |
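The summary rows follow the usual conventions of a classification report: the macro average is the unweighted mean over the 18 entity types, and the weighted average weights each type by its support. The card does not state which tool produced the table, so the following is only a minimal sketch, assuming standard averaging, that recomputes the F1 averages from the per-entity rows above. The helper names `macro_f1` and `weighted_f1` are hypothetical.

```python
# Per-entity (F1, support) pairs, copied from the table above.
rows = {
    "CARDINAL": (0.7920, 1005),
    "DATE": (0.8272, 1786),
    "EVENT": (0.5550, 85),
    "FAC": (0.6061, 149),
    "GPE": (0.9199, 2546),
    "LANGUAGE": (0.7143, 22),
    "LAW": (0.5532, 44),
    "LOC": (0.6926, 215),
    "MONEY": (0.8953, 355),
    "NORP": (0.9067, 990),
    "ORDINAL": (0.7506, 207),
    "ORG": (0.8802, 2002),
    "PERCENT": (0.8790, 407),
    "PERSON": (0.9121, 2134),
    "PRODUCT": (0.6170, 90),
    "QUANTITY": (0.6780, 153),
    "TIME": (0.6263, 225),
    "WORK_OF_ART": (0.6229, 169),
}


def macro_f1(rows):
    """Unweighted mean of the per-entity F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-entity F1 scores, weighted by support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


total_support = sum(support for _, support in rows.values())

print(f"macro F1:      {macro_f1(rows):.4f}")
print(f"weighted F1:   {weighted_f1(rows):.4f}")
print(f"total support: {total_support}")
```

Because macro averaging gives a rare type such as `LANGUAGE` (22 mentions) the same weight as `GPE` (2546 mentions), the macro F1 is noticeably lower than the micro and weighted figures.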
|
|
| ## Training Details |
| - **Architecture**: `BertForTokenClassification` |
| - **Tokenizer**: `BertTokenizerFast` (using `is_split_into_words=True` for alignment) |
| - **Epochs**: 5 |
| - **Learning Rate**: 2e-5 |
| - **Batch Size**: 16 per device (effective batch size 32 across 2x V100 GPUs) |
| - **Max Sequence Length**: 128 |
| - **Weight Decay**: 0.01 |
| - **Mixed Precision (FP16)**: Enabled |
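The tokenizer note above refers to a common alignment problem: OntoNotes labels are given per word, while BERT operates on WordPiece subwords. The card does not say which alignment strategy was used, so the following is only a sketch of one widely used option. Here `word_ids` stands for the list returned by `BatchEncoding.word_ids()` of a fast tokenizer (with `None` for special tokens), and `align_labels` is a hypothetical helper name. Only the first subword of each word keeps its label; special tokens and continuation subwords get `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto subword tokens.

    word_ids:    one entry per token; None marks a special token.
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a new word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword of the same word is masked out
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written example: three words, the second split into two subwords.
# Token layout: [CLS] w0 w1a w1b w2 [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [5, 0, 7]  # arbitrary label ids, for illustration only

print(align_labels(word_ids, word_labels))
# [-100, 5, 0, -100, 7, -100]
```

An alternative is to copy the label to every subword of a word; which variant works better is an empirical question.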
|
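As a rough illustration, the hyperparameters listed above could be collected as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions, not the author's training script: `output_dir` is a hypothetical path, and the maximum sequence length is kept separate because it is applied at tokenization time rather than passed to `TrainingArguments`.

```python
# Keyword arguments whose names match fields of transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "./bert-ontonotes5-ner",  # hypothetical path
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
    "fp16": True,  # mixed precision; needs a CUDA device
}

MAX_SEQ_LENGTH = 128  # used by the tokenizer (max_length with truncation=True)
NUM_GPUS = 2          # 2x V100, as stated above

# Effective batch size = per-device batch size times number of devices
effective_batch_size = training_kwargs["per_device_train_batch_size"] * NUM_GPUS
print(effective_batch_size)  # 32

# With transformers installed and a GPU available, this could be used as:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```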
|
| ## Labels Mapping |
| The model predicts BIO tags (`B-` and `I-` prefixes plus `O`) over the following 18 OntoNotes entity types; the exact `id2label` mapping is stored in `config.json`: |
| `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. |
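To make the tag set concrete, the sketch below expands the 18 entity types into a BIO label list. It assumes a standard BIO scheme with a single `O` tag, which gives 37 labels. The index order shown here is hypothetical; the authoritative mapping is the `id2label` entry in `config.json`.

```python
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # subsequent tokens of the same entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)

# Illustrative mappings only; the real indices may be ordered differently.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))  # 37
print(labels[:3])   # ['O', 'B-CARDINAL', 'I-CARDINAL']
```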
|
|
| ## Project Assets |
| - **GitHub Repository**: https://github.com/Learnrr/ontonotes5_ner_evaluation.git |
|
| | **Asset** | **File** | **Description** | |
| | :--- | :--- | :--- | |
| | **Model Weights** | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~431 MB). | |
| | **Configuration** | `config.json` | Model architecture settings and `id2label` entity mapping. | |
| | **Vocabulary** | `vocab.txt` | BERT-cased WordPiece vocabulary for tokenization. | |
| | **Tokenizer** | `tokenizer.json` / `tokenizer_config.json` | Optimized fast tokenizer configuration and serialization. | |
| | **Special Tokens** | `special_tokens_map.json` | Definitions for special tokens like `[CLS]`, `[SEP]`, etc. | |
| | **Training Args** | `training_args.bin` | Detailed hyperparameter settings used during the training run. | |
|
|
| ## Usage |
| You can use this model directly with a pipeline for token classification: |
| ```python |
| from transformers import pipeline |
| |
| # Hub ID of the fine-tuned checkpoint |
| model_checkpoint = "learnrr/bert-base-ontonotes5-ner" |
| token_classifier = pipeline( |
|     "token-classification", |
|     model=model_checkpoint, |
|     aggregation_strategy="simple",  # group adjacent tokens of the same entity into one span |
| ) |
| |
| text = "Apple was founded by Steve Jobs in Cupertino." |
| results = token_classifier(text) |
| |
| # Each result is one entity span with its label and confidence score |
| for entity in results: |
|     print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}") |
| ``` |
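With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with keys such as `entity_group`, `score`, `word`, `start`, and `end`. The following sketch shows one way to drop low-confidence spans and group the rest by label. The `sample_results` list is hand-written placeholder data in that shape, not actual output from this model, and `group_entities` and the 0.5 threshold are hypothetical choices.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity strings by label, keeping spans with score >= min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


text = "Apple was founded by Steve Jobs in Cupertino."

# Placeholder data shaped like pipeline output; scores are made up.
sample_results = [
    {"entity_group": "ORG", "score": 0.90, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PERSON", "score": 0.90, "word": "Steve Jobs", "start": 21, "end": 31},
    {"entity_group": "GPE", "score": 0.40, "word": "Cupertino", "start": 35, "end": 44},
]

# The start/end offsets index into the original text
for entity in sample_results:
    assert text[entity["start"]:entity["end"]] == entity["word"]

print(group_entities(sample_results))
# {'ORG': ['Apple'], 'PERSON': ['Steve Jobs']}
```

In real use, pass the `results` list from the pipeline call above instead of `sample_results`, and tune `min_score` on held-out data.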