Create README.md
README.md
ADDED
@@ -0,0 +1,84 @@
---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- google-bert/bert-base-cased
pipeline_tag: token-classification
---
# BERT-base-cased fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and identifies 18 entity types.

## Performance
The model achieves the following results on the OntoNotes 5.0 test set:

| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 |
| DATE | 0.7943 | 0.8628 | 0.8272 | 1786 |
| EVENT | 0.5000 | 0.6235 | 0.5550 | 85 |
| FAC | 0.6081 | 0.6040 | 0.6061 | 149 |
| GPE | 0.9243 | 0.9156 | 0.9199 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5200 | 0.5909 | 0.5532 | 44 |
| LOC | 0.6478 | 0.7442 | 0.6926 | 215 |
| MONEY | 0.8760 | 0.9155 | 0.8953 | 355 |
| NORP | 0.8956 | 0.9182 | 0.9067 | 990 |
| ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 |
| ORG | 0.8621 | 0.8991 | 0.8802 | 2002 |
| PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 |
| PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 |
| PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 |
| QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 |
| TIME | 0.5906 | 0.6667 | 0.6263 | 225 |
| WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 |
| **micro avg** | **0.8413** | **0.8710** | **0.8559** | **12584** |
| **macro avg** | **0.7297** | **0.7649** | **0.7460** | **12584** |
| **weighted avg** | **0.8440** | **0.8710** | **0.8570** | **12584** |

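The macro and weighted averages can be reproduced from the per-entity rows. The sketch below assumes the usual definitions (macro is the unweighted mean of per-entity F1, weighted is the support-weighted mean) and recomputes both from the F1-Score and Support columns. The variable and function names are illustrative.

```python
# (F1, support) per entity type, copied from the results table.
rows = {
    "CARDINAL": (0.7920, 1005),
    "DATE": (0.8272, 1786),
    "EVENT": (0.5550, 85),
    "FAC": (0.6061, 149),
    "GPE": (0.9199, 2546),
    "LANGUAGE": (0.7143, 22),
    "LAW": (0.5532, 44),
    "LOC": (0.6926, 215),
    "MONEY": (0.8953, 355),
    "NORP": (0.9067, 990),
    "ORDINAL": (0.7506, 207),
    "ORG": (0.8802, 2002),
    "PERCENT": (0.8790, 407),
    "PERSON": (0.9121, 2134),
    "PRODUCT": (0.6170, 90),
    "QUANTITY": (0.6780, 153),
    "TIME": (0.6263, 225),
    "WORK_OF_ART": (0.6229, 169),
}


def macro_f1(rows):
    """Unweighted mean of the per-entity F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-entity F1 scores, weighted by support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


total_support = sum(support for _, support in rows.values())
print(f"support={total_support} macro={macro_f1(rows):.4f} weighted={weighted_f1(rows):.4f}")
```

The micro average pools true positives, false positives, and false negatives across all entity types, so it cannot be recovered from the F1 column alone. Rare types such as LANGUAGE and LAW pull the macro average well below the micro and weighted averages.
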
## Training Details
- **Architecture**: `BertForTokenClassification`
- **Tokenizer**: `BertTokenizerFast` (using `is_split_into_words=True` for alignment)
- **Epochs**: 5
- **Learning Rate**: 2e-5
- **Batch Size**: 16 per device (total 32 on 2x V100 GPUs)
- **Max Sequence Length**: 128
- **Weight Decay**: 0.01
- **Mixed Precision (FP16)**: Enabled

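The `is_split_into_words=True` option tokenizes text that is already split into words, so each word-level tag has to be mapped onto the resulting subword tokens. The sketch below shows one common convention, assumed here rather than taken from the training code: only the first subword of each word keeps its label, while special tokens and continuation subwords get `-100`, which PyTorch's cross-entropy loss ignores by default. `align_labels` is a hypothetical helper, and `word_ids` has the shape returned by `BatchEncoding.word_ids()` on a fast tokenizer.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens (first-subword convention)."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a new word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword of the same word is masked out.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [5, 0, 7]
print(align_labels(example_word_ids, example_word_labels))
# [-100, 5, 0, -100, 7, -100]
```
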
## Label Mapping
The model was trained with BIO tags over the following 18 OntoNotes entity types:
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.

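With BIO tagging, each of the 18 entity types gets a `B-` and an `I-` tag, and a single `O` tag covers everything else, for 37 labels in total. The sketch below builds such a label set. The ordering is an assumption made for illustration; the authoritative mapping is the `id2label` entry in the model's `config.json`.

```python
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]

# "O" first, then B-/I- pairs per entity type (illustrative ordering only).
labels = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))  # 37
```
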
## Project Assets
- **GitHub Repository**: https://github.com/Learnrr/ontonotes5_ner_evaluation.git

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~496 MB). |
| **Configuration** | `config.json` | Model architecture settings and `id2label` entity mapping. |
| **Vocabulary** | `vocab.txt` | WordPiece vocabulary file used for tokenization. |
| **Tokenizer** | `tokenizer.json` | Full fast tokenizer configuration for production inference. |
| **Special Tokens** | `special_tokens_map.json` | Mapping for tokens like `[CLS]`, `[SEP]`, `[MASK]`, etc. |
| **Training Args** | `training_args.bin` | Serialized training arguments (hyperparameters) used during training. |

## Usage
You can use this model directly with a `pipeline` for token classification:
```python
from transformers import pipeline

model_checkpoint = "learnrr/bert-base-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",  # merge subword tokens into whole entity spans
)

text = "Apple was founded by Steve Jobs in Cupertino."
results = token_classifier(text)

# Each result is a dict describing one aggregated entity span.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
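
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below shows one way to post-process that output: `group_entities` is a hypothetical helper that drops low-confidence spans and groups the remaining surface forms by label. The `sample` list is hand-written to mimic the output shape; its labels and scores are illustrative, not actual model predictions.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity surface forms by label, keeping spans scored >= min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (values are illustrative only).
sample = [
    {"entity_group": "ORG", "score": 0.99, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PERSON", "score": 0.98, "word": "Steve Jobs", "start": 21, "end": 31},
    {"entity_group": "GPE", "score": 0.40, "word": "Cupertino", "start": 35, "end": 44},
]

print(group_entities(sample))
# {'ORG': ['Apple'], 'PERSON': ['Steve Jobs']}
```
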