---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- google-bert/bert-large-cased
pipeline_tag: token-classification
---

# BERT-large-cased fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [google-bert/bert-large-cased](https://huggingface.co/google-bert/bert-large-cased), trained on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. It performs Named Entity Recognition (NER) across the 18 OntoNotes entity categories.

## 📊 Performance
The model achieves the following results on the OntoNotes 5.0 test set:

| **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7813 | 0.7891 | 0.7851 | 1005 |
| DATE | 0.7988 | 0.8516 | 0.8244 | 1786 |
| EVENT | 0.5619 | 0.6941 | 0.6211 | 85 |
| FAC | 0.6880 | 0.5772 | 0.6277 | 149 |
| GPE | 0.9185 | 0.9207 | 0.9196 | 2546 |
| LANGUAGE | 0.8421 | 0.7273 | 0.7805 | 22 |
| LAW | 0.4762 | 0.6818 | 0.5607 | 44 |
| LOC | 0.6337 | 0.7163 | 0.6725 | 215 |
| MONEY | 0.8636 | 0.9099 | 0.8861 | 355 |
| NORP | 0.8481 | 0.8909 | 0.8690 | 990 |
| ORDINAL | 0.7054 | 0.7633 | 0.7332 | 207 |
| ORG | 0.8690 | 0.9046 | 0.8864 | 2002 |
| PERCENT | 0.8467 | 0.8821 | 0.8640 | 407 |
| PERSON | 0.9090 | 0.9217 | 0.9153 | 2134 |
| PRODUCT | 0.6667 | 0.6889 | 0.6776 | 90 |
| QUANTITY | 0.6972 | 0.6471 | 0.6712 | 153 |
| TIME | 0.6106 | 0.6133 | 0.6120 | 225 |
| WORK_OF_ART | 0.6354 | 0.6805 | 0.6571 | 169 |
| **micro avg** | **0.8412** | **0.8675** | **0.8542** | **12584** |
| **macro avg** | **0.7418** | **0.7700** | **0.7535** | **12584** |
| **weighted avg** | **0.8427** | **0.8675** | **0.8546** | **12584** |
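
For readers who want to sanity-check the table, the sketch below recomputes F1 from precision and recall and the macro average from the per-entity F1 column. The numbers are copied from the table above; the helper name `f1_score` is only illustrative and is not taken from the evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Per-entity F1 values copied from the table above.
f1_by_entity = {
    "CARDINAL": 0.7851, "DATE": 0.8244, "EVENT": 0.6211, "FAC": 0.6277,
    "GPE": 0.9196, "LANGUAGE": 0.7805, "LAW": 0.5607, "LOC": 0.6725,
    "MONEY": 0.8861, "NORP": 0.8690, "ORDINAL": 0.7332, "ORG": 0.8864,
    "PERCENT": 0.8640, "PERSON": 0.9153, "PRODUCT": 0.6776,
    "QUANTITY": 0.6712, "TIME": 0.6120, "WORK_OF_ART": 0.6571,
}

# F1 for a single row, recomputed from its (rounded) precision and recall.
gpe_f1 = f1_score(0.9185, 0.9207)

# Macro average: the unweighted mean over the 18 entity types, so rare
# classes such as LANGUAGE count as much as frequent ones such as GPE.
macro_f1 = sum(f1_by_entity.values()) / len(f1_by_entity)

print(f"GPE F1:   {gpe_f1:.4f}")
print(f"macro F1: {macro_f1:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values agree with the table only to within rounding. The gap between the macro and micro averages reflects the weaker scores on low-support classes such as `LAW` and `EVENT`.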

## 🛠 Training Details
- **Architecture**: `BertForTokenClassification` (Large)
- **Tokenizer**: `BertTokenizerFast` (using `is_split_into_words=True`)
- **Epochs**: 5
- **Learning Rate**: 1e-5
- **Batch Size**: 4 per device (2x V100 GPUs)
- **Gradient Accumulation**: 4 steps (Effective Batch Size = 32)
- **Max Sequence Length**: 128
- **Weight Decay**: 0.01
- **Mixed Precision (FP16)**: Enabled
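
Tokenizing pre-split words with `is_split_into_words=True` produces more sub-word tokens than there are word-level labels, so the labels have to be aligned to the tokens. This card does not describe how that was done; the sketch below shows one common convention (label the first sub-token of each word, ignore the rest), written in plain Python. It assumes a `word_ids` list shaped like the one returned by a fast tokenizer's `BatchEncoding.word_ids()`; the function name and the example values are hypothetical.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100

def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto sub-word tokens.

    Special tokens (word id None) and continuation sub-tokens get
    IGNORE_INDEX; the first sub-token of each word keeps the word's label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation sub-token
        previous = word_id
    return aligned

# Hypothetical input: three words, the second split into two sub-tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [0, 5, 0]  # illustrative label ids, not the model's real mapping
print(align_labels(example_word_ids, example_word_labels))
```

An alternative convention copies the word's label to every sub-token; which one was used here would need to be checked against the training code in the linked repository.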

## 📂 Labels Mapping
The model identifies 18 entity types from OntoNotes 5.0:
`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
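
Token-classification models usually predict per-token tags rather than bare entity types. Assuming a BIO scheme (an `O` tag plus `B-`/`I-` variants of each type), which this card does not state explicitly, the 18 types expand to 37 labels. The sketch below builds such a mapping; the id order is hypothetical, and the authoritative mapping is the `id2label` entry in `config.json`.

```python
# The 18 entity types listed above.
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]

# Assumed BIO tag set: "O" for non-entity tokens, then B-/I- per type.
# The ordering here is illustrative only; check config.json for the real ids.
bio_labels = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]

id2label = dict(enumerate(bio_labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(bio_labels))   # 1 + 2 * 18 tags
print(bio_labels[:5])
```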

## 📂 Project Assets
- **GitHub Repository**: https://github.com/Learnrr/ontonotes5_ner_evaluation.git

| **Asset** | **File** | **Description** |
| :--- | :--- | :--- |
| **Model Weights** | `model.safetensors` | Large-scale checkpoint (~1.2 GB). |
| **Configuration** | `config.json` | Model architecture & `id2label` mapping. |
| **Vocabulary** | `vocab.txt` | BERT-cased specific vocabulary. |
| **Tokenizer** | `tokenizer.json` | Optimized fast tokenizer configuration. |
| **Special Tokens** | `special_tokens_map.json` | Definitions for the `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens. |
| **Training Args** | `training_args.bin` | Detailed hyperparameter dump from the Trainer. |

## 🚀 Usage
```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model_checkpoint = "learnrr/bert-large-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

text = "The United Nations is headquartered in New York City."
results = token_classifier(text)

# Each result is a dict describing one aggregated entity span.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
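
With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` keys. As a rough sketch of downstream handling, the helper below drops low-confidence spans and groups the rest by label. The function name, the `min_score` threshold, and the sample values are hypothetical and hand-written for illustration; they are not real model output.

```python
from collections import defaultdict
from typing import Dict, List

def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity surface forms by label, keeping spans scored >= min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)

# Hand-written sample in the pipeline's output format (scores are made up).
sample_results = [
    {"entity_group": "ORG", "score": 0.99, "word": "United Nations", "start": 4, "end": 18},
    {"entity_group": "GPE", "score": 0.98, "word": "New York City", "start": 39, "end": 52},
    {"entity_group": "DATE", "score": 0.30, "word": "is", "start": 19, "end": 21},
]

print(group_entities(sample_results))
```

The `start` and `end` character offsets can be used to slice the original text when the exact surface form matters, since `word` is reconstructed from tokens.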