learnrr committed
Commit 42e3f34 · verified · 1 Parent(s): 88a1d57

Create README.md

Files changed (1)
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ license: mit
+ datasets:
+ - ontonotes/conll2012_ontonotesv5
+ language:
+ - en
+ base_model:
+ - google-bert/bert-base-cased
+ pipeline_tag: token-classification
+ ---
+ # BERT-base-cased fine-tuned on OntoNotes 5.0
+
+ This model is a fine-tuned version of [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) on the English subset of the **OntoNotes 5.0** (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and can identify 18 types of entities.
+
+ ## 📊 Performance
+ The model achieves the following results on the OntoNotes 5.0 test set:
+
+ | **Entity** | **Precision** | **Recall** | **F1-Score** | **Support** |
+ | :--- | :---: | :---: | :---: | :---: |
+ | CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 |
+ | DATE | 0.7943 | 0.8628 | 0.8272 | 1786 |
+ | EVENT | 0.5000 | 0.6235 | 0.5550 | 85 |
+ | FAC | 0.6081 | 0.6040 | 0.6061 | 149 |
+ | GPE | 0.9243 | 0.9156 | 0.9199 | 2546 |
+ | LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
+ | LAW | 0.5200 | 0.5909 | 0.5532 | 44 |
+ | LOC | 0.6478 | 0.7442 | 0.6926 | 215 |
+ | MONEY | 0.8760 | 0.9155 | 0.8953 | 355 |
+ | NORP | 0.8956 | 0.9182 | 0.9067 | 990 |
+ | ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 |
+ | ORG | 0.8621 | 0.8991 | 0.8802 | 2002 |
+ | PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 |
+ | PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 |
+ | PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 |
+ | QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 |
+ | TIME | 0.5906 | 0.6667 | 0.6263 | 225 |
+ | WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 |
+ | **micro avg** | **0.8413** | **0.8710** | **0.8559** | **12584** |
+ | **macro avg** | **0.7297** | **0.7649** | **0.7460** | **12584** |
+ | **weighted avg** | **0.8440** | **0.8710** | **0.8570** | **12584** |
+
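To make the averaged rows easier to read, the sketch below recomputes two of the reported numbers from the table: the GPE F1 as the harmonic mean of its precision and recall, and the macro-average F1 as the unweighted mean of the 18 per-entity F1 scores. It assumes the table follows the usual classification-report conventions; the README does not say which tool produced it, and the names `f1_score`, `per_class_f1`, and `macro_f1` are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-entity F1 scores copied from the table above.
per_class_f1 = {
    "CARDINAL": 0.7920, "DATE": 0.8272, "EVENT": 0.5550, "FAC": 0.6061,
    "GPE": 0.9199, "LANGUAGE": 0.7143, "LAW": 0.5532, "LOC": 0.6926,
    "MONEY": 0.8953, "NORP": 0.9067, "ORDINAL": 0.7506, "ORG": 0.8802,
    "PERCENT": 0.8790, "PERSON": 0.9121, "PRODUCT": 0.6170,
    "QUANTITY": 0.6780, "TIME": 0.6263, "WORK_OF_ART": 0.6229,
}

# Assumption: macro F1 is the plain mean of per-entity F1, ignoring support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# GPE precision and recall, also copied from the table.
gpe_f1 = f1_score(0.9243, 0.9156)

print(f"GPE F1 recomputed:   {gpe_f1:.4f}")
print(f"macro F1 recomputed: {macro_f1:.4f}")
```

Because the macro row averages per-entity scores, rare types such as LANGUAGE and LAW weigh as much as GPE or PERSON, which is why the macro F1 sits well below the micro F1.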
+ ## 🛠 Training Details
+ - **Architecture**: `BertForTokenClassification`
+ - **Tokenizer**: `BertTokenizerFast` (using `is_split_into_words=True` for alignment)
+ - **Epochs**: 5
+ - **Learning Rate**: 2e-5
+ - **Batch Size**: 16 per device (Total 32 on 2x V100 GPUs)
+ - **Max Sequence Length**: 128
+ - **Weight Decay**: 0.01
+ - **Mixed Precision (FP16)**: Enabled
+
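The tokenizer entry mentions `is_split_into_words=True` for alignment but does not show how word-level labels were mapped onto sub-word tokens. The sketch below shows one common approach, offered as an assumption rather than the author's method: the first sub-token of each word keeps the word's label, and special tokens and continuation pieces receive `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The `word_ids` argument stands in for the list returned by a fast tokenizer's `word_ids()`; `align_labels` is a hypothetical helper name.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Spread word-level label ids over sub-word tokens.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation piece
        previous = word_id
    return aligned


# Hypothetical example: three words, the second one split into two pieces.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [3, 0, 5]
print(align_labels(example_word_ids, example_labels))
```

Masking continuation pieces means the loss is computed once per word; labelling every piece instead is an equally plausible choice that this README neither confirms nor rules out.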
+ ## 📂 Label Mapping
+ The model was trained on the following 18 OntoNotes entity types, each encoded with BIO tags (`B-`/`I-` prefixes plus the outside tag `O`):
+ `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
+
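As an illustration of what 18 entity types plus BIO tags implies, the sketch below expands the list into a full tag set of 37 labels (one `O` tag plus a `B-` and an `I-` tag per type). The index order shown here is an assumption; the authoritative mapping is the `id2label` entry in the model's `config.json`, which may be ordered differently.

```python
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]

# "O" marks tokens outside any entity; every type gets a B- (begin)
# and an I- (inside) tag. The ordering here is illustrative only.
labels = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]

id2label = dict(enumerate(labels))
label2id = {label: index for index, label in id2label.items()}

print(len(labels))   # 1 + 2 * 18 tags in total
print(labels[:5])
```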
+ ## 📂 Project Assets
+ - **GitHub Repository**: https://github.com/Learnrr/ontonotes5_ner_evaluation.git
+
+ | **Asset** | **File** | **Description** |
+ | :--- | :--- | :--- |
+ | **Model Weights** | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~496 MB). |
+ | **Configuration** | `config.json` | Model architecture settings and `id2label` entity mapping. |
+ | **Vocabulary** | `vocab.txt` | WordPiece vocabulary file used for tokenization. |
+ | **Tokenizer** | `tokenizer.json` | Full fast tokenizer configuration for production inference. |
+ | **Special Tokens** | `special_tokens_map.json` | Mapping for tokens like `[CLS]`, `[SEP]`, `[MASK]`, etc. |
+ | **Training Args** | `training_args.bin` | Binary file containing the hyperparameter settings used during training. |
+
+ ## 🚀 Usage
+ You can load the model directly with the `transformers` token-classification pipeline:
+ ```python
+ from transformers import pipeline
+
+ # Model ID on the Hugging Face Hub
+ model_checkpoint = "learnrr/bert-base-ontonotes5-ner"
+ token_classifier = pipeline(
+     "token-classification",
+     model=model_checkpoint,
+     aggregation_strategy="simple",  # merge sub-word pieces into entity spans
+ )
+
+ text = "Apple was founded by Steve Jobs in Cupertino."
+ results = token_classifier(text)
+
+ # Each result is a dict holding the span text, its entity label, and a score
+ for entity in results:
+     print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
+ ```
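With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with keys such as `entity_group`, `word`, and `score`. As a small follow-up sketch, the hypothetical helper below (`group_by_label`, not part of `transformers`) collects the predicted spans per entity type and optionally drops low-confidence predictions; the `min_score` threshold is an illustrative parameter, not a recommended value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_label(results: Iterable[Mapping], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group pipeline predictions by entity label.

    Each item is expected to carry 'entity_group', 'word', and 'score',
    as produced by the token-classification pipeline with aggregation.
    """
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)
```

Calling `group_by_label(results, min_score=0.5)` on the `results` list from the snippet above would return a dict keyed by labels such as `ORG` or `PERSON`.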