	---
	license: mit
	datasets:
	- ontonotes/conll2012_ontonotesv5
	language:
	- en
	base_model:
	- google-bert/bert-base-cased
	pipeline_tag: token-classification
	---
	# BERT-base-cased fine-tuned on OntoNotes 5.0

	This model is a fine-tuned version of [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased) on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and can identify 18 types of entities.

	## 📊 Performance
	The model achieves the following results on the OntoNotes 5.0 test set:

	| Entity | Precision | Recall | F1-Score | Support |
	| :--- | :---: | :---: | :---: | :---: |
	| CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 |
	| DATE | 0.7943 | 0.8628 | 0.8272 | 1786 |
	| EVENT | 0.5000 | 0.6235 | 0.5550 | 85 |
	| FAC | 0.6081 | 0.6040 | 0.6061 | 149 |
	| GPE | 0.9243 | 0.9156 | 0.9199 | 2546 |
	| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
	| LAW | 0.5200 | 0.5909 | 0.5532 | 44 |
	| LOC | 0.6478 | 0.7442 | 0.6926 | 215 |
	| MONEY | 0.8760 | 0.9155 | 0.8953 | 355 |
	| NORP | 0.8956 | 0.9182 | 0.9067 | 990 |
	| ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 |
	| ORG | 0.8621 | 0.8991 | 0.8802 | 2002 |
	| PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 |
	| PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 |
	| PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 |
	| QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 |
	| TIME | 0.5906 | 0.6667 | 0.6263 | 225 |
	| WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 |
	| micro avg | 0.8413 | 0.8710 | 0.8559 | 12584 |
	| macro avg | 0.7297 | 0.7649 | 0.7460 | 12584 |
	| weighted avg | 0.8440 | 0.8710 | 0.8570 | 12584 |
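
The macro and weighted summary rows follow directly from the per-entity rows. As a rough check, the sketch below recomputes the macro and support-weighted F1 from the table. The variable names are illustrative, and the micro average is left out because it is defined on raw true/false-positive counts, which the card does not report.

```python
# Per-entity (F1, support) pairs copied from the performance table.
per_entity = {
    "CARDINAL": (0.7920, 1005),
    "DATE": (0.8272, 1786),
    "EVENT": (0.5550, 85),
    "FAC": (0.6061, 149),
    "GPE": (0.9199, 2546),
    "LANGUAGE": (0.7143, 22),
    "LAW": (0.5532, 44),
    "LOC": (0.6926, 215),
    "MONEY": (0.8953, 355),
    "NORP": (0.9067, 990),
    "ORDINAL": (0.7506, 207),
    "ORG": (0.8802, 2002),
    "PERCENT": (0.8790, 407),
    "PERSON": (0.9121, 2134),
    "PRODUCT": (0.6170, 90),
    "QUANTITY": (0.6780, 153),
    "TIME": (0.6263, 225),
    "WORK_OF_ART": (0.6229, 169),
}

total_support = sum(support for _, support in per_entity.values())

# Macro average: plain mean over entity types, so rare types count as much as common ones.
macro_f1 = sum(f1 for f1, _ in per_entity.values()) / len(per_entity)

# Weighted average: each type contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in per_entity.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

The gap between the macro and weighted figures reflects the low-support types (`EVENT`, `LAW`, `LANGUAGE`, `PRODUCT`), which score lower but carry little weight in the weighted average.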

	## 🛠 Training Details
	- Architecture: `BertForTokenClassification`
	- Tokenizer: `BertTokenizerFast` (using `is_split_into_words=True` for alignment)
	- Epochs: 5
	- Learning Rate: 2e-5
	- Batch Size: 16 per device (32 total across 2× V100 GPUs)
	- Max Sequence Length: 128
	- Weight Decay: 0.01
	- Mixed Precision (FP16): Enabled
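
The card does not spell out how word-level labels were mapped onto WordPiece tokens. A common convention with `is_split_into_words=True` is to label only the first sub-token of each word and mask the rest with `-100`, which PyTorch's cross-entropy loss ignores by default. The sketch below assumes that convention. `align_labels` is a hypothetical helper, and `word_ids` stands in for the list returned by `BatchEncoding.word_ids()` on a fast tokenizer.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to token level (hypothetical helper).

    Special tokens (word id None) and non-initial sub-tokens get IGNORE_INDEX.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], padding
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation sub-token
        previous_word = word_id
    return aligned


# Illustrative input: two words, the first split into three sub-tokens.
# The list has the shape produced by
# tokenizer(words, is_split_into_words=True).word_ids().
example_word_ids = [None, 0, 0, 0, 1, None]
example_labels = [5, 0]  # made-up label ids, one per word
print(align_labels(example_word_ids, example_labels))
# [-100, 5, -100, -100, 0, -100]
```

Sequences longer than the 128-token limit would additionally need truncation before alignment.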

	## 📂 Labels Mapping
	The model predicts BIO tags over the 18 OntoNotes entity types listed below; the full `id2label` mapping is stored in `config.json`:
	`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
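
Under a standard BIO scheme, each entity type gets a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag for non-entity tokens, which gives 2 × 18 + 1 = 37 labels. The sketch below builds such a label list. The ordering is an assumption made for illustration, so the `id2label` mapping in `config.json` remains the source of truth for the actual ids.

```python
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE",
    "LAW", "LOC", "MONEY", "NORP", "ORDINAL", "ORG",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART",
]

# "O" first, then a B-/I- pair per type. This ordering is illustrative only;
# read the real ids from the model's config.json.
labels = ["O"] + [
    f"{prefix}-{entity_type}"
    for entity_type in ENTITY_TYPES
    for prefix in ("B", "I")
]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))  # 37
print(labels[:5])   # ['O', 'B-CARDINAL', 'I-CARDINAL', 'B-DATE', 'I-DATE']
```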

	## 📂 Project Assets
	- GitHub Repository: https://github.com/Learnrr/ontonotes5_ner_evaluation.git

	| Asset | File | Description |
	| :--- | :--- | :--- |
	| Model Weights | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~431 MB). |
	| Configuration | `config.json` | Model architecture settings and `id2label` entity mapping. |
	| Vocabulary | `vocab.txt` | BERT-cased WordPiece vocabulary for tokenization. |
	| Tokenizer | `tokenizer.json` / `tokenizer_config.json` | Optimized fast tokenizer configuration and serialization. |
	| Special Tokens | `special_tokens_map.json` | Definitions for special tokens like `[CLS]`, `[SEP]`, etc. |
	| Training Args | `training_args.bin` | Detailed hyperparameter settings used during the training run. |

	## 🚀 Usage
	You can use this model directly with a pipeline for token classification:
	```python
	from transformers import pipeline

	model_checkpoint = "learnrr/bert-base-ontonotes5-ner"
	token_classifier = pipeline(
	    "token-classification",
	    model=model_checkpoint,
	    aggregation_strategy="simple",  # merge word pieces into whole entity spans
	)

	text = "Apple was founded by Steve Jobs in Cupertino."
	results = token_classifier(text)

	# Print one line per detected entity span.
	for entity in results:
	    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
	```
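
With `aggregation_strategy="simple"`, each result is a dict with `entity_group`, `score`, `word`, `start`, and `end` keys. As a small follow-up sketch, the snippet below groups spans by label and drops low-confidence ones. The `entities_by_label` helper, the 0.5 threshold, and the sample results are all illustrative and are not actual model output.

```python
from collections import defaultdict


def entities_by_label(results, min_score=0.5):
    """Group aggregated pipeline results by label (hypothetical helper)."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Made-up results in the shape the pipeline returns; scores are not real model output.
sample_results = [
    {"entity_group": "ORG", "score": 0.98, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PERSON", "score": 0.99, "word": "Steve Jobs", "start": 21, "end": 31},
    {"entity_group": "GPE", "score": 0.42, "word": "Cupertino", "start": 35, "end": 44},
]
print(entities_by_label(sample_results))
# {'ORG': ['Apple'], 'PERSON': ['Steve Jobs']}
```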
