	---
	license: mit
	datasets:
	- ontonotes/conll2012_ontonotesv5
	language:
	- en
	base_model:
	- google-bert/bert-large-cased
	pipeline_tag: token-classification
	---

	# BERT-large-cased fine-tuned on OntoNotes 5.0

	This model is a fine-tuned version of [google-bert/bert-large-cased](https://huggingface.co/google-bert/bert-large-cased) on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. It performs Named Entity Recognition (NER) across 18 entity categories.

	## 📊 Performance
	The model achieves the following results on the OntoNotes 5.0 test set:

	| Entity | Precision | Recall | F1-Score | Support |
	| :--- | :---: | :---: | :---: | :---: |
	| CARDINAL | 0.7813 | 0.7891 | 0.7851 | 1005 |
	| DATE | 0.7988 | 0.8516 | 0.8244 | 1786 |
	| EVENT | 0.5619 | 0.6941 | 0.6211 | 85 |
	| FAC | 0.6880 | 0.5772 | 0.6277 | 149 |
	| GPE | 0.9185 | 0.9207 | 0.9196 | 2546 |
	| LANGUAGE | 0.8421 | 0.7273 | 0.7805 | 22 |
	| LAW | 0.4762 | 0.6818 | 0.5607 | 44 |
	| LOC | 0.6337 | 0.7163 | 0.6725 | 215 |
	| MONEY | 0.8636 | 0.9099 | 0.8861 | 355 |
	| NORP | 0.8481 | 0.8909 | 0.8690 | 990 |
	| ORDINAL | 0.7054 | 0.7633 | 0.7332 | 207 |
	| ORG | 0.8690 | 0.9046 | 0.8864 | 2002 |
	| PERCENT | 0.8467 | 0.8821 | 0.8640 | 407 |
	| PERSON | 0.9090 | 0.9217 | 0.9153 | 2134 |
	| PRODUCT | 0.6667 | 0.6889 | 0.6776 | 90 |
	| QUANTITY | 0.6972 | 0.6471 | 0.6712 | 153 |
	| TIME | 0.6106 | 0.6133 | 0.6120 | 225 |
	| WORK_OF_ART | 0.6354 | 0.6805 | 0.6571 | 169 |
	| micro avg | 0.8412 | 0.8675 | 0.8542 | 12584 |
	| macro avg | 0.7418 | 0.7700 | 0.7535 | 12584 |
	| weighted avg | 0.8427 | 0.8675 | 0.8546 | 12584 |
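
The averaged rows can be cross-checked against the per-entity rows. The sketch below assumes the usual definitions used by classification reports (macro = unweighted mean over entity types, weighted = support-weighted mean); the report tool that produced the table is not stated here, and the helper names `macro_f1` and `weighted_f1` are illustrative only. The micro average is omitted because it needs raw true-positive counts that the table does not list.

```python
# Per-entity (F1, support) pairs copied from the Performance table.
per_entity = {
    "CARDINAL": (0.7851, 1005),
    "DATE": (0.8244, 1786),
    "EVENT": (0.6211, 85),
    "FAC": (0.6277, 149),
    "GPE": (0.9196, 2546),
    "LANGUAGE": (0.7805, 22),
    "LAW": (0.5607, 44),
    "LOC": (0.6725, 215),
    "MONEY": (0.8861, 355),
    "NORP": (0.8690, 990),
    "ORDINAL": (0.7332, 207),
    "ORG": (0.8864, 2002),
    "PERCENT": (0.8640, 407),
    "PERSON": (0.9153, 2134),
    "PRODUCT": (0.6776, 90),
    "QUANTITY": (0.6712, 153),
    "TIME": (0.6120, 225),
    "WORK_OF_ART": (0.6571, 169),
}


def macro_f1(scores):
    """Unweighted mean of the per-entity F1 values."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of the per-entity F1 values, weighted by support."""
    total_support = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total_support


print(f"macro F1:    {macro_f1(per_entity):.4f}")
print(f"weighted F1: {weighted_f1(per_entity):.4f}")
```

Under these assumptions the recomputed values agree with the reported 0.7535 and 0.8546 to four decimals. The gap between the two reflects how much low-support types such as `LAW` and `EVENT` pull down the macro average.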

	## 🛠 Training Details
	- Architecture: `BertForTokenClassification` (Large)
	- Tokenizer: `BertTokenizerFast` (using `is_split_into_words=True`)
	- Epochs: 5
	- Learning Rate: 1e-5
	- Batch Size: 4 per device (2x V100 GPUs)
	- Gradient Accumulation: 4 steps (Effective Batch Size = 32)
	- Max Sequence Length: 128
	- Weight Decay: 0.01
	- Mixed Precision (FP16): Enabled
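
As a quick check on the effective batch size, the settings above can be written out as a plain dictionary. This is only a sketch: the key names mirror fields of `transformers.TrainingArguments`, which is an assumption about how the run was configured, and `num_gpus` and `max_seq_length` are illustrative variable names rather than values read from `training_args.bin`.

```python
# Hyperparameters from the Training Details list. Key names follow
# transformers.TrainingArguments fields (assumed mapping, not the author's script).
hyperparameters = {
    "num_train_epochs": 5,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "weight_decay": 0.01,
    "fp16": True,
}
num_gpus = 2          # 2x V100
max_seq_length = 128  # applied when tokenizing, not a TrainingArguments field

# Examples consumed per optimizer step:
# per-device batch x number of devices x accumulation steps.
effective_batch_size = (
    hyperparameters["per_device_train_batch_size"]
    * num_gpus
    * hyperparameters["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 4 * 2 * 4 = 32
```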

	## 📂 Labels Mapping
	The model identifies 18 entity types from OntoNotes 5.0:
	`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
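
Token-classification models usually predict per-token tags rather than bare entity types. The sketch below assumes a BIO tagging scheme (`O` plus `B-`/`I-` variants of each type) and a first-subword labeling strategy; neither is confirmed by this card, and the ID order shown is illustrative, so read the real `id2label` mapping from `config.json`. The `word_ids` list is hand-written in the shape returned by a fast tokenizer's `word_ids()` method when called with `is_split_into_words=True`.

```python
ENTITY_TYPES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]

# Assumed BIO scheme: one "O" tag plus B-/I- tags per entity type.
# This ordering is illustrative and may differ from the checkpoint's config.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}


def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Map word-level label IDs onto subword tokens.

    Only the first subword of each word keeps the label; special tokens
    (word ID None) and continuation subwords get ignore_index so that
    the loss skips them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_label_ids[word_id])
        previous = word_id
    return aligned


# Hypothetical two-word input where the second word splits into three pieces.
word_labels = [label2id["O"], label2id["B-GPE"]]
word_ids = [None, 0, 1, 1, 1, None]
print(align_labels(word_ids, word_labels))
```

Labeling only the first subword is one common choice; propagating the label to every subword is another, and the two are not interchangeable at evaluation time.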

	## 📂 Project Assets
	- GitHub Repository: https://github.com/Learnrr/ontonotes5_ner_evaluation.git

	| Asset | File | Description |
	| :--- | :--- | :--- |
	| Model Weights | `model.safetensors` | Large-scale checkpoint (~1.2 GB). |
	| Configuration | `config.json` | Model architecture & `id2label` mapping. |
	| Vocabulary | `vocab.txt` | BERT-cased specific vocabulary. |
	| Tokenizer | `tokenizer.json` | Optimized fast tokenizer configuration. |
	| Special Tokens | `special_tokens_map.json` | Definitions for BERT's special tokens (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, `[MASK]`). |
	| Training Args | `training_args.bin` | Detailed hyperparameter dump from the Trainer. |

	## 🚀 Usage
	```python
	from transformers import pipeline

	# Load the fine-tuned checkpoint from the Hugging Face Hub.
	model_checkpoint = "learnrr/bert-large-ontonotes5-ner"
	token_classifier = pipeline(
	    "token-classification",
	    model=model_checkpoint,
	    aggregation_strategy="simple",  # merge subword tokens into whole entity spans
	)

	text = "The United Nations is headquartered in New York City."
	results = token_classifier(text)

	# Each result is a dict describing one aggregated entity span.
	for entity in results:
	    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
	```
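
With an aggregation strategy set, the pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below shows one way to filter and group that output. The helper `group_entities`, the 0.5 threshold, and the `mock_results` list are all hypothetical: the scores are placeholders chosen for illustration, not real predictions from this model.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity surface forms by label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


text = "The United Nations is headquartered in New York City."

# Placeholder data in the shape of aggregated pipeline output (not real scores).
mock_results = [
    {"entity_group": "ORG", "score": 0.99, "word": "United Nations", "start": 4, "end": 18},
    {"entity_group": "GPE", "score": 0.42, "word": "New York City", "start": 39, "end": 52},
]

print(group_entities(mock_results))

# The character offsets index into the original string, which is useful
# when the tokenizer's "word" field differs from the raw text.
for entity in mock_results:
    print(text[entity["start"]:entity["end"]])
```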
