---
license: mit
datasets:
- ontonotes/conll2012_ontonotesv5
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: token-classification
---

# RoBERTa-large fine-tuned on OntoNotes 5.0

This model is a fine-tuned version of [FacebookAI/roberta-large](https://huggingface.co/FacebookAI/roberta-large) on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. RoBERTa-large has 24 layers and ~355M parameters, which gives it more capacity than RoBERTa-base for Named Entity Recognition (NER) over the 18 OntoNotes entity types.

## 📊 Performance
The following results were achieved on the OntoNotes 5.0 (v12) test set:

| Entity | Precision | Recall | F1-Score | Support |
| :--- | :---: | :---: | :---: | :---: |
| CARDINAL | 0.7769 | 0.7900 | 0.7834 | 1005 |
| DATE | 0.8211 | 0.8533 | 0.8369 | 1786 |
| EVENT | 0.5702 | 0.7647 | 0.6533 | 85 |
| FAC | 0.7123 | 0.6980 | 0.7051 | 149 |
| GPE | 0.9262 | 0.9470 | 0.9365 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5000 | 0.6364 | 0.5600 | 44 |
| LOC | 0.6597 | 0.7302 | 0.6932 | 215 |
| MONEY | 0.8730 | 0.9099 | 0.8910 | 355 |
| NORP | 0.9029 | 0.9485 | 0.9251 | 990 |
| ORDINAL | 0.6936 | 0.7874 | 0.7376 | 207 |
| ORG | 0.8870 | 0.9101 | 0.8984 | 2002 |
| PERCENT | 0.8703 | 0.9066 | 0.8881 | 407 |
| PERSON | 0.9250 | 0.9246 | 0.9248 | 2134 |
| PRODUCT | 0.7356 | 0.7111 | 0.7232 | 90 |
| QUANTITY | 0.6933 | 0.6797 | 0.6865 | 153 |
| TIME | 0.6211 | 0.6267 | 0.6239 | 225 |
| WORK_OF_ART | 0.6686 | 0.6923 | 0.6802 | 169 |
| micro avg | 0.8581 | 0.8831 | 0.8704 | 12584 |
| macro avg | 0.7548 | 0.7888 | 0.7701 | 12584 |
| weighted avg | 0.8596 | 0.8831 | 0.8710 | 12584 |
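
The gap between the macro and weighted averages comes from how the per-entity scores are combined. The sketch below recomputes both F1 averages from the rounded per-entity values in the table. It assumes the usual classification-report definitions (macro is an unweighted mean, weighted uses support as the weight); the card does not say which tool produced the numbers, so treat this as an illustration rather than the original evaluation code.

```python
# Per-entity (F1, support) pairs copied from the performance table.
per_entity = {
    "CARDINAL": (0.7834, 1005),
    "DATE": (0.8369, 1786),
    "EVENT": (0.6533, 85),
    "FAC": (0.7051, 149),
    "GPE": (0.9365, 2546),
    "LANGUAGE": (0.7143, 22),
    "LAW": (0.5600, 44),
    "LOC": (0.6932, 215),
    "MONEY": (0.8910, 355),
    "NORP": (0.9251, 990),
    "ORDINAL": (0.7376, 207),
    "ORG": (0.8984, 2002),
    "PERCENT": (0.8881, 407),
    "PERSON": (0.9248, 2134),
    "PRODUCT": (0.7232, 90),
    "QUANTITY": (0.6865, 153),
    "TIME": (0.6239, 225),
    "WORK_OF_ART": (0.6802, 169),
}

total_support = sum(n for _, n in per_entity.values())

# Macro average: every entity type counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in per_entity.values()) / len(per_entity)

# Weighted average: each entity type counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in per_entity.values()) / total_support

print(f"support={total_support}  macro F1={macro_f1:.4f}  weighted F1={weighted_f1:.4f}")
```

Both values should agree with the table to within rounding error. Rare, low-scoring types such as LAW and EVENT pull the macro average down, while frequent types such as GPE, PERSON, and ORG dominate the weighted one. The micro average is computed from pooled counts, so it cannot be recovered from the per-entity F1 scores alone.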

## 🛠 Training Details
The model was fine-tuned on 2× NVIDIA V100 GPUs with the following settings:
- Architecture: `RobertaForTokenClassification`
- Tokenizer: `RobertaTokenizerFast` (with `add_prefix_space=True`)
- Learning Rate: 1e-5
- Effective Batch Size: 32 (4 per device × 2 GPUs × 4 gradient accumulation steps)
- Epochs: 5
- Warmup Ratio: 0.1
- Mixed Precision: FP16 enabled
- Optimizer: AdamW with `weight_decay=0.01`
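
The arithmetic behind the effective batch size and the warmup length can be sketched in plain Python. The dictionary keys below match `transformers.TrainingArguments` parameter names, so they could be passed on as keyword arguments. `NUM_TRAIN_EXAMPLES` is a hypothetical placeholder, since the card does not state the training-set size, and the step counting is an approximation of what a trainer would do, not the actual training script.

```python
import math

# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names.
hparams = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 5,
    "warmup_ratio": 0.1,
    "fp16": True,
    "weight_decay": 0.01,
}

NUM_GPUS = 2  # stated in the card: 2x V100
NUM_TRAIN_EXAMPLES = 60_000  # HYPOTHETICAL placeholder, not taken from the card

# Examples consumed per optimizer update across all devices.
effective_batch_size = (
    hparams["per_device_train_batch_size"]
    * NUM_GPUS
    * hparams["gradient_accumulation_steps"]
)

# Approximate optimizer steps for the whole run.
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch_size)
total_steps = steps_per_epoch * hparams["num_train_epochs"]

# A warmup ratio of 0.1 means the first 10% of steps ramp the learning rate up.
warmup_steps = math.ceil(total_steps * hparams["warmup_ratio"])

print(f"effective batch size: {effective_batch_size}")
print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
```

Only the effective batch size is independent of the placeholder. The step counts change with the real dataset size.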

## 📂 Project Assets
- GitHub Repository: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

| Asset | File | Description |
| :--- | :--- | :--- |
| Model Weights | `model.safetensors` | Fine-tuned Large weights (~1.42 GB). |
| Configuration | `config.json` | 24-layer configuration and `id2label` map. |
| Vocabulary | `vocab.json` / `merges.txt` | BPE vocabulary and byte-level merge rules. |
| Tokenizer | `tokenizer.json` / `tokenizer_config.json` | Complete fast tokenizer setup. |
| Special Tokens | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
| Training Args | `training_args.bin` | Hyperparameters used during the training run. |

## 🚀 Usage
```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model_checkpoint = "learnrr/roberta-large-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",  # merge sub-word tokens into whole entity spans
)

text = "The United Nations is headquartered in New York City."
results = token_classifier(text)

# Each result is a dict holding the entity text, its label, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
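
Downstream code often wants entities grouped by label with low-confidence spans dropped. The sketch below shows one way to post-process the pipeline output. `group_entities` and the `min_score` threshold are hypothetical helpers, and `sample_results` is invented data shaped like the aggregated output (with `entity_group`, `score`, `word`, `start`, and `end` keys) so the example runs without downloading the model.

```python
from collections import defaultdict

text = "The United Nations is headquartered in New York City."

# HYPOTHETICAL sample shaped like aggregated pipeline output;
# the scores are invented for illustration, not real model predictions.
sample_results = [
    {"entity_group": "ORG", "score": 0.99, "word": " United Nations", "start": 4, "end": 18},
    {"entity_group": "GPE", "score": 0.98, "word": " New York City", "start": 39, "end": 52},
    {"entity_group": "ORG", "score": 0.35, "word": "The", "start": 0, "end": 3},
]


def group_entities(text, results, min_score=0.5):
    """Group entity surface forms by label, dropping spans below min_score."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            # Slice the original text by character offsets so the surface form
            # is exact, independent of tokenizer whitespace handling in `word`.
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


for label, spans in group_entities(text, sample_results).items():
    print(f"{label}: {spans}")
```

With real predictions, pass the `results` list returned by `token_classifier(text)` in place of `sample_results`, and tune `min_score` to the precision you need.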
