	---
	license: mit
	datasets:
	- ontonotes/conll2012_ontonotesv5
	language:
	- en
	base_model:
	- FacebookAI/roberta-base
	pipeline_tag: token-classification
	---

	# RoBERTa-base fine-tuned on OntoNotes 5.0

	This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. RoBERTa refines BERT's pretraining recipe by using dynamic masking, dropping the next-sentence prediction objective, training with larger batches, and tokenizing with byte-level BPE.

	## 📊 Performance
	The following results were achieved on the OntoNotes 5.0 (v12) test set:

	| Entity | Precision | Recall | F1-Score | Support |
	| :--- | :---: | :---: | :---: | :---: |
	| CARDINAL | 0.7813 | 0.8070 | 0.7939 | 1005 |
	| DATE | 0.8032 | 0.8729 | 0.8366 | 1786 |
	| EVENT | 0.5566 | 0.6941 | 0.6178 | 85 |
	| FAC | 0.7059 | 0.6443 | 0.6737 | 149 |
	| GPE | 0.9283 | 0.9356 | 0.9319 | 2546 |
	| LANGUAGE | 0.8667 | 0.5909 | 0.7027 | 22 |
	| LAW | 0.4643 | 0.5909 | 0.5200 | 44 |
	| LOC | 0.7354 | 0.7628 | 0.7489 | 215 |
	| MONEY | 0.8414 | 0.8817 | 0.8611 | 355 |
	| NORP | 0.9004 | 0.9404 | 0.9200 | 990 |
	| ORDINAL | 0.6944 | 0.8454 | 0.7625 | 207 |
	| ORG | 0.8653 | 0.8986 | 0.8816 | 2002 |
	| PERCENT | 0.8605 | 0.9091 | 0.8841 | 407 |
	| PERSON | 0.9083 | 0.9236 | 0.9159 | 2134 |
	| PRODUCT | 0.6771 | 0.7222 | 0.6989 | 90 |
	| QUANTITY | 0.7410 | 0.6732 | 0.7055 | 153 |
	| TIME | 0.5670 | 0.6578 | 0.6091 | 225 |
	| WORK_OF_ART | 0.6105 | 0.6864 | 0.6462 | 169 |
	| micro avg | 0.8471 | 0.8822 | 0.8643 | 12584 |
	| macro avg | 0.7504 | 0.7798 | 0.7617 | 12584 |
	| weighted avg | 0.8497 | 0.8822 | 0.8652 | 12584 |
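
The F1 column is the harmonic mean of precision and recall. The macro average is the unweighted mean of the per-type scores, and the weighted average scales each type by its support. A minimal sketch of that arithmetic, using the F1 and support values copied from the table above; the helper name `f1_score` is illustrative and not taken from the project's evaluation code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (F1, support) per entity type, copied from the table above
per_type = {
    "CARDINAL": (0.7939, 1005), "DATE": (0.8366, 1786), "EVENT": (0.6178, 85),
    "FAC": (0.6737, 149), "GPE": (0.9319, 2546), "LANGUAGE": (0.7027, 22),
    "LAW": (0.5200, 44), "LOC": (0.7489, 215), "MONEY": (0.8611, 355),
    "NORP": (0.9200, 990), "ORDINAL": (0.7625, 207), "ORG": (0.8816, 2002),
    "PERCENT": (0.8841, 407), "PERSON": (0.9159, 2134), "PRODUCT": (0.6989, 90),
    "QUANTITY": (0.7055, 153), "TIME": (0.6091, 225), "WORK_OF_ART": (0.6462, 169),
}

# Macro average: every entity type counts equally
macro_f1 = sum(f1 for f1, _ in per_type.values()) / len(per_type)

# Weighted average: each entity type counts in proportion to its support
total_support = sum(n for _, n in per_type.values())
weighted_f1 = sum(f1 * n for f1, n in per_type.values()) / total_support

print(f"GPE F1 from P/R: {f1_score(0.9283, 0.9356):.4f}")
print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}, support: {total_support}")
```

Because the table values are rounded to four decimals, the recomputed numbers should match the reported ones only up to rounding. The gap between the macro and weighted F1 comes from low-support types such as LAW and EVENT, which pull the macro average down.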

	## 🛠 Training Details
	- Architecture: `RobertaForTokenClassification`
	- Tokenizer: `RobertaTokenizerFast` (with `add_prefix_space=True`)
	- Epochs: 5
	- Learning Rate: 2e-5
	- Batch Size: 16 per device (Total 32 on 2x V100 GPUs)
	- Max Sequence Length: 128
	- Weight Decay: 0.01
	- Mixed Precision (FP16): Enabled
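
The hyperparameters above map fairly directly onto keyword arguments of `transformers.TrainingArguments`. The sketch below collects them in a plain dictionary. The `output_dir` value is a placeholder, and the author's actual training script is not shown in this card, so treat this as an assumption about how the settings could be passed rather than the original configuration. The maximum sequence length is applied at tokenization time and is therefore not part of these arguments.

```python
# Hyperparameters from the list above, keyed by TrainingArguments argument names.
training_kwargs = {
    "output_dir": "roberta-base-ontonotes5-ner",  # placeholder path (assumption)
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
    "fp16": True,  # mixed precision; needs a CUDA-capable GPU
}

NUM_GPUS = 2  # 2x V100, as stated in the list above

# Total batch size per optimizer step across all devices
effective_batch_size = training_kwargs["per_device_train_batch_size"] * NUM_GPUS
print(f"Effective batch size: {effective_batch_size}")


def build_training_arguments():
    """Build a TrainingArguments object (requires transformers and torch)."""
    from transformers import TrainingArguments  # imported lazily; heavy dependency

    return TrainingArguments(**training_kwargs)
```

Calling `build_training_arguments()` on a machine without a GPU may fail because of `fp16=True`; setting that entry to `False` is enough for a CPU dry run.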

	## 📂 Project Assets
	- GitHub Repository: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation.git)

	| Asset | File | Description |
	| :--- | :--- | :--- |
	| Model Weights | `model.safetensors` | Fine-tuned RoBERTa weights (~496 MB). |
	| Configuration | `config.json` | Model architecture & `id2label` mappings. |
	| Vocabulary | `vocab.json` / `merges.txt` | Byte-level BPE vocabulary and merge rules. |
	| Tokenizer | `tokenizer.json` | Full fast tokenizer configuration. |
	| Special Tokens | `special_tokens_map.json` | Definitions for BOS, EOS, and Padding tokens. |
	| Training Args | `training_args.bin` | Detailed dump of the training hyperparameters. |

	## 🚀 Usage
	```python
	from transformers import pipeline

	# Hub ID of the fine-tuned checkpoint
	model_checkpoint = "learnrr/roberta-base-ontonotes5-ner"

	# "simple" aggregation merges sub-word tokens into whole entity spans
	token_classifier = pipeline(
	    "token-classification",
	    model=model_checkpoint,
	    aggregation_strategy="simple",
	)

	text = "Microsoft Corporation was founded by Bill Gates and Paul Allen."
	results = token_classifier(text)

	# Each result is a dict with keys such as "word", "entity_group", and "score"
	for entity in results:
	    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
	```
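
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries, one per entity span, with keys including `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to post-process that output by grouping spans by label and dropping low-confidence ones. The helper `entities_by_label`, the threshold, and the sample scores are illustrative assumptions, not actual output from this model.

```python
from collections import defaultdict


def entities_by_label(results, min_score=0.5):
    """Group entity spans by label, dropping spans scored below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            # Byte-level BPE can leave a leading space on the decoded span
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Hand-written sample in the pipeline's output shape (scores are made up)
sample = [
    {"entity_group": "ORG", "score": 0.99, "word": " Microsoft Corporation", "start": 0, "end": 21},
    {"entity_group": "PERSON", "score": 0.98, "word": " Bill Gates", "start": 37, "end": 47},
    {"entity_group": "PERSON", "score": 0.40, "word": " Paul Allen", "start": 52, "end": 62},
]

print(entities_by_label(sample))
```

In real use, pass the `results` list from the pipeline call instead of `sample`, and tune `min_score` to the precision you need for each entity type.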
