	# BERT-Based Named Entity Recognition (NER) Model

	This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.

	---

	## Model Details

	- Model Name: BERT-Base-Cased NER
	- Model Architecture: BERT Base
	- Task: Named Entity Recognition (NER)
	- Dataset: WNUT-17 (from Hugging Face Datasets)
	- Quantization: Float16
	- Fine-tuning Framework: Hugging Face Transformers

	---

	## Usage

	### Installation

	```bash
	pip install transformers datasets evaluate seqeval scikit-learn torch
	```

	### Training the Model

	```python
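	from transformers import Trainer

	# `model`, `training_args`, `tokenized_datasets`, `tokenizer`, `data_collator`
	# and `compute_metrics` are assumed to be defined earlier in the training script.
	trainer = Trainer(
	    model=model,
	    args=training_args,
	    train_dataset=tokenized_datasets["train"],
	    eval_dataset=tokenized_datasets["validation"],
	    tokenizer=tokenizer,
	    data_collator=data_collator,
	    compute_metrics=compute_metrics,
	)

	trainer.train()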
	```

	### Saving the Model

	```python
	model.save_pretrained("./saved_model")
	tokenizer.save_pretrained("./saved_model")
	```

	### Testing the Saved Model

	```python
	from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

	model = AutoModelForTokenClassification.from_pretrained("./saved_model")
	tokenizer = AutoTokenizer.from_pretrained("./saved_model")
	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

	sample_sentences = [
	    "Barack Obama visited Microsoft headquarters in Redmond.",
	    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
	    "Google is launching a new AI product in California.",
	]

	# Print the grouped entities predicted for each sentence
	for sentence in sample_sentences:
	    print(f"Sentence: {sentence}")
	    print(ner_pipeline(sentence))
	```
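With `aggregation_strategy="simple"`, the pipeline returns one list of dictionaries per sentence, each with the keys `entity_group`, `score`, `word`, `start` and `end`. A minimal sketch of turning that list into readable lines follows. `format_entities` is a hypothetical helper, and the sample predictions are hand-written for illustration, not output from this model.

```python
def format_entities(entities):
    """Render grouped NER predictions as one 'label: text (score)' line each."""
    lines = []
    for ent in entities:
        # float() also accepts NumPy scalars, which the pipeline uses for scores
        score = float(ent["score"])
        lines.append(f'{ent["entity_group"]}: {ent["word"]} ({score:.2f})')
    return lines


# Illustrative input only: hand-written dictionaries in the pipeline's output format
example_entities = [
    {"entity_group": "person", "score": 0.98, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "corporation", "score": 0.91, "word": "Microsoft", "start": 21, "end": 30},
]

for line in format_entities(example_entities):
    print(line)
```

In practice the argument would be `ner_pipeline(sentence)` rather than a hand-written list.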

	### Quantizing the Model

	```python
	import torch

	# Cast the weights to half precision; .to() modifies `model` in place and returns it
	quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
	quantized_model.save_pretrained("quantized-model")
	tokenizer.save_pretrained("quantized-model")
	```

	### Testing the Quantized Model

	```python
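	import torch
	from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

	# Load the half-precision weights saved in the previous step
	model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
	tokenizer = AutoTokenizer.from_pretrained("quantized-model")
	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")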
	```
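One way to check the float16 model is to run it and the full-precision model on the same sentences and compare the entity spans they return. The sketch below computes a simple overlap score (shared spans divided by all distinct spans). `span_set` and `span_agreement` are hypothetical helpers, and the inputs are hand-written stand-ins for pipeline output.

```python
def span_set(entities):
    """Reduce grouped pipeline predictions to a set of (start, end, label) triples."""
    return {(ent["start"], ent["end"], ent["entity_group"]) for ent in entities}


def span_agreement(full_precision, half_precision):
    """Fraction of distinct spans predicted by both models (1.0 if neither predicted any)."""
    a, b = span_set(full_precision), span_set(half_precision)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative inputs only: hand-written predictions, not output from this model
full = [
    {"entity_group": "person", "start": 0, "end": 12},
    {"entity_group": "location", "start": 47, "end": 54},
]
half = [
    {"entity_group": "person", "start": 0, "end": 12},
]
print(span_agreement(full, half))
```

A score of 1.0 means both models returned identical spans and labels for the sentence. Lower values point to sentences worth inspecting by hand.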

	---

	## Performance Metrics

	- Accuracy: Evaluated using seqeval on the validation split
	- Precision, Recall, F1 Score: Computed from the predicted and reference label sequences, excluding positions marked with the ignore index
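The exclusion of ignored indices mentioned above can be sketched as follows. The code assumes the ignore index is `-100`, the value PyTorch's cross-entropy loss skips by default. `decode_predictions` is a hypothetical helper, and the three-label scheme is a toy subset rather than the full WNUT-17 label set.

```python
IGNORE_INDEX = -100  # assumption: the usual ignore index for token classification


def decode_predictions(logits, label_ids, label_names):
    """Convert per-token scores and gold ids to label strings, skipping ignored positions."""
    true_labels, pred_labels = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_true, sent_pred = [], []
        for token_scores, gold in zip(sent_logits, sent_labels):
            if gold == IGNORE_INDEX:
                continue  # special token or masked wordpiece: not scored
            # Index of the highest score, i.e. an argmax over the label dimension
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            sent_true.append(label_names[gold])
            sent_pred.append(label_names[best])
        true_labels.append(sent_true)
        pred_labels.append(sent_pred)
    return true_labels, pred_labels


# Toy example: one sentence, four token positions, three labels
label_names = ["O", "B-person", "I-person"]
logits = [[[0.1, 0.2, 0.0], [0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.1, 0.2]]]
label_ids = [[-100, 1, 2, 0]]

true_labels, pred_labels = decode_predictions(logits, label_ids, label_names)
print(true_labels)
print(pred_labels)
```

The two lists of label strings are in the form that seqeval's scoring functions accept, for example `seqeval.metrics.f1_score(true_labels, pred_labels)`.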

	---

	## Fine-Tuning Details

	### Dataset

	The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes:
	- Tokenization with the BERT tokenizer
	- Alignment of word-level labels to WordPiece tokens
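Label alignment can be sketched as below. The function takes the per-token word indices that a fast tokenizer exposes through `word_ids()` (`None` for special tokens) together with the word-level label ids. This sketch labels only the first wordpiece of each word and masks the rest with `-100`. Whether this repository's training script does that or repeats the label on every wordpiece is not stated, so treat the choice, the `-100` value, and the name `align_labels` as assumptions.

```python
IGNORE_INDEX = -100  # assumption: positions with this label are excluded from the loss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto wordpiece tokens.

    word_ids: for each token, the index of the word it came from, or None for
              special tokens such as [CLS] and [SEP].
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first wordpiece keeps the word's label
        else:
            aligned.append(IGNORE_INDEX)          # later wordpieces are masked
        previous_word = word_id
    return aligned


# Hypothetical example: three words, the second split into two wordpieces
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 5, 0]
print(align_labels(word_ids, word_labels))
```

During preprocessing, the `word_ids` list would come from `tokenizer(words, is_split_into_words=True).word_ids()` for each example.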

	### Training Configuration

	- Epochs: 3
	- Batch Size: 16
	- Learning Rate: 2e-5
	- Max Length: 128 tokens (implicitly handled by tokenizer)
	- Evaluation Strategy: Per epoch
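The configuration above can be written down as keyword arguments, sketched here as a plain dictionary. The key names follow `transformers.TrainingArguments`, but they should be checked against the installed version, since older releases spell the evaluation option `evaluation_strategy`. The output directory is taken from the repository layout, the evaluation batch size is an assumption, and `total_optimizer_steps` is a hypothetical helper.

```python
import math

# Hyperparameters as listed above; pass them as TrainingArguments(**training_kwargs)
training_kwargs = {
    "output_dir": "ner_output",          # assumption: directory named in the repository layout
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,    # assumption: same batch size for evaluation
    "learning_rate": 2e-5,
    "eval_strategy": "epoch",            # `evaluation_strategy` in older releases
}


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# Assumption: 3,394 training sentences, the figure listed for WNUT-17's train split
print(total_optimizer_steps(3394, training_kwargs["per_device_train_batch_size"],
                            training_kwargs["num_train_epochs"]))
```

The step count is useful when choosing warmup or logging intervals. It changes with multiple devices or gradient accumulation.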

	### Quantization

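	The model was converted to PyTorch half precision (float16), which halves the storage needed for the weights relative to float32 and can reduce inference time, mainly on GPUs with native float16 support. This is a precision cast rather than integer (for example int8) quantization.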
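As a back-of-the-envelope check on the memory saving, the size of the weights is the parameter count multiplied by the bytes per parameter. The sketch below assumes roughly 108 million parameters for BERT Base. The exact count depends on the vocabulary and the classification head, and `weight_size_mb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_size_mb(num_params, dtype):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumption: BERT Base has roughly 108 million parameters
approx_params = 108_000_000

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_size_mb(approx_params, dtype):.0f} MB")
```

The estimate covers the weights only. Activations and framework overhead add to the memory actually used at inference time.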

	---

	## Repository Structure

	```
	.
	├── saved_model/       # Fine-tuned BERT model and tokenizer
	├── quantized-model/   # Quantized model for deployment
	├── ner_output/        # Training logs and checkpoints
	└── README.md          # Documentation
	```

	---

	## Limitations

	- May not generalize well to text domains or entity types outside those covered by WNUT-17
	- The float16 model may be slightly less accurate than the full-precision model, in exchange for lower memory use and faster inference

	---

	## Contributing

	Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.

	---