| # BERT-Based Named Entity Recognition (NER) Model | |
| This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments. | |
| --- | |
| ## Model Details | |
| - **Model Name:** BERT-Base-Cased NER | |
| - **Model Architecture:** BERT Base | |
| - **Task:** Named Entity Recognition (NER) | |
| - **Dataset:** WNUT-17 (from Hugging Face Datasets) | |
| - **Quantization:** Float16 (half-precision weights) | |
| - **Fine-tuning Framework:** Hugging Face Transformers | |
| --- | |
| ## Usage | |
| ### Installation | |
| ```bash | |
| pip install transformers datasets evaluate seqeval scikit-learn torch | |
| ``` | |
| ### Training the Model | |
| ```python | |
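| from transformers import Trainer | |
| # Assumes model, training_args, tokenized_datasets, tokenizer, data_collator, | |
| # and compute_metrics have already been defined earlier in the script | |
| trainer = Trainer( | |
|     model=model, | |
|     args=training_args, | |
|     train_dataset=tokenized_datasets["train"], | |
|     eval_dataset=tokenized_datasets["validation"], | |
|     tokenizer=tokenizer,  # newer Transformers releases name this argument processing_class | |
|     data_collator=data_collator, | |
|     compute_metrics=compute_metrics, | |
| ) | |
| trainer.train() | |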
| ``` | |
| ### Saving the Model | |
| ```python | |
| model.save_pretrained("./saved_model") | |
| tokenizer.save_pretrained("./saved_model") | |
| ``` | |
| ### Testing the Saved Model | |
| ```python | |
| from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification | |
| model = AutoModelForTokenClassification.from_pretrained("./saved_model") | |
| tokenizer = AutoTokenizer.from_pretrained("./saved_model") | |
| ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") | |
| sample_sentences = [ | |
| "Barack Obama visited Microsoft headquarters in Redmond.", | |
| "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.", | |
| "Google is launching a new AI product in California." | |
| ] | |
| for sentence in sample_sentences: | |
|     print(f"Sentence: {sentence}") | |
|     print(ner_pipeline(sentence)) | |
| ``` | |
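With `aggregation_strategy="simple"`, the pipeline returns one dictionary per detected entity, with the keys `entity_group`, `score`, `word`, `start`, and `end`. A minimal sketch of turning that output into readable lines follows. The `format_entities` helper and the sample values are hypothetical: they illustrate the shape of the result, not actual predictions from this model.

```python
def format_entities(entities, min_score=0.0):
    """Render aggregated NER pipeline output as one line per entity."""
    lines = []
    for ent in entities:
        # Skip low-confidence entities when a threshold is given
        if ent["score"] < min_score:
            continue
        lines.append(
            f'{ent["word"]} [{ent["entity_group"]}] '
            f'chars {ent["start"]}-{ent["end"]} score={ent["score"]:.2f}'
        )
    return lines


# Hand-written sample in the pipeline's output shape (illustrative values only)
sample_output = [
    {"entity_group": "person", "score": 0.98, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "location", "score": 0.41, "word": "Redmond", "start": 47, "end": 54},
]

for line in format_entities(sample_output, min_score=0.5):
    print(line)
```

In practice, `format_entities(ner_pipeline(sentence))` could replace the raw `print` call in the loop above.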
| ### Quantizing the Model | |
| ```python | |
| import torch | |
| # Cast the weights to half precision; note that .to() converts `model` in place | |
| quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu") | |
| quantized_model.save_pretrained("quantized-model") | |
| tokenizer.save_pretrained("quantized-model") | |
| ``` | |
| ### Testing the Quantized Model | |
| ```python | |
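| import torch | |
| from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification | |
| # Load the saved weights in float16; half precision is best run on a GPU | |
| model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16) | |
| tokenizer = AutoTokenizer.from_pretrained("quantized-model") | |
| ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") | |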
| ``` | |
| --- | |
| ## Performance Metrics | |
| - **Accuracy:** Evaluated using seqeval on the validation split | |
| - **Precision, Recall, F1 Score:** Computed with seqeval from the predicted labels, excluding positions marked with the ignore index | |
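Ignored indices are positions, such as special tokens, whose label is set to an ignore value so that they count toward neither the loss nor the metrics. The sketch below assumes the conventional ignore value of -100, class ids that have already been taken as the argmax of the logits, and a toy `id2label` mapping. `strip_ignored` is an illustrative helper, not necessarily the function used in this repository.

```python
IGNORE_INDEX = -100  # assumed ignore value (the PyTorch cross-entropy default)


def strip_ignored(pred_ids, label_ids, id2label):
    """Drop ignored positions and map ids to tag strings, per sentence."""
    true_preds, true_labels = [], []
    for preds, labels in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(preds, labels) if l != IGNORE_INDEX]
        true_preds.append([id2label[p] for p, _ in kept])
        true_labels.append([id2label[l] for _, l in kept])
    return true_preds, true_labels


# Toy mapping and batch for illustration (not the full WNUT-17 tag set)
id2label = {0: "O", 1: "B-person", 2: "I-person"}
pred_ids = [[0, 1, 2, 0, 0]]
label_ids = [[-100, 1, 2, -100, 0]]

preds, refs = strip_ignored(pred_ids, label_ids, id2label)
print(preds)
print(refs)
```

The two returned lists are in the format seqeval expects, for example `evaluate.load("seqeval").compute(predictions=preds, references=refs)`, whose result includes `overall_precision`, `overall_recall`, `overall_f1`, and `overall_accuracy`.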
| --- | |
| ## Fine-Tuning Details | |
| ### Dataset | |
| The model was fine-tuned on the WNUT-17 dataset, a benchmark for recognizing emerging and rare named entities in noisy user-generated text. Preprocessing includes: | |
| - Tokenization using BERT tokenizer | |
| - Label alignment for wordpiece tokens | |
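Label alignment is needed because WNUT-17 labels are given per word, while the BERT tokenizer may split a word into several wordpieces. A common approach, sketched below, keeps the label on the first wordpiece of each word and assigns an ignore value (assumed here to be -100) to special tokens and continuation pieces. The `align_labels` helper is hypothetical; this README does not state whether the repository labels only the first wordpiece or all of them.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to token level.

    word_ids: for each token, the index of its source word, or None for
    special tokens (the shape returned by a fast tokenizer's word_ids()).
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)           # special token or padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first wordpiece of a word
        else:
            aligned.append(ignore_index)           # continuation wordpiece
        previous = word_id
    return aligned


# Three words; the second word is split into two wordpieces
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 7, 0]
print(align_labels(word_ids, word_labels))
```

With a Hugging Face fast tokenizer, `word_ids` would typically come from `tokenizer(words, is_split_into_words=True, truncation=True).word_ids()`.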
| ### Training Configuration | |
| - **Epochs:** 3 | |
| - **Batch Size:** 16 | |
| - **Learning Rate:** 2e-5 | |
| - **Max Length:** 128 tokens (implicitly handled by tokenizer) | |
| - **Evaluation Strategy:** Per epoch | |
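As a rough worked example of what this configuration implies, the number of optimizer steps follows from the training-set size, batch size, and epoch count. The sketch assumes a single device, no gradient accumulation, and a training split of 3,394 sentences. That size is an assumption based on the figure listed for WNUT-17 on the Hugging Face Hub, so check it against your local copy.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs


# 3394 is an assumed size for the WNUT-17 training split
steps = total_training_steps(num_examples=3394, batch_size=16, epochs=3)
print(steps)
```

With per-epoch evaluation, the validation split would be scored three times over such a run.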
| ### Quantization | |
| The model was converted to half precision (float16) using PyTorch, which reduces the memory footprint of the weights and can speed up inference, particularly on GPUs. | |
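To make the memory saving concrete: float32 stores each weight in 4 bytes and float16 in 2. The sketch below assumes roughly 108 million parameters for a BERT-base cased model. That count is approximate, and the estimate covers the weights only, not activations or framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mib(num_params, dtype):
    """Approximate memory needed to hold the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


approx_params = 108_000_000  # assumed, approximate size of BERT-base cased

fp32 = weight_memory_mib(approx_params, "float32")
fp16 = weight_memory_mib(approx_params, "float16")
print(f"float32: {fp32:.0f} MiB, float16: {fp16:.0f} MiB")
```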
| --- | |
| ## Repository Structure | |
| ``` | |
| . | |
| ├── saved_model/ # Fine-Tuned BERT Model and Tokenizer | |
| ├── quantized-model/ # Quantized Model for Deployment | |
| ├── ner_output/ # Training Logs and Checkpoints | |
| └── README.md # Documentation | |
| ``` | |
| --- | |
| ## Limitations | |
| - May not generalize well to text domains or entity types beyond those covered by WNUT-17 | |
| - The float16 model trades a possible small loss in accuracy for lower memory use and faster inference | |
| --- | |
| ## Contributing | |
| Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions. | |
| --- | |