# BERT-Based Named Entity Recognition (NER) Model

This repository contains a BERT-based model fine-tuned for Named Entity Recognition (NER) on the WNUT-17 dataset. The model is trained with the Hugging Face Transformers and Datasets libraries and supports inference and float16 conversion for deployment in resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base-Cased NER
- **Model Architecture:** BERT Base
- **Task:** Named Entity Recognition (NER)
- **Dataset:** WNUT-17 (from Hugging Face Datasets)
- **Quantization:** Float16 (half precision)
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers datasets evaluate seqeval scikit-learn torch
```
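### Preparing the Data and Metrics

The training step in the next section relies on a tokenized dataset, a data collator, a metrics function, and training arguments that the snippets below do not define. The following is a minimal sketch of one way to set them up, consistent with the preprocessing and configuration described under Fine-Tuning Details; treat the checkpoint name, column names, and hyperparameters as illustrative rather than the exact code used to train the released model.

```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

raw_datasets = load_dataset("wnut_17")
label_names = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_names)
)

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; the first wordpiece of each word keeps the
    # word's label, while special tokens and continuation pieces get -100
    # so the loss and the metrics ignore them.
    tokenized = tokenizer(
        examples["tokens"], truncation=True, max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        aligned = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word:
                aligned.append(-100)
            else:
                aligned.append(labels[word_id])
            previous_word = word_id
        all_labels.append(aligned)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the ignored indices (-100) before scoring with seqeval.
    true_labels = [
        [label_names[l] for l in row if l != -100] for row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir="ner_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # "evaluation_strategy" on transformers < 4.41
)
```

Masking everything except the first wordpiece of each word with `-100` is the convention used in the Transformers token-classification examples; both the loss and `compute_metrics` skip those positions.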
### Training the Model

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```

### Saving the Model

```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```

### Testing the Saved Model

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California.",
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
```

### Quantizing the Model

```python
import torch

# Cast the weights to half precision (float16). This is a dtype conversion
# rather than integer quantization, but it halves the memory footprint.
quantized_model = model.to(dtype=torch.float16,
                           device="cuda" if torch.cuda.is_available() else "cpu")
quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```

### Testing the Quantized Model

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
```

---

## Performance Metrics

- **Accuracy:** Evaluated with seqeval on the validation split
- **Precision, Recall, F1 Score:** Computed from label-wise predictions, excluding the ignored index (-100)

---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on WNUT-17, a benchmark dataset for emerging and rare named entities. Preprocessing includes:

- Tokenization with the BERT tokenizer
- Label alignment for wordpiece tokens (see the data-preparation sketch under Usage)

### Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens (applied via truncation during tokenization)
- **Evaluation Strategy:** Per epoch

### Quantization

The model was converted to half precision (float16) using PyTorch to reduce its memory footprint and inference time.

---

## Repository Structure

```
.
├── saved_model/       # Fine-tuned BERT model and tokenizer
├── quantized-model/   # Float16 model for deployment
├── ner_output/        # Training logs and checkpoints
├── README.md          # Documentation
```

---

## Limitations

- May not generalize well to domains outside the WNUT-17 entity types
- The float16 model trades a small amount of accuracy for faster inference and a smaller memory footprint

---

## Contributing

Contributions are welcome! Please raise an issue or open a PR for improvements, bug fixes, or feature additions.

---