# Model Overview
This model is a Named Entity Recognition (NER) system fine-tuned on the WNUT 17 dataset using DistilBERT. It predicts entity types such as persons, locations, corporations, products, creative works, and groups from text.
# Model Details
```
Model Type: Transformer-based NER
Base Model: DistilBERT (distilbert-base-uncased)
Dataset: WNUT 17
Training Framework: PyTorch & Hugging Face Transformers
Training Epochs: 3
Batch Size: 16
Learning Rate: 2e-5
Optimizer: AdamW
Weight Decay: 0.01
Evaluation Strategy: Per epoch
```
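These hyperparameters map directly onto Hugging Face `TrainingArguments`. A minimal sketch of the equivalent configuration (the output directory is a placeholder; the argument names are the standard Transformers API):
```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments setup matching the hyperparameters above.
# "./wnut17-distilbert" is a placeholder output path.
training_args = TrainingArguments(
    output_dir="./wnut17-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,            # AdamW is the default optimizer in Transformers
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent releases
)
```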
# Training Data
The model is trained on the WNUT 17 dataset, which contains challenging named entities drawn from social media and conversational text. The dataset provides token-level NER annotations in six entity categories:
```
person
location
corporation
product
creative-work
group
```
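The dataset and its BIO label set can be inspected with the `datasets` library; a minimal sketch:
```python
from datasets import load_dataset

# Load WNUT 17 from the Hugging Face Hub
wnut = load_dataset("wnut_17")

# Label names follow the BIO scheme, e.g. B-person, I-person, ...
label_names = wnut["train"].features["ner_tags"].feature.names
print(label_names)

# Each example is a pair of parallel lists: tokens and tag ids
example = wnut["train"][0]
print(example["tokens"])
print(example["ner_tags"])
```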
# Inference and Usage
```python
# Load the model and tokenizer
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

model_name = "AventIQ-AI/distilbert-base-uncased_token_classification"
model = DistilBertForTokenClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model.eval()

def predict_entities(text, model, tokenizer):
    """Predict named entities with the quantized model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cast logits to FP32 before argmax for numerical stability
    predictions = torch.argmax(outputs.logits.float(), dim=2)
    predicted_labels = [model.config.id2label[t.item()] for t in predictions[0]]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Skip special tokens and merge subword pieces back into entity spans
    entities = []
    current_entity = None
    for token, label in zip(tokens, predicted_labels):
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue
        if token.startswith("##"):  # subword continuation
            if current_entity:
                current_entity["text"] += token[2:]
            continue
        if label == "O":
            if current_entity:
                entities.append(current_entity)
                current_entity = None
        elif label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {"text": token, "type": label[2:]}
        elif label.startswith("I-") and current_entity:
            current_entity["text"] += " " + token
    if current_entity:
        entities.append(current_entity)
    return entities
```
# Example Usage
```python
test_sentences = ["Apple CEO Tim Cook announced the new iPhone 14 at their headquarters in Cupertino."]
for sentence in test_sentences:
    print(f"\nInput: {sentence}")
    entities = predict_entities(sentence, model, tokenizer)
    print("Detected entities:")
    for entity in entities:
        print(f"- {entity['text']} ({entity['type']})")
    print("-" * 50)
```
## Evaluation Results for Quantized Model
### Overall Performance
- **Accuracy**: 97.10%
- **Precision**: 89.52%
- **Recall**: 90.67%
- **F1 Score**: 90.09%
---
### Performance by Entity Type
| Entity Type | Precision | Recall | F1 Score | Number of Entities |
|------------|-----------|--------|----------|--------------------|
| **LOC** (Location) | 91.46% | 92.07% | 91.76% | 3,000 |
| **MISC** (Miscellaneous) | 71.25% | 72.83% | 72.03% | 1,266 |
| **ORG** (Organization) | 89.83% | 93.02% | 91.40% | 3,524 |
| **PER** (Person) | 95.16% | 94.04% | 94.60% | 2,989 |
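Entity-level precision, recall, and F1 scores like those above are conventionally computed with the `seqeval` library. A minimal sketch with toy BIO sequences (the tags below are illustrative, not actual model output):
```python
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted BIO tag sequences (illustrative only)
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```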
---
#### Inference Speed Metrics
- **Total Evaluation Time**: 15.89 sec
- **Samples Processed per Second**: 217.26
- **Steps per Second**: 27.18
- **Epochs Completed**: 3
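Throughput figures like these can be approximated by timing the prediction loop directly; a minimal sketch (absolute numbers depend on hardware and batching):
```python
import time

# Time single-sentence inference over a repeated sample (illustrative)
sentences = ["Apple CEO Tim Cook visited Cupertino."] * 100
start = time.perf_counter()
for s in sentences:
    predict_entities(s, model, tokenizer)
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f} s")
print(f"Samples per second: {len(sentences) / elapsed:.2f}")
```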
---
## Fine-Tuning Details
### Dataset
The Hugging Face `wnut_17` dataset was used, containing texts and their NER tags.
### Training Details
- **Number of epochs**: 3
- **Batch size**: 16
- **Evaluation strategy**: epoch
- **Learning rate**: 2e-5
### Quantization
Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
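The card does not state the exact recipe; for Transformer models, a common choice is dynamic quantization of the linear layers, sketched here under that assumption:
```python
import torch

# Assumed recipe: post-training dynamic quantization of Linear layers.
# The exact method used for this model is not specified in the card.
quantized_model = torch.quantization.quantize_dynamic(
    model,                # fine-tuned FP32 model
    {torch.nn.Linear},    # layer types to quantize
    dtype=torch.qint8,    # 8-bit integer weights
)
```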
---
## Repository Structure
```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```
---
## Limitations
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.
---
## Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.