# Model Overview
This model is a Named Entity Recognition (NER) system built by fine-tuning DistilBERT on the WNUT 17 dataset. It predicts entity types such as persons, locations, corporations, products, creative works, and groups from text.
# Model Details
```
Model Type: Transformer-based NER
Base Model: DistilBERT (distilbert-base-uncased)
Dataset: WNUT 17
Training Framework: PyTorch & Hugging Face Transformers
Training Epochs: 3
Batch Size: 16
Learning Rate: 2e-5
Optimizer: AdamW
Weight Decay: 0.01
Evaluation Strategy: Per epoch
```
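The hyperparameters above map directly onto a standard Hugging Face `Trainer` setup. The following is a sketch of such a configuration, not the exact training script used for this model; argument names follow the `transformers` `TrainingArguments` API (`evaluation_strategy` is renamed `eval_strategy` in newer library versions), and the model/dataset wiring is omitted:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,              # Training Epochs: 3
    per_device_train_batch_size=16,  # Batch Size: 16
    learning_rate=2e-5,              # Learning Rate: 2e-5
    weight_decay=0.01,               # Weight Decay: 0.01
    evaluation_strategy="epoch",     # Evaluation Strategy: per epoch
)

# AdamW is the Trainer's default optimizer, matching the table above.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)
# trainer.train()
```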
# Training Data
The model is trained on the WNUT 17 dataset, which contains challenging named entities in social media and conversational text. The dataset provides annotations for named entity recognition, including entity categories such as:
```
person
location
corporation
product
creative-work
group
```
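Under the BIO tagging scheme used by WNUT 17, each category above gets a `B-` (beginning-of-entity) and `I-` (inside-entity) tag, plus a single `O` tag for non-entity tokens, giving 13 labels in total. A minimal sketch (the exact id-to-label ordering is defined by the dataset and model config, so treat this ordering as illustrative):

```python
categories = ["person", "location", "corporation", "product", "creative-work", "group"]

# O for non-entity tokens, then a B-/I- pair per category (BIO scheme)
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]

print(len(labels))  # → 13
```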
# Inference and Usage
```python
# Load the fine-tuned model and tokenizer
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

model_name = "AventIQ-AI/distilbert-base-uncased_token_classification"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_entities(text, model, tokenizer):
    """Predict named entities in `text` using the quantized model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cast logits to FP32 for stability with the quantized model
    predictions = torch.argmax(outputs.logits.float(), dim=2)
    predicted_labels = [model.config.id2label[t.item()] for t in predictions[0]]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Skip special tokens, merge subwords, and group B-/I- tags into entities
    entities = []
    current_entity = None
    for token, label in zip(tokens, predicted_labels):
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue
        if token.startswith("##"):  # Merge WordPiece subwords into the current entity
            if current_entity:
                current_entity["text"] += token[2:]
            continue
        if label == "O":
            if current_entity:
                entities.append(current_entity)
                current_entity = None
        elif label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {"text": token, "type": label[2:]}
        elif label.startswith("I-") and current_entity:
            current_entity["text"] += " " + token
    if current_entity:
        entities.append(current_entity)
    return entities
```
# Example Usage
```python
test_sentences = ["Apple CEO Tim Cook announced the new iPhone 14 at their headquarters in Cupertino."]
for sentence in test_sentences:
print(f"\nInput: {sentence}")
entities = predict_entities(sentence, model, tokenizer)
print("Detected entities:")
for entity in entities:
print(f"- {entity['text']} ({entity['type']})")
print("-" * 50)
```
## 📊 Evaluation Results for Quantized Model
### **🔹 Overall Performance**
- **Accuracy**: **97.10%** ✅
- **Precision**: **89.52%**
- **Recall**: **90.67%**
- **F1 Score**: **90.09%**
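As a sanity check, the reported F1 score agrees with the precision and recall above; a quick recomputation:

```python
# Values from this card's evaluation results
precision = 0.8952
recall = 0.9067

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1 * 100, 2))  # → 90.09, matching the reported F1 score
```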
---
### **🔹 Performance by Entity Type**
| Entity Type | Precision | Recall | F1 Score | Number of Entities |
|------------|-----------|--------|----------|--------------------|
| **LOC** (Location) | **91.46%** | **92.07%** | **91.76%** | 3,000 |
| **MISC** (Miscellaneous) | **71.25%** | **72.83%** | **72.03%** | 1,266 |
| **ORG** (Organization) | **89.83%** | **93.02%** | **91.40%** | 3,524 |
| **PER** (Person) | **95.16%** | **94.04%** | **94.60%** | 2,989 |
---
#### ⏳ **Inference Speed Metrics**
- **Total Evaluation Time**: 15.89 sec
- **Samples Processed per Second**: 217.26
- **Steps per Second**: 27.18
- **Epochs Completed**: 3
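These throughput figures are internally consistent, which is a quick way to sanity-check reported benchmarks:

```python
# Values from this card's inference speed metrics
samples_per_sec = 217.26
steps_per_sec = 27.18
total_time = 15.89  # seconds

print(round(samples_per_sec * total_time))     # → 3452 samples evaluated in total
print(round(samples_per_sec / steps_per_sec))  # → 8 samples per evaluation step
```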
---
## Fine-Tuning Details
### Dataset
The Hugging Face `wnut_17` dataset was used, containing texts annotated with their NER tags.
## 📈 Training Details
- **Number of epochs**: 3
- **Batch size**: 16
- **Evaluation strategy**: epoch
- **Learning Rate**: 2e-5
### ⚡ Quantization
Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
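A minimal sketch of this approach using PyTorch's dynamic quantization API, shown here on a small stand-in network rather than the actual fine-tuned DistilBERT (which is what would be passed in practice); dynamic quantization converts the model's `Linear` layers to int8 weights with activations quantized on the fly at inference time:

```python
import torch
import torch.nn as nn

# Stand-in for the fine-tuned model; in practice this would be the
# DistilBertForTokenClassification instance loaded above.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 13))

# Post-training dynamic quantization: Linear weights become int8
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 768))
print(out.shape)  # → torch.Size([1, 13])
```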
---
## 📂 Repository Structure
```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```
---
## ⚠️ Limitations
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.
---
## 🤝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
|