# Model Overview

This model is a Named Entity Recognition (NER) system built by fine-tuning DistilBERT on the WNUT 17 dataset. Given input text, it labels entity spans with types such as person, location, corporation, product, creative-work, and group.

# Model Details
```
Model Type: Transformer-based NER
Base Model: DistilBERT (distilbert-base-uncased)
Dataset: WNUT 17
Training Framework: PyTorch & Hugging Face Transformers
Training Epochs: 3
Batch Size: 16
Learning Rate: 2e-5
Optimizer: AdamW
Weight Decay: 0.01
Evaluation Strategy: Per epoch
```
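
These settings map directly onto Hugging Face `TrainingArguments`. Below is a minimal sketch of the implied fine-tuning setup; the actual training script is not published, so the output directory and the `model`, `tokenizer`, and `tokenized_wnut` variables are assumptions:

```python
from transformers import (
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical reconstruction of the configuration listed above.
training_args = TrainingArguments(
    output_dir="distilbert-wnut17",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,               # AdamW is the Trainer's default optimizer
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,  # assumed: a DistilBertForTokenClassification with 13 labels
    args=training_args,
    train_dataset=tokenized_wnut["train"],       # assumed: tokenized WNUT 17 splits
    eval_dataset=tokenized_wnut["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```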

# Training Data

The model was fine-tuned on the WNUT 17 dataset, which focuses on novel and emerging named entities in social media and other noisy, user-generated text. The dataset provides named entity annotations for six categories:
```
person
location
corporation
product
creative-work
group
```
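
The dataset can be loaded and inspected with the Hugging Face `datasets` library (shown here as one convenient option):

```python
from datasets import load_dataset

# Load WNUT 17 and list its BIO-tagged label set
wnut = load_dataset("wnut_17")
label_names = wnut["train"].features["ner_tags"].feature.names
print(label_names)
# ['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work',
#  'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person',
#  'B-product', 'I-product']
```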

# Inference and Usage


```python
# Load the fine-tuned model and tokenizer
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

model_name = "AventIQ-AI/distilbert-base-uncased_token_classification"
model = DistilBertForTokenClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model.eval()

def predict_entities(text, model, tokenizer):
    """Predict Named Entities from the quantized model"""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    # Convert to FP32 if needed (for stability)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits.float(), dim=2)  # Convert logits to float32
    
    predicted_labels = [model.config.id2label[t.item()] for t in predictions[0]]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Remove special tokens and align subwords
    entities = []
    current_entity = None

    for token, label in zip(tokens, predicted_labels):
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue
            
        if token.startswith("##"):  # Handle subwords
            if current_entity:
                current_entity["text"] += token[2:]
            continue
            
        if label == "O":
            if current_entity:
                entities.append(current_entity)
                current_entity = None
        else:
            if label.startswith("B-"):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {"text": token, "type": label[2:]}
            elif label.startswith("I-") and current_entity:
                current_entity["text"] += " " + token

    if current_entity:
        entities.append(current_entity)

    return entities
```

# Example Usage
```python
test_sentences = ["Apple CEO Tim Cook announced the new iPhone 14 at their headquarters in Cupertino."]
for sentence in test_sentences:
    print(f"\nInput: {sentence}")
    entities = predict_entities(sentence, model, tokenizer)
    print("Detected entities:")
    for entity in entities:
        print(f"- {entity['text']} ({entity['type']})")
    print("-" * 50)
```
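
`predict_entities` returns one dictionary per detected entity, with `text` and `type` keys, so the loop above prints one line per entity span found in the sentence.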

## 📊 Evaluation Results for Quantized Model
 
### **🔹 Overall Performance**

- **Accuracy**: **97.10%** ✅
- **Precision**: **89.52%**
- **Recall**: **90.67%**
- **F1 Score**: **90.09%**
 
---
 
### **🔹 Performance by Entity Type**
 
| Entity Type | Precision | Recall | F1 Score | Number of Entities |
|------------|-----------|--------|----------|--------------------|
| **LOC** (Location) | **91.46%** | **92.07%** | **91.76%** | 3,000 |
| **MISC** (Miscellaneous) | **71.25%** | **72.83%** | **72.03%** | 1,266 |
| **ORG** (Organization) | **89.83%** | **93.02%** | **91.40%** | 3,524 |
| **PER** (Person) | **95.16%** | **94.04%** | **94.60%** | 2,989 |
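
Entity-level scores like these are conventionally computed with the `seqeval` library. A minimal illustrative sketch (an assumption, since the evaluation harness used for this model is not published):

```python
from seqeval.metrics import classification_report

# Toy example: gold vs. predicted BIO tag sequences for two sentences
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "O"]]

# Prints per-entity-type precision, recall, and F1, as in the table above
print(classification_report(y_true, y_pred))
```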

---
#### ⏳ **Inference Speed Metrics**
- **Total Evaluation Time**: 15.89 sec
- **Samples Processed per Second**: 217.26
- **Steps per Second**: 27.18
- **Epochs Completed**: 3
 
---
## Fine-Tuning Details
### Dataset
The Hugging Face `wnut_17` dataset was used; it contains social media texts annotated with NER tags.

## 📊 Training Details
- **Number of epochs**: 3  
- **Batch size**: 16 
- **Evaluation strategy**: epoch
- **Learning Rate**: 2e-5

### ⚡ Quantization
Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
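
A minimal sketch of what such post-training quantization can look like with PyTorch's dynamic quantization API; the exact recipe used for this model is an assumption:

```python
import torch

# `model` is assumed to be the fine-tuned full-precision
# DistilBertForTokenClassification loaded earlier. Dynamic quantization
# stores Linear layer weights as int8 and quantizes activations on the
# fly at inference time, shrinking the model and speeding up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```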

---
## 📂 Repository Structure
```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```

---
## ⚠️ Limitations
- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

---
## 🤝 Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.