YashikaNagpal committed
Commit 95f81a4 · verified · 1 Parent(s): 0c68ec5

Update README.md

  1. README.md +113 -12
README.md CHANGED
@@ -47,24 +47,125 @@ group
 
 
  ```python
- Loading the Model
 
  from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
  import torch
 
- def load_model_and_tokenizer(model_path):
-     tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
-     model = DistilBertForTokenClassification.from_pretrained(model_path)
-     return model, tokenizer
-
- model_path = "./final_model"
- model, tokenizer = load_model_and_tokenizer(model_path)
- model.eval()
  ```
 
  # Example Usage
 
- test_sentence = "Apple CEO Tim Cook announced the new iPhone 14 at their headquarters in Cupertino."
- entities = predict_entities(test_sentence, model, tokenizer)
- print(entities)
 
 
 
  ```python
+ # Loading the Model
 
  from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
  import torch
 
+ model_name = "AventIQ-AI/distilbert-base-uncased_token_classification"
+
+ def predict_entities(text, model, tokenizer):
+     """Predict Named Entities from the quantized model"""
+     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+     # Convert to FP32 if needed (for stability)
+     with torch.no_grad():
+         outputs = model(**inputs)
+         predictions = torch.argmax(outputs.logits.float(), dim=2)  # Convert logits to float32
+
+     predicted_labels = [model.config.id2label[t.item()] for t in predictions[0]]
+     tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+
+     # Remove special tokens and align subwords
+     entities = []
+     current_entity = None
+
+     for token, label in zip(tokens, predicted_labels):
+         if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
+             continue
+
+         if token.startswith("##"):  # Handle subwords
+             if current_entity:
+                 current_entity["text"] += token[2:]
+             continue
+
+         if label == "O":
+             if current_entity:
+                 entities.append(current_entity)
+                 current_entity = None
+         else:
+             if label.startswith("B-"):
+                 if current_entity:
+                     entities.append(current_entity)
+                 current_entity = {"text": token, "type": label[2:]}
+             elif label.startswith("I-") and current_entity:
+                 current_entity["text"] += " " + token
+
+     if current_entity:
+         entities.append(current_entity)
+
+     return entities
  ```
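The entity-merging loop in `predict_entities` can be tried without downloading the model. The sketch below reproduces only that loop as a standalone function; the name `merge_bio` and the token/label lists are hypothetical illustrations, not actual tokenizer or model output, and special tokens are assumed to have been dropped already.

```python
def merge_bio(tokens, labels):
    """Group WordPiece tokens and their BIO labels into entity dicts."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if token.startswith("##"):
            # A continuation piece is glued onto the open entity, if there is one.
            if current:
                current["text"] += token[2:]
            continue
        if label.startswith("B-"):
            # A new entity starts; close the previous one first.
            if current:
                entities.append(current)
            current = {"text": token, "type": label[2:]}
        elif label.startswith("I-") and current:
            # Extend the open entity with a new whole word.
            current["text"] += " " + token
        elif label == "O" and current:
            # An outside label closes the open entity.
            entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


# Hand-written example input (hypothetical, for illustration only).
tokens = ["apple", "ceo", "tim", "cook", "visited", "cup", "##ert", "##ino"]
labels = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
entities = merge_bio(tokens, labels)
for entity in entities:
    print(f"- {entity['text']} ({entity['type']})")
```

As in the README function, an `I-` label that arrives with no open entity is silently dropped, and a subword piece inherits the entity of the word it continues regardless of its own label.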
 
  # Example Usage
+ ```python
+ test_sentences = ["Apple CEO Tim Cook announced the new iPhone 14 at their headquarters in Cupertino."]
+ for sentence in test_sentences:
+     print(f"\nInput: {sentence}")
+     entities = predict_entities(sentence, model, tokenizer)
+     print("Detected entities:")
+     for entity in entities:
+         print(f"- {entity['text']} ({entity['type']})")
+     print("-" * 50)
+ ```
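The usage snippet calls `predict_entities(sentence, model, tokenizer)`, but the new revision defines `model_name` without creating `model` or `tokenizer`. One possible way to obtain them is sketched below. It is an assumption, not the author's confirmed method: it reuses the helper name and the `distilbert-base-uncased` tokenizer from the previous revision, and it assumes the Hub repository named in `model_name` loads through the standard `from_pretrained` call.

```python
def load_model_and_tokenizer(
    model_name="AventIQ-AI/distilbert-base-uncased_token_classification",
    tokenizer_name="distilbert-base-uncased",  # assumption: taken from the previous revision
):
    """Return (model, tokenizer) ready for inference."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained(tokenizer_name)
    model = DistilBertForTokenClassification.from_pretrained(model_name)
    model.eval()  # switch off dropout for inference
    return model, tokenizer


# Usage (downloads the weights on first call):
# model, tokenizer = load_model_and_tokenizer()
```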
+
+ ## 📊 Evaluation Results for Quantized Model
+
+ ### **🔹 Overall Performance**
+
+ - **Accuracy**: **97.10%** ✅
+ - **Precision**: **89.52%**
+ - **Recall**: **90.67%**
+ - **F1 Score**: **90.09%**
+
+ ---
+
+ ### **🔹 Performance by Entity Type**
+
+ | Entity Type | Precision | Recall | F1 Score | Number of Entities |
+ |------------|-----------|--------|----------|--------------------|
+ | **LOC** (Location) | **91.46%** | **92.07%** | **91.76%** | 3,000 |
+ | **MISC** (Miscellaneous) | **71.25%** | **72.83%** | **72.03%** | 1,266 |
+ | **ORG** (Organization) | **89.83%** | **93.02%** | **91.40%** | 3,524 |
+ | **PER** (Person) | **95.16%** | **94.04%** | **94.60%** | 2,989 |
+
+ ---
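The F1 scores above can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that for the overall figures and for each entity type; the function name `f1_score` is chosen here for illustration, and the numbers are copied from the README.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the README.
reported = {
    "overall": (89.52, 90.67, 90.09),
    "LOC": (91.46, 92.07, 91.76),
    "MISC": (71.25, 72.83, 72.03),
    "ORG": (89.83, 93.02, 91.40),
    "PER": (95.16, 94.04, 94.60),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {f1:.2f}")
```

The computed and reported values agree to two decimal places in every row.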
+ #### ⏳ **Inference Speed Metrics**
+ - **Total Evaluation Time**: 15.89 sec
+ - **Samples Processed per Second**: 217.26
+ - **Steps per Second**: 27.18
+ - **Epochs Completed**: 3
+
+ ---
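The README does not say how these figures were measured. Assuming they describe a single pass over the evaluation set, with one batch per step, they imply an approximate evaluation set size and evaluation batch size. The sketch below is only that arithmetic; the derived values are estimates, not figures stated in the README.

```python
# Figures copied from the README.
eval_seconds = 15.89
samples_per_second = 217.26
steps_per_second = 27.18

# Assumption: one evaluation pass, one batch per step.
samples = eval_seconds * samples_per_second   # approximate number of evaluated samples
steps = eval_seconds * steps_per_second       # approximate number of evaluation steps
batch = samples_per_second / steps_per_second # approximate samples per step

print(f"samples ~ {samples:.0f}, steps ~ {steps:.0f}, batch size ~ {batch:.1f}")
```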
+ ## Fine-Tuning Details
+ ### Dataset
+ The `wnut_17` dataset from Hugging Face was used; it contains texts and their NER tags.
+
+ ### 📊 Training Details
+ - **Number of epochs**: 3
+ - **Batch size**: 16
+ - **Evaluation strategy**: epoch
+ - **Learning Rate**: 2e-5
+
+ ### ⚑ Quantization
+ Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
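The README does not state which quantization scheme or data type was used. To illustrate the general idea of post-training quantization, the sketch below applies symmetric per-tensor int8 quantization to a small list of weights in plain Python. It is not PyTorch's implementation; the function names and the example weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. That error is the source of the small accuracy loss that quantized models can show.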
+
+ ---
+ ## 📂 Repository Structure
+ ```
+ .
+ ├── model/               # Contains the quantized model files
+ ├── tokenizer_config/    # Tokenizer configuration and vocabulary files
+ ├── model.safetensors/   # Quantized Model
+ ├── README.md            # Model documentation
+ ```
+
+ ---
+ ## ⚠️ Limitations
+ - The model may not generalize well to domains outside the fine-tuning dataset.
+ - Quantization may result in minor accuracy degradation compared to full-precision models.
 
+ ---
+ ## 🤝 Contributing
+ Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.