# BERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the BERT model, fine-tuned for spam detection tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
## Model Details

- **Model Architecture:** BERT Base Uncased
- **Task:** Spam Email Detection
- **Dataset:** Hugging Face's `mail_spam_ham_dataset` and `spam-mail`
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model
```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Move the model to the GPU if available and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_spam_quantized(text):
    """Predicts whether a given text is spam (1) or ham (0) using the quantized BERT model."""
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)
    # Get predicted label (0 = ham, 1 = spam)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return "Spam" if prediction == 1 else "Ham"

# Sample test messages
print(predict_spam_quantized("WINNER!! As a valued network customer you have been selected to receive a £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."))
# Expected output: Spam
print(predict_spam_quantized("Hi, are we still on for lunch tomorrow at noon?"))
# Expected output: Ham
```
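For quick experiments, the same checkpoint can also be called through the `transformers` pipeline API. A minimal sketch (the returned label names depend on the model's `config.json`, so the `LABEL_0`/`LABEL_1` mapping shown here is an assumption):

```python
from transformers import pipeline

# Wraps tokenization and inference in one call; labels come from the model
# config, so map "LABEL_0"/"LABEL_1" to Ham/Spam as appropriate.
classifier = pipeline("text-classification", model="AventIQ-AI/bert-spam-detection")
print(classifier([
    "Congratulations, you won a free cruise! Reply now to claim.",
    "Can you send me the meeting notes from yesterday?",
]))
```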
## 📊 Classification Report (Quantized Model - float16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |
| **Accuracy**  | **99%**            | **99%**        | **99%**   | **99%**      |
### 🔍 **Observations**

- ✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **Few false positives**
- ✅ **Recall:** High (0.99 for both classes) → **Few false negatives**
- ✅ **F1-Score:** **Near-perfect balance** between precision and recall
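The table above follows scikit-learn's `classification_report` layout. A sketch of how such a report can be produced (the label arrays below are placeholders; real values would come from an evaluation loop over the test split):

```python
from sklearn.metrics import classification_report

# Placeholder labels for illustration (0 = ham, 1 = spam);
# in practice these come from the test set and the model's predictions.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(classification_report(y_true, y_pred, target_names=["Ham", "Spam"]))
```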
## Fine-Tuning Details

### Dataset

The Hugging Face `spam-mail` and `mail_spam_ham_dataset` datasets were combined for fine-tuning, providing both spam and ham (non-spam) examples.
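A sketch of how the two datasets could be combined with the `datasets` library (the identifiers below are taken from this README and may need their full `user/name` Hub paths; concatenation also assumes both datasets share the same column schema):

```python
from datasets import load_dataset, concatenate_datasets

# Identifiers as referenced in this README; full Hub paths may differ.
spam_ham = load_dataset("mail_spam_ham_dataset", split="train")
spam_mail = load_dataset("spam-mail", split="train")

# concatenate_datasets requires matching column names and types
combined = concatenate_datasets([spam_ham, spam_mail]).shuffle(seed=42)
```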
### Training

- Number of epochs: 3
- Batch size: 8
- Evaluation strategy: epoch
- Learning rate: 2e-5
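These hyperparameters map directly onto Hugging Face `TrainingArguments`. A minimal sketch (tokenization and dataset preparation are omitted; `model`, `train_dataset`, and `eval_dataset` are assumed to be defined):

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,                  # BertForSequenceClassification with num_labels=2
    args=training_args,
    train_dataset=train_dataset,  # assumed pre-tokenized train split
    eval_dataset=eval_dataset,    # assumed pre-tokenized validation split
)
trainer.train()
```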
### Quantization

Post-training quantization was applied with PyTorch by converting the model weights to float16 (half precision), reducing the model size and improving inference efficiency.
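A minimal sketch of such a float16 conversion (the exact quantization script is not part of this repository, so the procedure and paths below are assumptions):

```python
from transformers import BertForSequenceClassification

# Load the fine-tuned full-precision model, then cast all weights to float16
model = BertForSequenceClassification.from_pretrained("path/to/finetuned-model")
model = model.half()
model.save_pretrained("model/")
```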
## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
└── README.md            # Model documentation
```
## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.