**SMSDetection-DistilBERT-SMS**

A DistilBERT-based binary classifier fine-tuned on the SMS Spam Collection dataset. It labels each message as **spam** or **ham** (not spam) and is suited to applications such as mobile SMS spam filters, automated customer message triage, and telecom fraud detection.

---

**Model Highlights**

- Based on `distilbert-base-uncased`
- Fine-tuned on the SMS Spam Collection dataset
- Binary classification: spam vs. not spam (ham)
- Lightweight enough to run on both CPU and GPU

---

**Intended Uses**

- Mobile SMS spam filtering
- Telecom customer service automation
- Fraudulent message detection
- User inbox categorization
- Regulatory compliance monitoring

---

**Limitations**

- Trained on English SMS messages only
- May underperform on emails, social media text, or non-English content
- Not designed for multilingual datasets
- Inputs are truncated to 128 tokens, so a slight performance dip is expected on longer messages

---

**Training Details**

| Field          | Value                                  |
| -------------- | -------------------------------------- |
| **Base Model** | `distilbert-base-uncased`              |
| **Dataset**    | SMS Spam Collection (UCI)              |
| **Framework**  | PyTorch with Hugging Face Transformers |
| **Epochs**     | 3                                      |
| **Batch Size** | 16                                     |
| **Max Length** | 128 tokens                             |
| **Optimizer**  | AdamW                                  |
| **Loss**       | CrossEntropyLoss (sequence-level)      |
| **Device**     | CUDA-enabled GPU                       |
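
For readers reproducing the setup: the UCI release of the SMS Spam Collection is a plain-text file with one message per line, holding the label (`ham` or `spam`), a tab, and then the message text. The following is a minimal sketch of turning such lines into integer labels. It assumes the mapping ham = 0 and spam = 1, which matches the inference code in this card where class 1 is spam. The function name `parse_sms_lines` and the sample messages are illustrative and do not come from the original training script.

```python
from typing import Iterable, List, Tuple

# Assumed mapping: class 1 = spam, as in the card's inference code
LABEL_TO_ID = {"ham": 0, "spam": 1}

def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'label<TAB>message' lines into (message, label_id) pairs."""
    examples = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)  # split only on the first tab
        examples.append((text, LABEL_TO_ID[label]))
    return examples

# Made-up sample lines in the same format as the dataset file
sample = [
    "ham\tAre we still meeting at 6?\n",
    "spam\tFree entry! Text WIN to claim your prize\n",
]
examples = parse_sms_lines(sample)
print(examples)
```

With the real file, an open text-mode file handle can be passed in place of `sample`, since iterating over it yields one line at a time.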

---

**Evaluation Metrics**

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 0.99  |
| F1-Score  | 0.96  |
| Precision | 0.98  |
| Recall    | 0.93  |
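
F1 is the harmonic mean of precision and recall, so the table can be sanity-checked from its own rows. The sketch below uses the rounded values reported above. Because those inputs are rounded to two decimals, the result can differ from the reported F1 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall from the table
f1 = f1_score(0.98, 0.93)
print(f"{f1:.3f}")
```

The computed value is about 0.954, which is within rounding distance of the reported 0.96: unrounded precision and recall slightly above 0.98 and 0.93 would produce an F1 that rounds to 0.96.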

---

**Usage**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "AventIQ-AI/SMS-Spam-Detection-Model"

# The Auto classes pick the architecture from the repository's config.
# A sequence-classification head yields one prediction per message.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Run on GPU when available, otherwise on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_sms(text):
    """Classify a single SMS message as "spam" or "ham"."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)
    predicted = torch.argmax(logits, dim=1).item()
    return "spam" if predicted == 1 else "ham"

# Test example
print(predict_sms("You've won $1,000,000! Call now to claim your prize!"))
```
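
The function above returns only the arg-max label. When a false positive (a legitimate message flagged as spam) costs more or less than a missed spam message, it can help to convert the two logits into a spam probability and apply an adjustable threshold. The following is a minimal pure-Python sketch of that step, assuming the logits are ordered `[ham, spam]` as in the code above. The names `spam_probability` and `classify` and the example logit values are illustrative and not part of the model's API.

```python
import math
from typing import Sequence

def spam_probability(logits: Sequence[float]) -> float:
    """Softmax over [ham, spam] logits; returns the spam probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def classify(logits: Sequence[float], threshold: float = 0.5) -> str:
    """Label a message 'spam' when its spam probability reaches the threshold."""
    return "spam" if spam_probability(logits) >= threshold else "ham"

example_logits = [-1.2, 2.3]  # made-up values for illustration
p = spam_probability(example_logits)
print(round(p, 3), classify(example_logits), classify(example_logits, threshold=0.99))
```

With two classes, a threshold of 0.5 gives the same decision as arg-max. Raising the threshold flags fewer messages, which generally trades recall for precision.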

---

**Quantization**

- Post-training static quantization was applied with PyTorch to reduce model size and speed up inference on edge devices.
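
To make the idea concrete, the sketch below shows the affine int8 mapping that underlies this kind of quantization: floats are mapped to 8-bit integers through a scale and a zero point, then mapped back with a small rounding error. It is a pure-Python illustration of the arithmetic only, not the PyTorch workflow used for this model. The function names and sample weights are made up.

```python
from typing import List, Tuple

def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to the int8 range [-128, 127]."""
    # The represented range must include 0 so that 0.0 maps to an exact integer
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.62, -0.1, 0.0, 0.33, 0.91]  # made-up example weights
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), zp, round(max_error, 5))
```

Each weight is stored in one byte instead of four, which is where the size reduction comes from. The reconstruction error stays on the order of the scale.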

---

**Repository Structure**

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
```

---

**Contributing**

Improvements and feedback are welcome. Feel free to open an issue or submit a pull request if you find a bug or want to enhance the model.