---
language:
- en
pipeline_tag: text-classification
tags:
- sms-spam
- phishing-detection
- scam-detection
- security
metrics:
- f1
- accuracy
- precision
- recall
widget:
- text: "Your account is blocked! Verify immediately with OTP. Send money to scam@ybl using https://scam.xyz/"
example_title: "Bank KYC Scam"
- text: "Congratulations! You won Rs 50,000 lottery prize. Contact urgently to claim via link: http://bit.ly/claim"
example_title: "Lottery Scam"
- text: "Hey, are we still meeting for lunch tomorrow at 12?"
example_title: "Safe Message"
---
# SCAMBERT: DistilBERT for SMS Fraud & Scam Detection
SCAMBERT is a fine-tuned `distilbert-base-uncased` model specifically designed to detect social engineering, financial fraud, phishing, and scam payloads in SMS and short-form conversational text. It is built as Layer 3 of the AI Honeypot (CIPHER) Threat Intelligence Pipeline.
## Model Summary
- **Model Type:** Text Classification (Binary)
- **Base Model:** `distilbert-base-uncased`
- **Language:** English (en)
- **Task:** Spam/Scam Detection
- **License:** MIT
- **Size:** ~255 MB
### Labels
- `0`: Safe / Legitimate
- `1`: Scam / Fraud / Phishing
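Unless the model config defines an `id2label` mapping, the `pipeline` returns generic `LABEL_0`/`LABEL_1` identifiers. A minimal sketch for converting them to the classes documented above (the helper name and output format are illustrative):

```python
# Map the model's generic output labels to the documented classes:
# LABEL_0 -> Safe / Legitimate, LABEL_1 -> Scam / Fraud / Phishing.
ID2LABEL = {"LABEL_0": "safe", "LABEL_1": "scam"}

def readable(prediction: dict) -> str:
    """Convert a pipeline prediction like {'label': 'LABEL_1', 'score': 0.99}
    into a human-readable verdict string."""
    name = ID2LABEL[prediction["label"]]
    return f"{name} ({prediction['score']:.1%})"
```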
## Performance & Metrics
The model was fine-tuned on a dataset of **8,438** samples (27.5% Scam / 72.5% Safe). Due to class imbalance, class weights were applied during training.
### Calibration & Validation Results
- **Best Accuracy:** 99.41%
- **Best F1-Score:** 98.92%
- **Calibrated Precision:** 95.08%
- **Calibrated Recall:** 100.0%
- **Optimal Threshold:** `0.0028` (for high-recall environments)
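In high-recall deployments, the calibrated threshold is applied to the raw scam-class probability instead of the default 0.5 argmax cutoff. A minimal sketch of that decision logic (the function name is illustrative):

```python
# Calibrated decision threshold from the validation results above.
OPTIMAL_THRESHOLD = 0.0028

def is_scam(scam_probability: float, threshold: float = OPTIMAL_THRESHOLD) -> bool:
    """Flag a message as scam when the model's scam-class probability
    meets or exceeds the calibrated high-recall threshold."""
    return scam_probability >= threshold
```

Lowering the threshold this far trades precision for recall, which matches the calibrated figures above (100% recall at ~95% precision).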
### Robustness Evaluation
The model was tested against common bad-actor obfuscation tactics:
| Tactic | Example Input | Prediction Probability | Passed |
| :--- | :--- | :--- | :--- |
| **URL Obfuscation** | `Win $1000 fast! Click hxxp://scammy...` | 99.9% Scam | ✅ |
| **Numeric Substitution** | `W1NNER! Y0u have b33n select3d...` | 99.3% Scam | ✅ |
| **Mixed Case** | `cOnGrAtUlAtIoNs, yOu WoN a FrEe...` | 89.8% Scam | ✅ |
*Note: By design, the model can struggle with extremely short, contextless messages (e.g., "Call me now"); earlier heuristic layers in the pipeline are expected to supply that context.*
## Usage
You can use this model directly with Hugging Face's `pipeline`:
```python
from transformers import pipeline
# Load the pipeline
classifier = pipeline("text-classification", model="Digvijay05/SCAMBERT")
# Inference
text = "Earn Rs 5000 daily income from home part time. Click this link: http://bit.ly/job"
result = classifier(text)
print(result)
# [{'label': 'LABEL_1', 'score': 0.99...}]
```
Or run via the **Inference API**:
```python
import httpx

API_URL = "https://api-inference.huggingface.co/models/Digvijay05/SCAMBERT"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # your Hugging Face access token

# Send the text to classify as the "inputs" payload
response = httpx.post(
    API_URL,
    headers=headers,
    json={"inputs": "Your account is locked. Verify at bit.ly/secure"},
)
print(response.json())
```
## Deployment Considerations
- **CPU Latency Estimate:** ~10-30ms / sequence
- **GPU Latency Estimate:** ~2-5ms / sequence
- **Recommendation:** Can be efficiently hosted on serverless CPU environments (like Render Free Tier) using Hugging Face's Inference API, or deployed natively if 512MB+ RAM is available. ONNX quantization is recommended for edge deployments.
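For the ONNX path, Hugging Face's `optimum` tooling can export and quantize the model. This is a sketch, not a tested deployment recipe: the output directories are illustrative, and CLI flags may vary between `optimum` versions.

```shell
# Sketch: export SCAMBERT to ONNX, then apply dynamic quantization.
# Requires `pip install optimum[onnxruntime]`.
optimum-cli export onnx --model Digvijay05/SCAMBERT scambert_onnx/
optimum-cli onnxruntime quantize --onnx_model scambert_onnx/ --avx2 -o scambert_onnx_quantized/
```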
## Intended Use
This model is designed as a *semantic booster*, a tie-breaking layer within a multi-layered classification engine. It excels at detecting complex sentence structure, urgency cues, and manipulative context that standard regex/heuristic rules might miss.