# 🛡️ SilverGuard – Indian SMS Scam Detector

SilverGuard is a fine-tuned MobileBERT model exported to ONNX for detecting scam and phishing SMS messages in the Indian context. It outputs a continuous threat score (0.0–1.0) and is designed for real-time inference on mobile and edge devices via onnxruntime.
## 🧠 Model Overview
| Property | Details |
|---|---|
| Base Model | google/mobilebert-uncased (24M params) |
| Task | Binary SMS classification – Ham (0) vs Scam (1) |
| Output | `threat_score` – softmax scam probability [0.0–1.0] |
| Export Format | ONNX (opset 14), single self-contained file |
| Max Sequence Length | 128 tokens |
| Target Deployment | onnxruntime, onnxruntime_flutter |
## 📦 Files Included
| File | Description |
|---|---|
| `silver_guard.onnx` | Self-contained ONNX model (~90 MB) |
| `vocab.txt` | MobileBERT WordPiece vocabulary (~230 KB) |
| `model_config.json` | Inference configuration (max_length, labels, input format) |
## 📊 Training Data
~18,000 messages across four sources:
| Source | Type | Count |
|---|---|---|
| UCI SMS Spam Collection | Kaggle | ~5,500 |
| India SMS Spam Classification | Kaggle | ~2,200 |
| Synthetic Indian Scam Templates | Generated | ~5,200 |
| Personal SMS (Phone Link export) | User Data | ~5,100 |
Split: 80% train / 10% validation / 10% test (stratified)
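The stratified 80/10/10 split can be sketched in plain Python (an illustrative sketch only; the actual split tooling is not part of this release, and `stratified_split` is a hypothetical helper name):

```python
import random

def stratified_split(samples, train=0.8, val=0.1, seed=42):
    """Split (text, label) pairs 80/10/10 while preserving label ratios."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    tr, va, te = [], [], []
    for group in by_label.values():
        rng.shuffle(group)                 # shuffle within each label group
        n_tr = int(len(group) * train)
        n_va = int(len(group) * val)
        tr += group[:n_tr]
        va += group[n_tr:n_tr + n_va]
        te += group[n_tr + n_va:]
    return tr, va, te

data = [(f"msg{i}", i % 2) for i in range(100)]  # toy set: 50 ham / 50 scam
train_set, val_set, test_set = stratified_split(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling within each label group before slicing keeps the ham/scam ratio identical across the three splits.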
## 🏷️ Input Format – TRAI DLT Header System

All inputs follow the format:

```
HEADER [SEP] message body
```

The HEADER is the TRAI DLT sender ID or phone number – a critical signal for scam detection:
| Header Type | Example | Meaning |
|---|---|---|
| Registered DLT (bank) | `JD-SBINOT` | ✅ Legitimate transactional |
| Registered DLT (govt) | `DL-UIDAIG` | ✅ Legitimate government |
| Raw phone number | `+919876543210` | 🚨 Strong scam indicator |
| Gibberish / spoofed | `VM-URGENT`, `XX-WINNER` | 🚨 Scam indicator |

DLT suffix convention: G = Government · T = Transactional · S = Service · P = Promotional
If no sender ID is available (e.g. personal messages), pass the message body directly without a header.
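As a sketch, the header-plus-body convention and a shape-based triage of the sender ID can be wrapped in small helpers (the `format_input`/`header_kind` names and the header regex are illustrative assumptions, not part of the released `model_config.json`; note that a regex can only check the *shape* of a DLT header, not actual registration):

```python
import re

# Registered TRAI DLT headers follow an XX-YYYYYY shape: a two-letter
# prefix, a hyphen, then a six-character sender ID whose last letter
# encodes the category (G/T/S/P).  Illustrative pattern, shape check only.
DLT_HEADER = re.compile(r"^[A-Z]{2}-[A-Z0-9]{5}[GTSP]$")

def format_input(header: str, body: str) -> str:
    """Build the 'HEADER [SEP] body' string the model expects.

    Personal messages with no sender ID are passed through unchanged."""
    if not header:
        return body
    return f"{header} [SEP] {body}"

def header_kind(header: str) -> str:
    """Rough triage of the sender ID (heuristic, for display only)."""
    if DLT_HEADER.match(header):
        return "registered-dlt"
    if re.match(r"^\+?\d{10,13}$", header):
        return "raw-phone-number"
    return "unregistered"

print(format_input("JD-SBINOT", "Your a/c credited"))  # JD-SBINOT [SEP] Your a/c credited
print(header_kind("+919876543210"))                    # raw-phone-number
```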
## 🚨 Scam Categories Covered

The model was trained on 11 Indian scam archetypes:

- Digital Arrest – CBI/Police impersonation, Aadhaar fraud threats
- Bank Freeze / KYC – Account suspension, fake KYC update links
- OTP Fraud – "Share OTP to cancel/confirm transaction"
- Link Phishing – Fake rewards, government subsidies, shortened URLs
- Lottery / Prize – KBC, WhatsApp lucky draw, Google Annual Prize
- Job Scam – Work from home, Telegram task jobs
- Parcel / Courier – Fake customs duty, contraband claims
- Insurance / Loan – Pre-approved loan fraud, LIC bonus scams
- Government Impersonation – TRAI, RBI, EPFO, Income Tax fake notices
- Investment Scam – Crypto, Forex, guaranteed returns
- Utility Scam – Fake electricity/gas disconnection threats
## ⚙️ Usage

### Python (ONNX Runtime)
```python
import onnxruntime as ort
import numpy as np
import unicodedata
import re

# ── Minimal tokenizer (closely approximates google/mobilebert-uncased) ───────
class BertTokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {line.rstrip("\n"): i for i, line in enumerate(f)}
        self.cls_id = self.vocab.get("[CLS]", 101)
        self.sep_id = self.vocab.get("[SEP]", 102)
        self.pad_id = self.vocab.get("[PAD]", 0)
        self.unk_id = self.vocab.get("[UNK]", 100)

    def _normalize(self, text):
        # Lowercase, then strip combining accents (NFD + drop Mn),
        # as BERT's uncased basic tokenizer does.
        text = unicodedata.normalize("NFD", text.lower())
        return "".join(c for c in text if unicodedata.category(c) != "Mn")

    def _wordpiece(self, word):
        ids, start = [], 0
        while start < len(word):
            end, cur = len(word), None
            while start < end:  # greedy longest-match-first
                sub = ("##" if start > 0 else "") + word[start:end]
                if sub in self.vocab:
                    cur = self.vocab[sub]
                    break
                end -= 1
            if cur is None:
                return [self.unk_id]
            ids.append(cur)
            start = end
        return ids

    def encode(self, text_a, text_b=None, max_length=128):
        ids_a = [wp for tok in re.findall(r"\w+|[^\w\s]", self._normalize(text_a))
                 for wp in self._wordpiece(tok)]
        ids_b = None
        if text_b:
            ids_b = [wp for tok in re.findall(r"\w+|[^\w\s]", self._normalize(text_b))
                     for wp in self._wordpiece(tok)]
        budget = max_length - (3 if ids_b else 2)  # room for [CLS] and [SEP]s
        if ids_b:
            while len(ids_a) + len(ids_b) > budget:
                (ids_a if len(ids_a) >= len(ids_b) else ids_b).pop()
        else:
            ids_a = ids_a[:budget]
        tokens = [self.cls_id] + ids_a + [self.sep_id]
        if ids_b:
            tokens += ids_b + [self.sep_id]
        mask = [1] * len(tokens)
        pad = max_length - len(tokens)
        tokens += [self.pad_id] * pad
        mask += [0] * pad
        return tokens, mask

# ── Inference ────────────────────────────────────────────────────────────────
session = ort.InferenceSession("model/silver_guard.onnx")
tokenizer = BertTokenizer("model/vocab.txt")

def predict(header: str, message: str) -> float:
    """Returns threat score 0.0 (safe) → 1.0 (scam)."""
    ids, mask = tokenizer.encode(header, message) if header else tokenizer.encode(message)
    result = session.run(None, {
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.array([mask], dtype=np.int64),
    })
    return float(result[0][0][0])  # already softmaxed – do NOT apply softmax again

# Examples
print(predict("+919876543210", "Your Aadhaar is linked to money laundering. Call CBI now."))  # → ~1.0
print(predict("JD-SBINOT", "Your a/c XX5678 credited with Rs 25,000 by NEFT. -SBI"))          # → ~0.0
print(predict("", "Hey, are we meeting for dinner tonight?"))                                 # → ~0.0
```
### Verdict Thresholds

| Score | Verdict |
|---|---|
| ≥ 0.80 | 🚨 HIGH RISK SCAM |
| ≥ 0.55 | ⚠️ LIKELY SCAM |
| ≥ 0.40 | 🟡 BORDERLINE |
| < 0.40 | ✅ SAFE (HAM) |
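The threshold table maps directly onto a small helper (a sketch; the `verdict` function and its return strings are illustrative, not part of the model files):

```python
def verdict(score: float) -> str:
    """Map a threat score to the verdict bands from the table above."""
    if score >= 0.80:
        return "HIGH RISK SCAM"
    if score >= 0.55:
        return "LIKELY SCAM"
    if score >= 0.40:
        return "BORDERLINE"
    return "SAFE (HAM)"

print(verdict(0.97))  # HIGH RISK SCAM
print(verdict(0.42))  # BORDERLINE
```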
### Flutter / Dart

```yaml
# pubspec.yaml
flutter:
  assets:
    - assets/silver_guard.onnx
    - assets/vocab.txt
    - assets/model_config.json

dependencies:
  onnxruntime_flutter: ^1.0.0
```

```dart
final session = await OrtSession.fromAsset('assets/silver_guard.onnx');
// Combine as: "$senderHeader [SEP] $messageBody"
// Tokenize with WordPiece → pad to 128 → run session
// Output: threat_score [0.0 – 1.0]
```
## 🔧 ONNX Architecture

```
Input:  input_ids       [batch, 128] int64
        attention_mask  [batch, 128] int64
            ↓
MobileBERT Encoder (24 transformer layers)
            ↓
Classification Head (linear → 2 logits)
            ↓
Softmax → scam probability
            ↓
Output: threat_score    [batch, 1] float32
```
**Important:** The output is already softmaxed. Do not apply softmax again.
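The reason this matters: softmax over a one-element vector always returns 1.0, so re-applying it to the scalar `threat_score` would make every message look like a certain scam. A quick demonstration:

```python
import math

def softmax(xs):
    """Standard softmax over a list of floats."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

threat_score = 0.03  # model output for a clearly safe message
print(softmax([threat_score]))  # [1.0] -- the real score is wiped out
```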
## 🏋️ Training Configuration
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Batch Size | 32 |
| Max Epochs | 4 |
| Early Stopping | Patience = 2 |
| Gradient Clipping | max_norm = 1.0 |
| Warmup | 10% of total steps |
| Runtime | Google Colab T4 GPU |
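For reference, the warmup schedule implied by the table can be sketched as a plain function (illustrative only; the training script itself is not included in this repo, and the linear decay after warmup is an assumption):

```python
def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup over the first 10% of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

total = 1000
print(lr_at(50, total))   # mid-warmup: half the base LR
print(lr_at(100, total))  # warmup end: full base LR
```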
## ⚠️ Limitations

- Optimized for Indian SMS – performance on non-Indian content may vary
- Scam tactics evolve rapidly; periodic retraining is recommended
- Short messages without a header (<10 tokens) may be less reliable
- Model does not analyze embedded URLs for malicious content
## 📄 License

MIT – free for commercial and personal use.