XLM-Prohori-v2: Bangla/English SMS Smishing Classifier
Repository: squadgoals404/XLM-Prohori-v2
Base Model: xlm-roberta-base
Overview
XLM-Prohori-v2 is a fine-tuned XLM-RoBERTa-base model for detecting smishing (SMS phishing) in Bangla and English.
It classifies each SMS into one of three categories:
- normal: Casual, harmless, informational texts
- promo: Promotional/advertising messages
- smish: Smishing (phishing via SMS) attempts
Dataset
- Total samples (after deduplication): 4,507
- Languages: Bangla, English, Banglish
- Labels: balanced across normal, promo, smish
- Preprocessing: All URLs normalized to [LINK]; duplicates removed; stratified train/val/test split
- Splits: Train=3064, Val=541, Test=902 (verified zero overlap)
The raw dataset is not publicly released for privacy reasons. Some synthetic smish examples were included to balance classes.
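The preprocessing steps listed above (URL normalization, deduplication, stratified splitting) could look roughly like the following. This is a minimal sketch, not the authors' actual pipeline: the URL regex, the whitespace/case-insensitive duplicate key, the seed, and the 68/12/20 fractions (approximated from the reported split sizes) are all assumptions.

```python
import random
import re
from collections import defaultdict

# Assumed URL pattern; the card does not say how URLs were detected.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def normalize_urls(text):
    """Replace every URL-like token with the [LINK] placeholder."""
    return URL_RE.sub("[LINK]", text)

def deduplicate(rows):
    """Keep the first occurrence of each text (whitespace/case-insensitive key)."""
    seen, unique = set(), []
    for text, label in rows:
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique

def stratified_split(rows, fractions=(0.68, 0.12, 0.20), seed=42):
    """Split (text, label) rows into train/val/test, label by label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy rows only; the real dataset is not public.
rows = [(f"msg {label} {i} http://example.com/{i}", label)
        for label in ("normal", "promo", "smish") for i in range(100)]
rows = deduplicate([(normalize_urls(t), y) for t, y in rows])
train, val, test = stratified_split(rows)
```

Because the model was trained on text with URLs replaced by [LINK], applying the same normalization to messages at inference time is likely a sensible default.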
Performance
- Validation Accuracy: ~97.60%
- Test Accuracy: ~97.23%
Confusion matrices indicate generally balanced performance, with minor confusion between promo and smish in link-heavy texts.
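For readers who want to reproduce this kind of evaluation on their own labeled messages, the sketch below computes a confusion matrix and accuracy in plain Python. The toy labels are invented for illustration and are not the model's actual predictions; the function names are hypothetical.

```python
CLASS_NAMES = ("normal", "promo", "smish")

def confusion_matrix(y_true, y_pred, labels=CLASS_NAMES):
    """Rows are true labels, columns are predicted labels."""
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

# Toy example: one promo message is mistaken for smish.
y_true = ["normal", "promo", "smish", "smish", "promo", "normal"]
y_pred = ["normal", "smish", "smish", "smish", "promo", "normal"]
cm = confusion_matrix(y_true, y_pred)
print(cm, round(accuracy(cm), 4))
```

The off-diagonal cell at row promo, column smish is where the promo/smish confusion described above would show up.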
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch
model_id = "squadgoals404/XLM-Prohori-v2"
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)
# Force semantic class names in-memory
CLASS_NAMES = ["normal", "promo", "smish"] # make sure the order matches 0,1,2
mdl.config.id2label = {i: c for i, c in enumerate(CLASS_NAMES)}
mdl.config.label2id = {c: i for i, c in enumerate(CLASS_NAMES)}
text = "Bank Account temporarily locked - identity verify করতে জরুরি কল 017XX-XXXXXX"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    probs = F.softmax(mdl(**inputs).logits, dim=-1).squeeze().tolist()
print({CLASS_NAMES[i]: round(p, 4) for i, p in enumerate(probs)})
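To score several messages at once, the same calls can be wrapped in a small batch helper. This is a sketch under assumptions: `classify_batch`, `top_label`, the `max_length=128` cap, the `min_conf=0.5` threshold, and the "uncertain" fallback are hypothetical choices for illustration, not part of the released model.

```python
from typing import Dict, List, Sequence

CLASS_NAMES = ["normal", "promo", "smish"]  # assumed to match label ids 0, 1, 2

def top_label(probs: Sequence[float], min_conf: float = 0.5) -> str:
    """Return the most probable class name, or "uncertain" below min_conf."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CLASS_NAMES[best] if probs[best] >= min_conf else "uncertain"

def classify_batch(texts: List[str], tok, mdl, max_length: int = 128) -> List[Dict[str, float]]:
    """Score a list of messages in one forward pass; returns per-class probabilities."""
    import torch  # imported lazily so top_label stays usable without torch

    # Padding and truncation let messages of different lengths share one tensor.
    inputs = tok(texts, return_tensors="pt", padding=True,
                 truncation=True, max_length=max_length)
    with torch.no_grad():
        probs = torch.softmax(mdl(**inputs).logits, dim=-1)
    return [dict(zip(CLASS_NAMES, row)) for row in probs.tolist()]
```

With `tok` and `mdl` loaded as in the snippet above, `classify_batch(["msg one", "msg two"], tok, mdl)` returns one probability dictionary per message, and `top_label(list(d.values()))` turns each into a single label.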