📌 Overview

XLM-Prohori-v2 is a fine-tuned XLM-RoBERTa-base model for detecting smishing (SMS phishing) in Bangla and English text messages.
It classifies each SMS into one of three categories:

normal → Casual, harmless, informational texts
promo → Promotional/advertising messages
smish → Smishing (phishing via SMS) attempts

📊 Dataset

Total samples (after deduplication): 4,507
Languages: Bangla, English, Banglish
Labels: balanced across normal, promo, smish
Preprocessing: All URLs normalized to [LINK]; duplicates removed; stratified train/val/test split
Splits: Train=3064, Val=541, Test=902 (verified zero overlap)
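
The preprocessing steps listed above could look roughly like the following sketch. It is not the authors' pipeline: the URL regex, the helper names (`normalize_links`, `deduplicate`, `stratified_split`), and the sample rows are all hypothetical, and the split fractions are only approximated from the reported split sizes (3064/4507 ≈ 0.68, 541/4507 ≈ 0.12, 902/4507 ≈ 0.20).

```python
import random
import re
from collections import defaultdict

# Hypothetical URL pattern; the card does not state the exact rule that was used.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def normalize_links(text):
    """Replace every URL-like token with the literal placeholder [LINK]."""
    return URL_RE.sub("[LINK]", text)

def deduplicate(rows):
    """Keep the first (text, label) row for each distinct normalized text."""
    seen, unique = set(), []
    for text, label in rows:
        key = normalize_links(text).strip()
        if key not in seen:
            seen.add(key)
            unique.append((key, label))
    return unique

def stratified_split(rows, fractions=(0.68, 0.12, 0.20), seed=0):
    """Split rows into train/val/test while keeping per-label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Made-up rows for illustration only.
rows = [
    ("Win a prize at http://example.com/win now", "smish"),
    ("Win a prize at http://example.org/abc now", "smish"),  # duplicate once normalized
    ("See you at 5pm", "normal"),
]
print(deduplicate(rows))
```

Because the training texts had URLs replaced with [LINK], applying the same normalization to inputs at inference time is likely to match the training distribution more closely.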

The raw dataset is not publicly released for privacy reasons. Some synthetic smish examples were included to balance classes.

📈 Performance

Validation Accuracy: ~97.60%
Test Accuracy: ~97.23%

Confusion matrices indicate generally balanced performance, with minor confusion between promo and smish in link-heavy texts.
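
For readers who want to reproduce this kind of analysis on their own labeled data, a confusion matrix and accuracy can be computed as in the minimal sketch below. The label lists are toy values made up for illustration and are not the model's actual predictions; the helper names are hypothetical.

```python
from collections import Counter

CLASS_NAMES = ["normal", "promo", "smish"]

def confusion_matrix(y_true, y_pred, labels=CLASS_NAMES):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only; NOT results from XLM-Prohori-v2.
y_true = ["normal", "promo", "promo", "smish", "smish", "smish"]
y_pred = ["normal", "promo", "smish", "smish", "promo", "smish"]

for name, row in zip(CLASS_NAMES, confusion_matrix(y_true, y_pred)):
    print(f"{name:>6}: {row}")
print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
```

Off-diagonal counts in the promo and smish rows are where the promo/smish confusion described above would show up.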

🚀 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch

model_id = "squadgoals404/XLM-Prohori-v2"
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)

# Attach human-readable class names to the loaded config (in memory only)
CLASS_NAMES = ["normal", "promo", "smish"]  # order must match the model's label ids 0, 1, 2
mdl.config.id2label = {i: c for i, c in enumerate(CLASS_NAMES)}
mdl.config.label2id = {c: i for i, c in enumerate(CLASS_NAMES)}

text = "Bank Account temporarily locked—identity verify করতে জরুরি কল 017XX-XXXXXX"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    probs = F.softmax(mdl(**inputs).logits, dim=-1).squeeze().tolist()

print({CLASS_NAMES[i]: round(p, 4) for i, p in enumerate(probs)})
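
The snippet above prints one probability per class. To reduce that to a single decision, the highest-scoring class can be picked as in this small sketch. The helper name `top_label` is hypothetical, the class order is assumed to be the same normal, promo, smish order used above, and the probabilities are made up rather than real model output.

```python
def top_label(probs, class_names=("normal", "promo", "smish")):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Made-up probabilities for illustration; not real model output.
example_probs = [0.02, 0.08, 0.90]
print(top_label(example_probs))
```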
