HinSpam — Hinglish Spam & Ham Classifier

HinSpam is a binary text classifier trained to detect spam messages in Hinglish — the code-mixed Hindi-English variety dominant across Indian SMS, WhatsApp, and online messaging platforms. It distinguishes between spam (unsolicited promotional, scam, or phishing messages) and ham (genuine, organic conversation), with strong performance on the kinds of deceptive, urgent-language spam that targets Indian mobile users.

Model Description

Property	Details
Task	Binary Text Classification
Label 0	`ham` — legitimate, non-spam message
Label 1	`spam` — unsolicited, scam, or phishing message
Language	Hinglish (Hindi-English code-mixed, Roman script)
Domain	SMS, WhatsApp, online messaging
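A minimal inference sketch, assuming the checkpoint loads through the standard `transformers` text-classification pipeline and that the repository id is `Keshav0av/HinSpam` (taken from the citation URL on this card). The card does not say how the checkpoint names its labels, so the hypothetical helper `normalize_label` accepts either generic names (`LABEL_0`, `LABEL_1`) or readable ones (`ham`, `spam`).

```python
from typing import Callable, Dict, List

# Assumption: repo id inferred from the citation URL on this card.
MODEL_ID = "Keshav0av/HinSpam"

# Label ids as listed in the Model Description table.
ID2LABEL = {0: "ham", 1: "spam"}


def normalize_label(raw: str) -> str:
    """Map a pipeline label string to 'ham' or 'spam'.

    Assumption: the checkpoint exposes either LABEL_<id> or readable names.
    """
    raw = raw.strip()
    if raw.upper().startswith("LABEL_"):
        return ID2LABEL[int(raw.split("_")[1])]
    return raw.lower()


def load_classifier():
    """Build the pipeline; downloads the weights on first call."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=MODEL_ID)


def classify(texts: List[str], clf: Callable) -> List[Dict]:
    """Run a classifier over texts and return one record per message."""
    results = clf(texts)
    return [
        {"text": t, "label": normalize_label(r["label"]), "score": float(r["score"])}
        for t, r in zip(texts, results)
    ]


# Usage (needs `pip install transformers torch` and network access):
#   clf = load_classifier()
#   for row in classify(["Bhai kal milte hain", "apna otp forward kre"], clf):
#       print(row)
```

Passing the classifier in as an argument keeps `classify` easy to test with a stub and lets the same code run against a locally cached copy of the model.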

Key Features

📵 Hinglish-Native Spam Detection

Most spam classifiers are trained on English or pure Hindi corpora and often struggle with the code-mixed reality of Indian messaging. HinSpam is trained natively on Hinglish, capturing the mixed-language patterns used both by real users and by scammers targeting Indian audiences.

🎣 Phishing & Scam Pattern Recognition

The model is specifically tuned to catch common Indian spam tropes:

Fake lottery and prize claims — e.g., "50,000 rupey jeete hain, abhi claim karein"
OTP forwarding scams — requests to share OTPs with "managers" or "bank officials"
Malicious links — suspicious URLs embedded in prize or account-alert messages
Urgency language — pressure tactics like "abhi click karein" or "sirf aaj ke liye"
Impersonation — messages posing as banks, government schemes, or delivery services
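To make the surface cues in this list concrete, here is a small keyword and regex baseline. It is not how HinSpam works (HinSpam is a learned classifier); the pattern names and word lists are illustrative assumptions, and impersonation is left out because it cannot be captured by a short keyword list.

```python
import re
from typing import List

# Hypothetical cue patterns, chosen only to mirror the tropes listed above.
CUES = {
    "otp_request": re.compile(r"\botp\b", re.IGNORECASE),
    "link": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    "urgency": re.compile(r"\b(?:abhi|sirf aaj)\b", re.IGNORECASE),
    "prize": re.compile(r"\b(?:lottery|winner|jeete|claim)\b", re.IGNORECASE),
}


def spam_cues(message: str) -> List[str]:
    """Return the sorted names of all cue patterns found in the message."""
    return sorted(name for name, pattern in CUES.items() if pattern.search(message))


if __name__ == "__main__":
    for msg in [
        "50,000 rupey jeete hain, abhi claim karein",
        "Tera naya job kaisa chal raha hai boss kaisa hai wahan ka theek hai na.",
    ]:
        print(spam_cues(msg), "<-", msg)
```

A baseline like this over-fires on ordinary chat (for example, "abhi" is a common word in genuine conversation), which is the kind of false positive a model trained on Hinglish ham is meant to avoid.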

💬 Low False-Positive Rate on Casual Hinglish

Everyday Hinglish conversation — complaints, plans, banter — is often informal and noisy. HinSpam maintains high precision on ham, avoiding the over-flagging that plagues generic spam filters on non-English text.

Performance Metrics

Evaluated on a held-out test set of Hinglish messages:

Metric	Score
Accuracy	0.9530
F1 Score	0.9490
Precision	0.9340
Recall	0.9650
MCC (Matthews Correlation Coefficient)	0.9060

Recall = 0.9650: the model catches about 96.5% of the spam in the test set, making it effective as a first-pass filter in messaging pipelines.

MCC = 0.9060: a balanced metric that takes all four confusion-matrix cells into account, indicating reliable classification across both classes even under mild label imbalance.
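For readers who want to reproduce these metrics on their own predictions, the sketch below computes them from confusion-matrix counts. It assumes spam (label 1) is the positive class, which the card does not state explicitly. The function names are illustrative.

```python
import math
from typing import Dict


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts.

    tp/fn count spam messages (caught / missed); fp/tn count ham messages
    (wrongly flagged / correctly passed).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC is undefined when any marginal is empty; report 0.0 in that case.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "mcc": mcc,
    }
```

Plugging the reported precision (0.9340) and recall (0.9650) into `f1_from` gives roughly 0.949, in line with the F1 score in the table.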

Intended Use

HinSpam is designed for:

SMS and messaging spam filters for Indian telecom and app platforms
WhatsApp / chat moderation tools targeting scam and phishing messages in Hinglish
Financial fraud prevention — catching OTP-forwarding and fake prize scams before they reach users
Research on code-mixed spam detection and NLP for Indian languages

Training Data

The model was trained on a curated dataset of Hinglish messages covering a wide range of spam and ham examples:

Label 0 (ham): Casual conversation, personal messages, everyday Hinglish banter
Label 1 (spam): Lottery scams, OTP phishing, prize claim fraud, malicious links, fake bank alerts
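The label scheme can be applied to raw examples with a few lines of standard-library Python. This sketch assumes the quoted-text, comma, label layout used for the example data points on this card; the actual file format of the training set is not documented here, and `parse_examples` is a hypothetical helper.

```python
import csv
import io
from typing import List, Tuple

# Label ids as described on this card.
LABEL2ID = {"ham": 0, "spam": 1}

# Two rows copied from the example data points on this card.
SAMPLE = '''"Bhai tu kab sudhrega hamesha late aane ki aadat hai teri toh pakki.", ham
"500000 jitne ke liye is link par click kre: www.maha-winner-india.in", spam
'''


def parse_examples(raw: str) -> List[Tuple[str, int]]:
    """Parse lines of the form `"text", label` into (text, label_id) pairs."""
    # skipinitialspace drops the space that follows the comma before the label.
    reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        text, label = fields
        rows.append((text, LABEL2ID[label.strip()]))
    return rows


if __name__ == "__main__":
    for text, label_id in parse_examples(SAMPLE):
        print(label_id, text)
```

An unknown label raises a `KeyError`, which surfaces mislabeled rows early instead of silently mapping them to a default class.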

Example Data Points

"Tera naya job kaisa chal raha hai boss kaisa hai wahan ka theek hai na.", ham
"Aapke account mein scratch winner se jeete hue 50000 rupey credit krene ke liye apna otp diye gaye number par forward kre", spam
"Bhai tu kab sudhrega hamesha late aane ki aadat hai teri toh pakki.", ham
"500000 jitne ke liye is link par click kre: www.maha-winner-india.in", spam
"Aaj bahut bore ho raha hu yaar kuch karne ka plan bata theek sa.", ham
"Aapke account mein lottery se jeete hue 100000 rupey credit krene ke liye apna otp manager ko forward kre aur claim karein", spam

Related Models

HinTox — Hinglish hate speech and abusive language detector with obfuscation robustness (leetspeak, misspellings)

Citation

If you use HinSpam in your research or product, please cite:

@misc{hinspam2025,
  title     = {HinSpam: Hinglish Spam Detection for Code-Mixed Indian Messaging},
  year      = {2026},
  note      = {HuggingFace Model Hub},
  url       = {https://huggingface.co/Keshav0av/HinSpam}
}

Contact

For questions, feedback, or collaboration, open an issue on the model repository or reach out via HuggingFace.

Weights format	Safetensors
Model size	0.1B params
Tensor type	F32