metadata
language:
- en
- fr
- ru
- zh
- ja
- pl
- pt
- nl
- ar
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
library_name: transformers
tags:
- spam
- email
- phone
- chat
widget:
- text: >-
    GET 100 BTC right now! Just send 1 USD to metaservices@gmail.com via
    PayPal
  example_title: Spam
metrics:
- f1
🛡️ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)
This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.
It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).
It is a fast encoder-only model, trained by distillation from a 14B teacher model on a mixed dataset of 20M records covering emails, support chats, and messaging threads.
✨ Features
- Encoder-only architecture → returns a classification score for each label in a single forward pass
- Multi-turn support → handles conversation history and context windows
- Hybrid input domain → optimized for both chat messages & email bodies
- High-throughput → suitable for millions of messages/day
- Ideal for security filters (spam, scams, phishing, self-promotion content)
- Open-source and deployable anywhere (CPU or GPU)
🔧 Model Architecture
Type: Encoder-only (XLM-RoBERTa Large)
Input format:
[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
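The input format above could be assembled as in the following sketch. It assumes the bracketed items are placeholders for raw text rather than literal tokens, and that turns are joined oldest first with a single space. The model card does not specify the separator or a turn limit, so `sep`, `max_turns`, and the helper name `build_input` are hypothetical.

```python
from typing import Sequence


def build_input(
    context: Sequence[str],
    message: str,
    max_turns: int = 4,   # assumption: how many past turns to keep
    sep: str = " ",       # assumption: separator between turns
) -> str:
    """Join the most recent context turns and the user message, oldest first."""
    # Keep only the last `max_turns` turns; guard against the [-0:] slice,
    # which would otherwise return the whole list.
    recent = list(context[-max_turns:]) if max_turns > 0 else []
    # Drop empty or whitespace-only turns and trim the rest.
    turns = [t.strip() for t in recent if t.strip()]
    return sep.join([*turns, message.strip()])
```

For example, `build_input(["Hi, is this support?", "Yes, how can I help?"], "Click here to win a prize")` yields one string with the two context turns followed by the message to classify.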
Labels include:
- spam
- regular (ham)
- marketing
- gibberish
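Turning per-label scores into a filtering decision might look like the sketch below. It assumes the scores arrive as a list of `{"label", "score"}` dicts, the shape the `transformers` text-classification pipeline typically returns when asked for all labels, and that the label strings are exactly `spam`, `regular`, `marketing`, and `gibberish`. Neither the label strings nor the `spam_threshold` value comes from the model card, and `decide` is a hypothetical helper.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def decide(scores: List[Score], spam_threshold: float = 0.8) -> str:
    """Pick a label, accepting 'spam' only when its score clears the threshold."""
    if not scores:
        raise ValueError("scores must contain at least one entry")
    # Rank labels from most to least likely.
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    top = ranked[0]
    # Assumption: a low-confidence spam prediction falls back to the runner-up
    # label, trading some recall for fewer false positives.
    if top["label"] == "spam" and top["score"] < spam_threshold and len(ranked) > 1:
        return str(ranked[1]["label"])
    return str(top["label"])
```

Raising `spam_threshold` makes the filter more conservative; lowering it to `0.0` reduces the function to a plain argmax over the labels.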
Benchmark
- F1 Spam: 0.90
- F1 Regular: 0.95
- F1 Marketing: 0.87
- F1 Gibberish: 0.94
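To check per-label F1 on your own labelled data, a minimal pure-Python sketch follows. It uses the standard definition F1 = 2·TP / (2·TP + FP + FN) for each label. The evaluation set and procedure behind the numbers above are not described in this card, so this is not a reproduction of that benchmark, and `per_label_f1` is a hypothetical helper.

```python
from typing import Dict, Sequence


def per_label_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute F1 for every label that appears in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    result: Dict[str, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # denom is never zero for labels drawn from the data; guard kept for safety.
        result[label] = 2 * tp / denom if denom else 0.0
    return result
```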
While this model is not perfect, it catches spam quickly and performs much better than Bayesian filters.