baptistejamin/xlm-roberta-large-spam_v4

Text Classification

text-embeddings-inference

baptistejamin committed on Nov 19, 2025

Commit 3778541 · verified · 1 Parent(s): e7cd1e3

Update README.md

Files changed (1)

README.md +53 -1

@@ -23,4 +23,56 @@ widget:
   example_title: "Spam"
 metrics:
 - f1
----
+---
+# 🛡️ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)
+This repository provides a **lightweight, encoder-only multi-turn classifier** designed to detect spam and unwanted content across **emails** and **chat conversations**.
+It supports short and long messages, as well as **multi-turn conversational inputs** (metadata + message).
+It was trained on a mixed dataset of emails, support chats, and messaging threads.
+It is a fast encoder-only model, distilled from a 14B teacher model on a dataset of 20M records.
+---
+## ✨ Features
+- **Encoder-only architecture** → returns a score for each label
+- **Multi-turn support** → handles conversation history and context windows
+- **Hybrid input domain** → optimized for both chat messages & email bodies
+- **High-throughput** → suitable for millions of messages/day
+- **Ideal for security filters** (spam, scams, phishing, self-promotion content)
+- **Open-source** and deployable anywhere (CPU or GPU)
+---
+## 🔧 Model Architecture
+- **Type:** Encoder-only (XLM-RoBERTa Large)
+- **Input format:**
+[CONTEXT 1]
+[CONTEXT 2]
+...
+[USER MESSAGE]
+Labels include:
+- `spam`
+- `regular` (ham)
+- `marketing`
+- `gibberish`
+---
+## Benchmark
+- F1 Spam: 0.90
+- F1 Regular: 0.95
+- F1 Marketing: 0.87
+- F1 Gibberish: 0.94
+While this model is not perfect, it catches spam quickly and performs much better than Bayesian filters.
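
The input format described in the README (context turns first, then the user message) can be sketched as a small helper. This is a minimal illustration, not the author's preprocessing code: the function names `build_input` and `top_label`, the newline separator, and the index order of the labels are all assumptions, and the scores in the example are made up for illustration rather than produced by the model.

```python
from typing import Sequence

# Label names come from the README; their index order here is an assumption.
LABELS = ("spam", "regular", "marketing", "gibberish")


def build_input(context: Sequence[str], user_message: str, sep: str = "\n") -> str:
    """Join prior turns and the message to classify, oldest first.

    The newline separator is an assumption; the README only shows one
    item per line.
    """
    parts = [turn.strip() for turn in context if turn.strip()]
    parts.append(user_message.strip())
    return sep.join(parts)


def top_label(scores: Sequence[float], labels: Sequence[str] = LABELS) -> tuple[str, float]:
    """Return the highest-scoring label and its score."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best]


text = build_input(
    ["Subject: Your invoice", "Hi, thanks for reaching out."],
    "Click here to claim your free prize!",
)
print(text)

# Illustrative scores only, not real model output.
print(top_label([0.91, 0.04, 0.03, 0.02]))
```

The string returned by `build_input` could then be passed to a text-classification pipeline loaded from this repository, assuming the model's exported label names match the list in the README.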