baptistejamin committed on
Commit 3778541 · verified · 1 Parent(s): e7cd1e3

Update README.md

Files changed (1)
  1. README.md +53 -1
README.md CHANGED
@@ -23,4 +23,56 @@ widget:
23
  example_title: "Spam"
24
  metrics:
25
  - f1
26
- ---
26
+ ---
27
+
28
+ # 🛡️ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)
29
+
30
+ This repository provides a **lightweight, encoder-only multi-turn classifier** designed to detect spam and unwanted content across **emails** and **chat conversations**.
31
+
32
+ It supports short and long messages, as well as **multi-turn conversational inputs** (metadata + message).
33
+
34
+ It was trained on a mixed dataset of emails, support chats, and messaging threads, with a 14B teacher model.
35
+
36
+ This is a fast, encoder-only model, trained by distillation from a 14B teacher model on a dataset of 20M records.
37
+
38
+ ---
39
+
40
+ ## ✨ Features
41
+
42
+ - **Encoder-only architecture** → returns a score per label instead of generating text
43
+ - **Multi-turn support** → handles conversation history and context windows
44
+ - **Hybrid input domain** → optimized for both chat messages & email bodies
45
+ - **High-throughput** → suitable for millions of messages/day
46
+ - **Ideal for security filters** (spam, scams, phishing, self-promotional content)
47
+ - **Open-source** and deployable anywhere (CPU or GPU)
48
+
49
+ ---
50
+
51
+ ## 🔧 Model Architecture
52
+
53
+ - **Type:** Encoder-only (XLM-RoBERTa Large)
54
+
55
+ - **Input format:**
56
+
57
+ [CONTEXT 1]
58
+ [CONTEXT 2]
59
+ ...
60
+ [USER MESSAGE]
61
+
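The input layout above can be sketched as a small helper that places earlier turns first and the message to classify last. This is a minimal sketch, not the author's preprocessing: `build_input` is a hypothetical name, and the newline separator is an assumption taken from the layout shown, since the README does not state which separator was used in training.

```python
from typing import Sequence


def build_input(context: Sequence[str], user_message: str) -> str:
    """Join earlier turns and the message to classify, one per line.

    Assumption: turns are separated by a newline, mirroring the layout
    shown in the README. The real training separator is not documented.
    """
    # Drop empty or whitespace-only context turns and trim the rest.
    turns = [turn.strip() for turn in context if turn.strip()]
    # The message to classify always comes last.
    turns.append(user_message.strip())
    return "\n".join(turns)


text = build_input(
    ["Hi, I have a question about my invoice.", "Sure, what is the invoice number?"],
    "It is the one from last month.",
)
print(text)
```

Keeping the message last means the same helper covers the single-message case: an empty context list returns the message unchanged.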
62
+ Labels include:
63
+
64
+ - `spam`
65
+ - `regular` (ham)
66
+ - `marketing`
67
+ - `gibberish`
68
+
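To show how raw classifier outputs could be mapped onto these four labels, here is a hedged sketch in plain Python. It assumes a single-label setup (softmax over four logits) and assumes the label order below; the actual order lives in the model's configuration and is not given in this README. `pick_label` and `LABELS` are illustrative names.

```python
import math
from typing import Dict, Sequence, Tuple

# Assumed label order; check the model configuration for the real mapping.
LABELS = ("spam", "regular", "marketing", "gibberish")


def softmax(logits: Sequence[float]) -> list:
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def pick_label(
    logits: Sequence[float], labels: Sequence[str] = LABELS
) -> Tuple[str, Dict[str, float]]:
    """Return the highest-scoring label and the full label-to-score map."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    scores = dict(zip(labels, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores


label, scores = pick_label([2.0, 0.5, 0.1, -1.0])
print(label, round(scores[label], 3))
```

A security filter would typically act only when the `spam` score passes a threshold tuned on its own traffic, rather than relying on the top label alone.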
69
+ ---
70
+
71
+ ## Benchmark
72
+
73
+ - F1 Spam: 0.90
74
+ - F1 Regular: 0.95
75
+ - F1 Marketing: 0.87
76
+ - F1 Gibberish: 0.94
77
+
78
+ While this model is not perfect, it catches spam quickly and performs far better than Bayesian filters.
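For a single summary number, the four per-label F1 scores listed under Benchmark can be averaged. The sketch below computes an unweighted (macro) mean; this figure is derived here from the reported scores and is not itself reported in the README, and it ignores how many examples each label has.

```python
from statistics import mean

# Per-label F1 scores as listed in the Benchmark section.
f1_per_label = {
    "spam": 0.90,
    "regular": 0.95,
    "marketing": 0.87,
    "gibberish": 0.94,
}

# Macro F1: every label counts equally, regardless of class frequency.
macro_f1 = mean(f1_per_label.values())
print(f"macro F1 = {macro_f1:.3f}")
```

The result is (0.90 + 0.95 + 0.87 + 0.94) / 4 = 0.915. A support-weighted average would need per-label example counts, which are not given.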