language:
  - en
  - fr
  - ru
  - zh
  - ja
  - pl
  - pt
  - nl
  - ar
base_model:
  - FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
library_name: transformers
tags:
  - spam
  - email
  - phone
  - chat
widget:
  - text: >-
      GET 100 BTC right now! Just send 1 USD to metaservices@gmail.com via
      PayPal
    example_title: Spam
metrics:
  - f1

🛡️ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)

This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.

It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).

It was trained on a mixed dataset of emails, support chats, and messaging threads.

It is a fast, encoder-only model, distilled from a 14B teacher model on a dataset of 20M records.


✨ Features

  • Encoder-only architecture → returns a score for each label
  • Multi-turn support → handles conversation history and context windows
  • Hybrid input domain → optimized for both chat messages & email bodies
  • High-throughput → suitable for millions of messages/day
  • Ideal for security filters (spam, scams, phishing, self-promotion content)
  • Open-source and deployable anywhere (CPU or GPU)
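
To make the throughput point concrete, here is a minimal sketch of batched scoring with the `transformers` text-classification pipeline. The model ID is a placeholder (this card does not state the repository name), and the batch size of 32 is an assumption to tune for your hardware.

```python
from typing import Dict, Iterator, List, Sequence

# Placeholder, not a real repository: replace with this model's ID on the Hub.
MODEL_ID = "your-org/your-spam-classifier"


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_all(texts: Sequence[str], batch_size: int = 32,
                 device: int = -1) -> List[List[Dict]]:
    """Score every text; returns one list of {label, score} dicts per text."""
    # Imported lazily so `chunked` stays usable without transformers installed.
    from transformers import pipeline

    clf = pipeline(
        "text-classification",
        model=MODEL_ID,
        top_k=None,     # return a score for every label, not only the best one
        device=device,  # -1 runs on CPU, 0 on the first GPU
    )
    results: List[List[Dict]] = []
    for batch in chunked(texts, batch_size):
        # truncation=True clips inputs that exceed the encoder's maximum length
        results.extend(clf(batch, truncation=True))
    return results
```

Calling `classify_all(messages)` downloads the model on first use, so build the pipeline once per process rather than once per message.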

🔧 Model Architecture

  • Type: Encoder-only (XLM-RoBERTa Large)

  • Input format:

[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
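
A minimal sketch of assembling that input from a conversation. The card does not say how the context turns and the final message are delimited, so the newline separator and the `max_context_turns` cap below are assumptions, and `build_input` is a hypothetical helper, not part of any library.

```python
from typing import List, Sequence


def build_input(context: Sequence[str], message: str,
                max_context_turns: int = 5, sep: str = "\n") -> str:
    """Join the most recent context turns and the message to classify.

    Order follows the card: [CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE].
    `sep` and `max_context_turns` are illustrative assumptions.
    """
    if max_context_turns < 0:
        raise ValueError("max_context_turns must be non-negative")
    # context[-0:] would return the whole list, so handle zero explicitly.
    recent: List[str] = list(context[-max_context_turns:]) if max_context_turns else []
    # Drop empty turns and surrounding whitespace before joining.
    parts = [turn.strip() for turn in recent if turn.strip()]
    parts.append(message.strip())
    return sep.join(parts)


text = build_input(
    ["Hi, I have a question about my invoice.", "Sure, how can I help?"],
    "GET 100 BTC right now! Just send 1 USD via PayPal",
)
```

The resulting `text` is what would be passed to the tokenizer; older turns beyond the cap are dropped so the message being classified is never the part that gets truncated away first.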

Labels include:

  • spam
  • regular (ham)
  • marketing
  • gibberish
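
To show how the encoder's raw outputs map onto these labels, here is a small sketch that applies a softmax to four logits and picks the top label. The label order and the 0.5 spam threshold are assumptions for illustration; with `transformers`, read the real mapping from `model.config.id2label` instead of hard-coding it.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed order, for illustration only; the real mapping lives in config.id2label.
LABELS: List[str] = ["spam", "regular", "marketing", "gibberish"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: Sequence[float]) -> Tuple[str, Dict[str, float]]:
    """Return the highest-scoring label and the full label-to-score map."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    scores = dict(zip(LABELS, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores


def is_spam(logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag a message when the spam probability reaches the threshold."""
    _, scores = top_label(logits)
    return scores["spam"] >= threshold
```

Thresholding the spam score rather than taking the top label lets you trade precision against recall for your own traffic.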

Benchmark

  • F1 Spam: 0.90
  • F1 Regular: 0.95
  • F1 Marketing: 0.87
  • F1 Gibberish: 0.94
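
For reference, a per-label F1 score is the harmonic mean of precision and recall for that label. The sketch below computes it from gold and predicted labels; the six-example lists are made-up toy data, not the evaluation set behind the numbers above.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Per-label F1 = 2 * precision * recall / (precision + recall)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
toy_gold = ["spam", "spam", "regular", "regular", "marketing", "spam"]
toy_pred = ["spam", "regular", "regular", "regular", "marketing", "spam"]

for name in ["spam", "regular", "marketing", "gibberish"]:
    print(f"F1 {name}: {f1_for_label(toy_gold, toy_pred, name):.2f}")
```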

While this model is not perfect, it catches spam quickly and performs substantially better than Bayesian filters.