language:
  - en
  - fr
  - ru
  - zh
  - ja
  - pl
  - pt
  - nl
  - ar
base_model:
  - FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
library_name: transformers
tags:
  - spam
  - email
  - phone
  - chat
widget:
  - text: >-
      GET 100 BTC right now! Just send 1 USD to metaservices@gmail.com via
      PayPal
    example_title: Spam
metrics:
  - f1

🛡️ Hybrid Chat + Email Spam Classifier (Encoder-Only, Multi-Turn)

This repository provides a lightweight, encoder-only multi-turn classifier designed to detect spam and unwanted content across emails and chat conversations.

It supports short and long messages, as well as multi-turn conversational inputs (metadata + message).

It is a fast encoder-only model, trained by distillation from a 14B teacher model on a mixed dataset of 20M records covering emails, support chats, and messaging threads.

✨ Features

Encoder-only architecture → returns a score per label instead of generating text
Multi-turn support → handles conversation history and context windows
Hybrid input domain → optimized for both chat messages & email bodies
High-throughput → suitable for millions of messages/day
Ideal for security filters (spam, scams, phishing, self-promotional content)
Open-source and deployable anywhere (CPU or GPU)
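
As a rough illustration of the high-throughput use case, the sketch below batches messages through the `transformers` text-classification pipeline. The model ID is a placeholder to replace with this repository's actual name, and the `chunked` and `classify` helpers, the batch size, and the `device` default are assumptions to adapt to your hardware.

```python
from typing import Iterator, List

# Placeholder: replace with this repository's actual model ID on the Hub.
MODEL_ID = "your-namespace/your-spam-classifier"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify(messages: List[str], model_id: str = MODEL_ID,
             batch_size: int = 32, device: int = -1) -> List[dict]:
    """Return one {'label', 'score'} dict per message (top label only).

    device=-1 runs on CPU; device=0 uses the first GPU.
    """
    # Imported lazily so the chunking helper works without transformers.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id, device=device)
    results: List[dict] = []
    for batch in chunked(messages, batch_size):
        # truncation=True clips inputs longer than the model's max length.
        results.extend(clf(batch, truncation=True))
    return results
```

Calling `classify(["some message", "another message"])` downloads the model on first use, so it needs network access and enough memory for an XLM-RoBERTa Large checkpoint.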

🔧 Model Architecture

Type: Encoder-only (XLM-RoBERTa Large)
Input format:

[CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE]
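
The card does not spell out how context entries and the user message are delimited, so the following is only a sketch of one plausible way to assemble an input. The newline separator, the `max_context_turns` parameter, and the `build_input` helper are all assumptions for illustration, not the authors' preprocessing.

```python
from typing import Sequence


def build_input(context: Sequence[str], user_message: str,
                max_context_turns: int = 4, sep: str = "\n") -> str:
    """Join the most recent context entries with the message to classify.

    Order follows the card: [CONTEXT 1] [CONTEXT 2] ... [USER MESSAGE].
    The separator is an assumption; the card does not specify one.
    """
    # Keep only the latest turns so long threads stay within the input limit.
    recent = list(context[-max_context_turns:]) if max_context_turns > 0 else []
    parts = [turn.strip() for turn in recent if turn.strip()]
    parts.append(user_message.strip())
    return sep.join(parts)


text = build_input(
    ["Hi, I need help with my order.", "Sure, what is the order number?"],
    "GET 100 BTC right now! Just send 1 USD via PayPal",
)
```

Keeping the user message last means that, if the input has to be shortened, older context is dropped first and the message being classified stays intact.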

Labels include:

spam
regular (ham)
marketing
gibberish
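
Because the classifier returns a score per label, a downstream filter still has to decide what to do with those scores. A minimal sketch of such a policy follows. The thresholds, the action names, the `decide` helper, and the choice to treat marketing and gibberish like spam are hypothetical; the label strings assume the names listed above, so check the model's `config.id2label` before relying on them.

```python
from typing import Dict

# Label names as listed in this card; verify against config.id2label.
LABELS = ("spam", "regular", "marketing", "gibberish")


def decide(scores: Dict[str, float], block_at: float = 0.9,
           review_at: float = 0.5) -> str:
    """Map per-label scores to an action: 'block', 'review', or 'allow'."""
    unknown = set(scores) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    top_label = max(scores, key=scores.get)
    top_score = scores[top_label]
    if top_label == "regular":
        return "allow"
    # Assumed policy: any non-regular top label counts as unwanted content.
    if top_score >= block_at:
        return "block"
    if top_score >= review_at:
        return "review"
    return "allow"
```

Routing mid-confidence messages to review instead of blocking them is one way to limit false positives on legitimate mail.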

Benchmark

F1 Spam: 0.90
F1 Regular: 0.95
F1 Marketing: 0.87
F1 Gibberish: 0.94
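
For a single summary number, the per-class scores above can be averaged. The snippet below computes the unweighted (macro) mean. That figure is derived here from the four reported values and is not a result published by the authors; a support-weighted average would need per-class sample counts, which the card does not give.

```python
# Per-class F1 scores as reported in the benchmark above.
f1_by_label = {
    "spam": 0.90,
    "regular": 0.95,
    "marketing": 0.87,
    "gibberish": 0.94,
}

# Macro F1: plain mean over classes, ignoring class frequencies.
macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)
print(f"macro F1 = {macro_f1:.3f}")  # macro F1 = 0.915
```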

While this model is not perfect, it catches spam quickly and performs substantially better than Bayesian filters.