"""
app.py - Algospeak Classifier demo
Streamlit UI for the dual BERTweet model.
Type a social media post and see the predicted class + confidence scores.
Predictions are logged to a private HF dataset repo via CommitScheduler.
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent / "poc" / "src"))
import csv
import yaml
import torch
import numpy as np
import streamlit as st
from datetime import datetime
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download, CommitScheduler
from inference import load_unsupervised_encoder, classify_text
BASE_DIR = Path(__file__).parent
MODEL_REPO = "timagonch/algospeak-classifier-model"
LOG_REPO = "timagonch/algospeak-logs"
LOG_DIR = BASE_DIR / "logs"
LOG_FILE = LOG_DIR / "predictions.csv"
LOG_COLS = ["text", "predicted_label", "score_allowed", "score_obscene", "score_mature", "score_algospeak", "timestamp"]
CLASS_COLORS = {
    "Allowed": "green",
    "Obscene Language": "red",
    "Mature Content": "orange",
    "Algospeak": "violet",
}
ABOUT_MD = """
## Algospeak Classifier: Project Overview
This tool is the result of a semester-long research project exploring **algospeak detection** as part of a content moderation pipeline for social media. The goal was to classify posts not just by whether they contain harmful content, but by *how* that content is expressed, including coded language specifically designed to evade automated filters.
---
### What is Algospeak?
Algospeak is a form of linguistic camouflage that emerged organically on platforms like TikTok, Bluesky, and Twitter/X. When users learn that certain words trigger automated takedowns, they develop workarounds, substitutions that carry the same meaning but bypass keyword filters:
- **"unalive"** instead of suicide or self-harm
- **"corn"** for explicit sexual content
- **"k!ll", "k1ll", "k.i.l.l"** for violence
- Phonetic swaps (e.g. "seggs"), emoji substitutions, abbreviations, repurposed innocent words
The challenge is that these substitutions evolve constantly, vary by community, and are nearly impossible to keep up with using hand-crafted rules. The only durable solution is a model that understands *intent* from context.
---
### Architecture
The model is a **Dual BERTweet** network consisting of two separate BERTweet encoders (vinai/bertweet-base, ~135M parameters each) trained jointly with a contrastive learning objective called Supervised InfoNCE:
- **Supervised encoder**: receives label-prefixed text during training (e.g. `"Algospeak: gonna unalive myself"`). Acts as a teacher by injecting class identity directly into the text.
- **Unsupervised encoder**: receives raw text only, and is trained to match the supervised encoder's embedding space via the InfoNCE loss.
After training, the supervised encoder is discarded entirely. At inference, the unsupervised encoder embeds an incoming post and compares it via cosine similarity against four **class prototypes**, the average embedding per class computed from the training set. The nearest prototype wins. The algospeak prototype uses inverse deny-term frequency weighting so rarer coded forms aren't drowned out by common ones.
This approach was chosen specifically because it requires no rulesets, no exemplar lookup, and no deny list at inference time: just a single forward pass and a dot product.
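The inference rule can be sketched in a few lines (a minimal illustration, not the actual `classify_text` in `inference.py`; the toy prototype matrix and the function name are made up for the example):

```python
import numpy as np

def nearest_prototype(embedding, prototypes, labels):
    # Normalize both sides so the dot product equals cosine similarity.
    emb = embedding / np.linalg.norm(embedding)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = protos @ emb                      # one similarity per class
    return labels[int(np.argmax(sims))]      # nearest prototype wins

labels = ["Allowed", "Obscene Language", "Mature Content", "Algospeak"]
toy_prototypes = np.eye(4)                   # stand-in for the learned centers
print(nearest_prototype(np.array([0.1, 0.0, 0.05, 0.9]), toy_prototypes, labels))
```

The real pipeline produces `embedding` from the unsupervised BERTweet encoder and loads `prototypes` from `prototypes.npy`.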
---
### Data Collection & Manual Reclassification
The dataset was built from Bluesky social media posts collected by the team. Raw posts came in with initial labels, but those labels were noisy, so a careful manual re-review pass was done across the dataset.
To improve consistency on the class 1 and 2 boundary, **two deny lists** were built:
- `deny_list_class1.txt`: 115 terms covering slurs and hate speech
- `deny_list_class2.txt`: 521 terms covering explicit sexual content, drugs, and violence
A reclassification script applied deny-list hit logic: if a post contained a term from a list and had been labeled in the wrong class, it was overridden. This pass changed ~25,000 labels across the dataset, producing a cleaner `reclassified_final.csv` as the new source of truth.
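The override logic amounts to something like the following (a hypothetical sketch; the real script, its term lists, and its tie-breaking rules are not shown here):

```python
import re

def apply_deny_lists(text, label, deny_class1, deny_class2):
    # Tokenize crudely and check for any deny-term hit.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    if tokens & deny_class1:
        return 1   # hit on slurs / hate speech list -> Obscene Language
    if tokens & deny_class2:
        return 2   # hit on explicit / drugs / violence list -> Mature Content
    return label   # no hit: keep the original label

deny1 = {"slurword"}                  # stand-ins; the real lists hold 115 / 521 terms
deny2 = {"cocaine", "shoot"}
print(apply_deny_lists("gonna shoot my shot today", 0, deny1, deny2))
```

Run over the whole dataset, a pass like this is what flipped the ~25,000 labels.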
---
### Synthetic Algospeak Generation
Class 3 (Algospeak) was by far the hardest class to collect naturally. Real algospeak examples are sparse and inconsistently labeled. To address this, a **GPT-4-turbo generation pipeline** was built that takes class 1 and 2 posts and transforms them into algospeak equivalents.
The pipeline used a 7-technique taxonomy grounded in documented community behavior:
character substitution, phonetic swaps, pictorial (emoji), abbreviation, repurposing of innocent words, paraphrase, and known community-specific terms. Each term was assigned a technique only if there was a documented example in a hints file, preventing the model from hallucinating plausible-but-wrong substitutions. A deny-term inflection detector ensured that forms like "stabbing" (not just "stab") were correctly passed to the generator.
This produced **13,264 algospeak pairs** (original + transformed), with the original post always kept in the same split as its algospeak counterpart to prevent leakage.
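The inflection handling can be approximated like this (a simplified, deliberately over-generating sketch; the actual detector is not shown):

```python
def candidate_inflections(term):
    # Over-generate candidate forms: spurious forms never occur in real
    # text, so extra candidates are harmless, while a missed form would
    # let an inflected deny term slip past the generator.
    forms = {term, term + "s", term + "ed", term + "ing"}
    if term[-1] in "bdgmnprt":               # crude consonant doubling: stab -> stabbing
        forms |= {term + term[-1] + "ed", term + term[-1] + "ing"}
    if term.endswith("e"):                   # drop silent e: rape -> raping
        forms |= {term[:-1] + "ing", term[:-1] + "ed"}
    return forms

print(sorted(candidate_inflections("stab")))
```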
---
### Training Progression
The model went through several iterations as the dataset and architecture evolved:
**~10k/class: first dual BERTweet run (Apr 6)**
The 414-rule exemplar system was abandoned and replaced with the dual BERTweet architecture. The first full run used ~10,000 posts per class from the cleaned dataset, with a simple random split. Result: **test accuracy 79.9%**.
**~13k/class: group-aware split added (Apr 12)**
The dataset grew to ~13,300 posts per class using the full synthetic pairs. Critically, a **group-aware split** was introduced: original posts and their algospeak counterparts are always assigned to the same split. Without this, the model can train on a post and be evaluated on a near-identical transformed version β inflating results. With it: **test accuracy 85.9%**.
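A group-aware split takes only a few lines (a sketch under an assumed data layout: each row carries a group id shared by an original post and its algospeak counterpart):

```python
import random

def group_aware_split(items, test_frac=0.2, seed=42):
    # Split by group, not by row: a whole (original, algospeak) pair
    # lands on exactly one side, so near-duplicates cannot leak
    # between train and test.
    groups = sorted({group for group, _ in items})
    random.Random(seed).shuffle(groups)
    test_groups = set(groups[: int(len(groups) * test_frac)])
    train = [text for group, text in items if group not in test_groups]
    test = [text for group, text in items if group in test_groups]
    return train, test

pairs = [(i // 2, f"post {i}") for i in range(10)]   # two rows per group
train, test = group_aware_split(pairs)
```

A plain `random.shuffle` over rows would put "stab you" in train and "st4b you" in test, which is exactly the inflation this prevents.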
**~13k/class: weighted prototype + fix (Apr 13)**
The algospeak class prototype was upgraded to use inverse deny-term frequency weighting, giving rarer substitution forms more influence on the prototype center. A data loader fix was also applied. Result: **test accuracy 89.4%** β the best result on the full dataset.
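The weighted prototype amounts to replacing the plain mean with an inverse-frequency-weighted mean (a sketch of the idea; the actual implementation and its deny-term bookkeeping are not shown):

```python
import numpy as np
from collections import Counter

def weighted_prototype(embeddings, deny_terms):
    # Weight each example by 1 / frequency of its deny term, so a coded
    # form seen twice contributes the same total mass to the prototype
    # center as one seen hundreds of times.
    counts = Counter(deny_terms)
    weights = np.array([1.0 / counts[t] for t in deny_terms])
    weights /= weights.sum()
    center = (weights[:, None] * embeddings).sum(axis=0)
    return center / np.linalg.norm(center)   # unit-norm prototype

toy_embeddings = np.eye(3)                   # one toy embedding per example
proto = weighted_prototype(toy_embeddings, ["unalive", "unalive", "seggs"])
```

In the toy example the single "seggs" row pulls the center as hard as both "unalive" rows combined.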
**LLM audit & reclassification (Apr 16)**
A GPT-4o-mini audit reclassified ~39,000 posts from the existing splits. The LLM had stricter criteria for class 2 (Mature Content), which collapsed many borderline posts into class 0. This reduced class 2 to ~3,300 posts, a sharp drop from 13k, and the new splits had to be rebalanced much smaller. Result: **test accuracy 76.5%**. The bottleneck had shifted to class 2.
**3-class experiment (Apr 16)**
As a parallel track, classes 1 and 2 were merged into a single "Harmful Content" class, reducing the problem to 3 classes. With fewer boundaries to learn, the model performed strongly: **test accuracy 89.2%, Algospeak F1 = 93.8%**. This confirmed the architecture works well; the difficulty is class 1 vs. 2 separation.
---
### Four-Class Controlled Experiment (This Model)
With the full dataset constrained by class 2 data scarcity, a focused experiment was run using a cleaner, smaller, more carefully curated subset of ~874 posts per class. The synthetic generation pipeline was rerun with stricter controls, producing 429 new algospeak examples. Two deny lists were merged into a single experiment-local list to avoid cross-contamination between class 1 and 2 deny terms.
#### Temperature Ablation
Temperature (τ) controls the sharpness of the contrastive loss gradient. Lower τ forces tighter clusters, which risks overfitting on small datasets. Higher τ acts as regularization. Four runs were compared:
| Run | τ | Test Acc | Macro F1 | Algospeak F1 | Mean AUC |
|-----|------|----------|----------|--------------|----------|
| 1 | 0.10 | 0.7918 | 0.7957 | 0.9032 | 0.9452 |
| 2 | 0.07 | 0.7214 | 0.7256 | 0.8138 | 0.8979 |
| **3 ←** | **0.15** | **0.8065** | **0.8083** | **0.9045** | 0.9351 |
| 4 | 0.20 | 0.8240 | 0.8252 | 0.9161 | 0.9345 |
Run 4 (τ=0.20) had the best aggregate numbers, but it misclassified *"gonna unalive myself fr fr cant take this anymore"* as **Allowed**. That is one of the best-documented suicide-related algospeak phrases in existence. A false negative on a phrase like that represents a worse failure than a 1.7% drop in overall accuracy, so **τ=0.15 was chosen as the final model**.
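The effect of τ is easy to see numerically (the similarity values below are illustrative, not taken from the model):

```python
import numpy as np

def temperature_softmax(sims, tau):
    # Divide similarities by tau before the softmax: small tau sharpens
    # the distribution toward the top class, large tau flattens it.
    z = np.asarray(sims, dtype=float) / tau
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

sims = [0.82, 0.55, 0.50, 0.30]      # made-up prototype similarities
for tau in (0.07, 0.15, 0.20):
    print(tau, np.round(temperature_softmax(sims, tau), 3))
```

The same scaling drives both the InfoNCE training gradient and the confidence scores shown in the UI.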
---
### Final Model: τ = 0.15
| Metric | Val | Test |
|---|---|---|
| Accuracy | 0.8642 | 0.8065 |
| Macro F1 | 0.8648 | 0.8083 |
| Mean AUC | 0.9600 | 0.9351 |
**Per-class test performance:**
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Allowed | 0.8065 | 0.8621 | 0.8333 |
| Obscene Language | 0.7363 | 0.7701 | 0.7528 |
| Mature Content | 0.7750 | 0.7126 | 0.7425 |
| Algospeak | 0.9221 | 0.8875 | **0.9045** |
Algospeak is the strongest class, which is the point. The remaining error is concentrated at the Obscene Language / Mature Content boundary, where surface vocabulary overlaps significantly (words like "rape" or "shoot" appear in both) and only broader context separates them.
---
*Built with BERTweet (VinAI), PyTorch, and Streamlit. Spring 2026.*
"""
@st.cache_resource(show_spinner="Loading model...")
def load_model():
    with open(BASE_DIR / "poc" / "config.yaml") as f:
        cfg = yaml.safe_load(f)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename="best_model.pt")
    prototypes_path = hf_hub_download(repo_id=MODEL_REPO, filename="prototypes.npy")
    encoder = load_unsupervised_encoder(checkpoint_path, cfg, device)
    prototypes = np.load(prototypes_path)
    tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"], use_fast=False)
    return encoder, prototypes, tokenizer, cfg, device
@st.cache_resource
def get_scheduler():
    import shutil
    LOG_DIR.mkdir(exist_ok=True)
    try:
        existing = hf_hub_download(
            repo_id=LOG_REPO,
            filename="logs/predictions.csv",
            repo_type="dataset",
        )
        shutil.copy(existing, LOG_FILE)
    except Exception:
        # No remote log yet (first run) or download failed; start fresh.
        pass
    return CommitScheduler(
        repo_id=LOG_REPO,
        repo_type="dataset",
        folder_path=LOG_DIR,
        path_in_repo="logs",
        every=5,  # push accumulated rows every 5 minutes
    )
def log_prediction(text, result):
    scheduler = get_scheduler()
    scores = result["scores"]
    row = {
        "text": text,
        "predicted_label": result["predicted_label"],
        "score_allowed": round(scores["Allowed"], 4),
        "score_obscene": round(scores["Obscene Language"], 4),
        "score_mature": round(scores["Mature Content"], 4),
        "score_algospeak": round(scores["Algospeak"], 4),
        "timestamp": datetime.utcnow().isoformat(),
    }
    with scheduler.lock:
        write_header = not LOG_FILE.exists()
        with open(LOG_FILE, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=LOG_COLS)
            if write_header:
                writer.writeheader()
            writer.writerow(row)
# ─────────────────────────────────────────────────────────────────────
# CSS: makes the easter egg popover button invisible until hovered
# ─────────────────────────────────────────────────────────────────────
st.markdown("""
<style>
.easter-egg-col div[data-testid="stPopover"] button {
    opacity: 0.15;
    transition: opacity 0.3s ease;
    font-size: 28px;
    background: transparent;
    border: none;
    padding: 0;
    line-height: 1;
}
.easter-egg-col div[data-testid="stPopover"] button:hover {
    opacity: 0.85;
}
.easter-egg-col div[data-testid="stPopover"] button p {
    font-size: 28px !important;
}
</style>
""", unsafe_allow_html=True)
# ─────────────────────────────────────────────────────────────────────
# Header row: title left, easter egg right
# ─────────────────────────────────────────────────────────────────────
title_col, egg_col = st.columns([11, 1])
with title_col:
    st.title("Algospeak Classifier")
    st.caption("Dual BERTweet model · type a social media post to classify it.")
with egg_col:
    st.markdown('<div class="easter-egg-col">', unsafe_allow_html=True)
    with st.popover("💬"):
        st.markdown(ABOUT_MD)
    st.markdown('</div>', unsafe_allow_html=True)
# ─────────────────────────────────────────────────────────────────────
# Main UI
# ─────────────────────────────────────────────────────────────────────
text = st.text_area("Post text", height=120, placeholder="Type something here...")
if st.button("Classify", type="primary") and text.strip():
    encoder, prototypes, tokenizer, cfg, device = load_model()
    result = classify_text(text, encoder, prototypes, tokenizer, cfg["max_length"], device, cfg["temperature"])
    label = result["predicted_label"]
    color = CLASS_COLORS[label]
    st.markdown(f"## :{color}[{label}]")
    st.divider()
    st.write("**Similarity scores:**")
    for name, score in sorted(result["scores"].items(), key=lambda x: -x[1]):
        st.progress(float(score), text=name)
    log_prediction(text, result)