"""
app.py - Algospeak Classifier demo
Streamlit UI for the dual BERTweet model.
Type a social media post and see the predicted class + confidence scores.
Predictions are logged to a private HF dataset repo via CommitScheduler.
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent / "poc" / "src"))
import csv
import yaml
import torch
import numpy as np
import streamlit as st
from datetime import datetime
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download, CommitScheduler
from inference import load_unsupervised_encoder, classify_text
BASE_DIR = Path(__file__).parent
MODEL_REPO = "timagonch/algospeak-classifier-model"
LOG_REPO = "timagonch/algospeak-logs"
LOG_DIR = BASE_DIR / "logs"
LOG_FILE = LOG_DIR / "predictions.csv"
LOG_COLS = ["text", "predicted_label", "score_allowed", "score_obscene", "score_mature", "score_algospeak", "timestamp"]
CLASS_COLORS = {
    "Allowed": "green",
    "Obscene Language": "red",
    "Mature Content": "orange",
    "Algospeak": "violet",
}
ABOUT_MD = """
## Algospeak Classifier: Project Overview
This tool is the result of a semester-long research project exploring **algospeak detection** as part of a content moderation pipeline for social media. The goal was to classify posts not just by whether they contain harmful content, but by *how* that content is expressed, including coded language specifically designed to evade automated filters.
---
### What is Algospeak?
Algospeak is a form of linguistic camouflage that emerged organically on platforms like TikTok, Bluesky, and Twitter/X. When users learn that certain words trigger automated takedowns, they develop workarounds, substitutions that carry the same meaning but bypass keyword filters:
- **"unalive"** instead of suicide or self-harm
- **"corn"** for explicit sexual content
- **"k!ll", "k1ll", "k.i.l.l"** for violence
- Phonetic swaps (e.g. "seggs"), emoji substitutions, abbreviations, repurposed innocent words
The challenge is that these substitutions evolve constantly, vary by community, and are nearly impossible to keep up with using hand-crafted rules. The only durable solution is a model that understands *intent* from context.
---
### Architecture
The model is a **Dual BERTweet** network consisting of two separate BERTweet encoders (vinai/bertweet-base, ~135M parameters each) trained jointly with a contrastive learning objective called Supervised InfoNCE:
- **Supervised encoder**: receives label-prefixed text during training (e.g. `"Algospeak: gonna unalive myself"`). Acts as a teacher by injecting class identity directly into the text.
- **Unsupervised encoder**: receives raw text only, and is trained to match the supervised encoder's embedding space via the InfoNCE loss.
After training, the supervised encoder is discarded entirely. At inference, the unsupervised encoder embeds an incoming post and compares it via cosine similarity against four **class prototypes**, the average embedding per class computed from the training set. The nearest prototype wins. The algospeak prototype uses inverse deny-term frequency weighting so rarer coded forms aren't drowned out by common ones.
This approach was chosen specifically because it requires no rulesets, no exemplar lookup, and no deny list at inference time: just a single forward pass and a dot product.
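The inference rule can be sketched in a few lines (a minimal illustration, not the actual `classify_text` in `inference.py`; the toy prototype matrix and the function name are made up for the example):

```python
import numpy as np

def nearest_prototype(embedding, prototypes, labels):
    # Normalize both sides so the dot product equals cosine similarity.
    emb = embedding / np.linalg.norm(embedding)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = protos @ emb                      # one similarity per class
    return labels[int(np.argmax(sims))]      # nearest prototype wins

labels = ["Allowed", "Obscene Language", "Mature Content", "Algospeak"]
toy_prototypes = np.eye(4)                   # stand-in for the learned centers
print(nearest_prototype(np.array([0.1, 0.0, 0.05, 0.9]), toy_prototypes, labels))
```

The real pipeline produces `embedding` from the unsupervised BERTweet encoder and loads `prototypes` from `prototypes.npy`.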
---
### Data Collection & Manual Reclassification
The dataset was built from Bluesky social media posts collected by the team. Raw posts came in with initial labels, but those labels were noisy, so a careful manual re-review pass was done across the dataset.
To improve consistency on the class 1 and 2 boundary, **two deny lists** were built:
- `deny_list_class1.txt`: 115 terms covering slurs and hate speech
- `deny_list_class2.txt`: 521 terms covering explicit sexual content, drugs, and violence
A reclassification script applied deny-list hit logic: if a post contained a term from a list and had been labeled in the wrong class, it was overridden. This pass changed ~25,000 labels across the dataset, producing a cleaner `reclassified_final.csv` as the new source of truth.
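The override logic amounts to something like the following (a hypothetical sketch; the real script, its term lists, and its tie-breaking rules are not shown here):

```python
import re

def apply_deny_lists(text, label, deny_class1, deny_class2):
    # Tokenize crudely and check for any deny-term hit.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    if tokens & deny_class1:
        return 1   # hit on slurs / hate speech list -> Obscene Language
    if tokens & deny_class2:
        return 2   # hit on explicit / drugs / violence list -> Mature Content
    return label   # no hit: keep the original label

deny1 = {"slurword"}                  # stand-ins; the real lists hold 115 / 521 terms
deny2 = {"cocaine", "shoot"}
print(apply_deny_lists("gonna shoot my shot today", 0, deny1, deny2))
```

Run over the whole dataset, a pass like this is what flipped the ~25,000 labels.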
---
### Synthetic Algospeak Generation
Class 3 (Algospeak) was by far the hardest class to collect naturally. Real algospeak examples are sparse and inconsistently labeled. To address this, a **GPT-4-turbo generation pipeline** was built that takes class 1 and 2 posts and transforms them into algospeak equivalents.
The pipeline used a 7-technique taxonomy grounded in documented community behavior:
character substitution, phonetic swaps, pictorial (emoji), abbreviation, repurposing of innocent words, paraphrase, and known community-specific terms. Each term was assigned a technique only if there was a documented example in a hints file, preventing the model from hallucinating plausible-but-wrong substitutions. A deny-term inflection detector ensured that forms like "stabbing" (not just "stab") were correctly passed to the generator.
This produced **13,264 algospeak pairs** (original + transformed), with the original post always kept in the same split as its algospeak counterpart to prevent leakage.
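The inflection handling can be approximated like this (a simplified, deliberately over-generating sketch; the actual detector is not shown):

```python
def candidate_inflections(term):
    # Over-generate candidate forms: spurious forms never occur in real
    # text, so extra candidates are harmless, while a missed form would
    # let an inflected deny term slip past the generator.
    forms = {term, term + "s", term + "ed", term + "ing"}
    if term[-1] in "bdgmnprt":               # crude consonant doubling: stab -> stabbing
        forms |= {term + term[-1] + "ed", term + term[-1] + "ing"}
    if term.endswith("e"):                   # drop silent e: rape -> raping
        forms |= {term[:-1] + "ing", term[:-1] + "ed"}
    return forms

print(sorted(candidate_inflections("stab")))
```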
---
### Training Progression
The model went through several iterations as the dataset and architecture evolved:
**~10k/class: first dual BERTweet run (Apr 6)**
The 414-rule exemplar system was abandoned and replaced with the dual BERTweet architecture. The first full run used ~10,000 posts per class from the cleaned dataset, with a simple random split. Result: **test accuracy 79.9%**.
**~13k/class: group-aware split added (Apr 12)**
The dataset grew to ~13,300 posts per class using the full synthetic pairs. Critically, a **group-aware split** was introduced: original posts and their algospeak counterparts are always assigned to the same split. Without this, the model can train on a post and be evaluated on a near-identical transformed version β inflating results. With it: **test accuracy 85.9%**.
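A group-aware split takes only a few lines (a sketch under an assumed data layout: each row carries a group id shared by an original post and its algospeak counterpart):

```python
import random

def group_aware_split(items, test_frac=0.2, seed=42):
    # Split by group, not by row: a whole (original, algospeak) pair
    # lands on exactly one side, so near-duplicates cannot leak
    # between train and test.
    groups = sorted({group for group, _ in items})
    random.Random(seed).shuffle(groups)
    test_groups = set(groups[: int(len(groups) * test_frac)])
    train = [text for group, text in items if group not in test_groups]
    test = [text for group, text in items if group in test_groups]
    return train, test

pairs = [(i // 2, f"post {i}") for i in range(10)]   # two rows per group
train, test = group_aware_split(pairs)
```

A plain `random.shuffle` over rows would put "stab you" in train and "st4b you" in test, which is exactly the inflation this prevents.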
**~13k/class: weighted prototype + fix (Apr 13)**
The algospeak class prototype was upgraded to use inverse deny-term frequency weighting, giving rarer substitution forms more influence on the prototype center. A data loader fix was also applied. Result: **test accuracy 89.4%** β the best result on the full dataset.
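The weighted prototype amounts to replacing the plain mean with an inverse-frequency-weighted mean (a sketch of the idea; the actual implementation and its deny-term bookkeeping are not shown):

```python
import numpy as np
from collections import Counter

def weighted_prototype(embeddings, deny_terms):
    # Weight each example by 1 / frequency of its deny term, so a coded
    # form seen twice contributes the same total mass to the prototype
    # center as one seen hundreds of times.
    counts = Counter(deny_terms)
    weights = np.array([1.0 / counts[t] for t in deny_terms])
    weights /= weights.sum()
    center = (weights[:, None] * embeddings).sum(axis=0)
    return center / np.linalg.norm(center)   # unit-norm prototype

toy_embeddings = np.eye(3)                   # one toy embedding per example
proto = weighted_prototype(toy_embeddings, ["unalive", "unalive", "seggs"])
```

In the toy example the single "seggs" row pulls the center as hard as both "unalive" rows combined.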
**LLM audit & reclassification (Apr 16)**
A GPT-4o-mini audit reclassified ~39,000 posts from the existing splits. The LLM had stricter criteria for class 2 (Mature Content), which collapsed many borderline posts into class 0. This reduced class 2 to ~3,300 posts, a sharp drop from 13k, and the new splits had to be rebalanced much smaller. Result: **test accuracy 76.5%**. The bottleneck had shifted to class 2.
**3-class experiment (Apr 16)**
As a parallel track, classes 1 and 2 were merged into a single "Harmful Content" class, reducing the problem to 3 classes. With fewer boundaries to learn, the model performed strongly: **test accuracy 89.2%, Algospeak F1 = 93.8%**. This confirmed the architecture works well; the difficulty is class 1 vs. 2 separation.
---
### Four-Class Controlled Experiment (This Model)
With the full dataset constrained by class 2 data scarcity, a focused experiment was run using a cleaner, smaller, more carefully curated subset of ~874 posts per class. The synthetic generation pipeline was rerun with stricter controls, producing 429 new algospeak examples. Two deny lists were merged into a single experiment-local list to avoid cross-contamination between class 1 and 2 deny terms.
#### Temperature Ablation
Temperature (τ) controls the sharpness of the contrastive loss gradient. Lower τ forces tighter clusters, which risks overfitting on small datasets. Higher τ acts as regularization. Four runs were compared:
| Run | τ | Test Acc | Macro F1 | Algospeak F1 | Mean AUC |
|-----|------|----------|----------|--------------|----------|
| 1 | 0.10 | 0.7918 | 0.7957 | 0.9032 | 0.9452 |
| 2 | 0.07 | 0.7214 | 0.7256 | 0.8138 | 0.8979 |
| **3 ←** | **0.15** | **0.8065** | **0.8083** | **0.9045** | 0.9351 |
| 4 | 0.20 | 0.8240 | 0.8252 | 0.9161 | 0.9345 |
Run 4 (τ=0.20) had the best aggregate numbers, but it misclassified *"gonna unalive myself fr fr cant take this anymore"* as **Allowed**. That is one of the best-documented suicide-related algospeak phrases in existence. A false negative on a phrase like that represents a worse failure than a 1.7% drop in overall accuracy, so **τ=0.15 was chosen as the final model**.
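The effect of τ is easy to see numerically (the similarity values below are illustrative, not taken from the model):

```python
import numpy as np

def temperature_softmax(sims, tau):
    # Divide similarities by tau before the softmax: small tau sharpens
    # the distribution toward the top class, large tau flattens it.
    z = np.asarray(sims, dtype=float) / tau
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

sims = [0.82, 0.55, 0.50, 0.30]      # made-up prototype similarities
for tau in (0.07, 0.15, 0.20):
    print(tau, np.round(temperature_softmax(sims, tau), 3))
```

The same scaling drives both the InfoNCE training gradient and the confidence scores shown in the UI.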
---
### Final Model: τ = 0.15
| Metric | Val | Test |
|---|---|---|
| Accuracy | 0.8642 | 0.8065 |
| Macro F1 | 0.8648 | 0.8083 |
| Mean AUC | 0.9600 | 0.9351 |
**Per-class test performance:**
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Allowed | 0.8065 | 0.8621 | 0.8333 |
| Obscene Language | 0.7363 | 0.7701 | 0.7528 |
| Mature Content | 0.7750 | 0.7126 | 0.7425 |
| Algospeak | 0.9221 | 0.8875 | **0.9045** |
Algospeak is the strongest class, which is the point. The remaining error is concentrated at the Obscene Language / Mature Content boundary, where surface vocabulary overlaps significantly (words like "rape" or "shoot" appear in both) and only broader context separates them.
---
*Built with BERTweet (VinAI), PyTorch, and Streamlit. Spring 2026.*
"""
@st.cache_resource(show_spinner="Loading model...")
def load_model():
    with open(BASE_DIR / "poc" / "config.yaml") as f:
        cfg = yaml.safe_load(f)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename="best_model.pt")
    prototypes_path = hf_hub_download(repo_id=MODEL_REPO, filename="prototypes.npy")
    encoder = load_unsupervised_encoder(checkpoint_path, cfg, device)
    prototypes = np.load(prototypes_path)
    tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"], use_fast=False)
    return encoder, prototypes, tokenizer, cfg, device
@st.cache_resource
def get_scheduler():
    import shutil
    LOG_DIR.mkdir(exist_ok=True)
    try:
        existing = hf_hub_download(
            repo_id=LOG_REPO,
            filename="logs/predictions.csv",
            repo_type="dataset",
        )
        shutil.copy(existing, LOG_FILE)
    except Exception:
        # No remote log yet (first run) or download failed; start fresh.
        pass
    return CommitScheduler(
        repo_id=LOG_REPO,
        repo_type="dataset",
        folder_path=LOG_DIR,
        path_in_repo="logs",
        every=5,  # push accumulated rows every 5 minutes
    )
def log_prediction(text, result):
    scheduler = get_scheduler()
    scores = result["scores"]
    row = {
        "text": text,
        "predicted_label": result["predicted_label"],
        "score_allowed": round(scores["Allowed"], 4),
        "score_obscene": round(scores["Obscene Language"], 4),
        "score_mature": round(scores["Mature Content"], 4),
        "score_algospeak": round(scores["Algospeak"], 4),
        "timestamp": datetime.utcnow().isoformat(),
    }
    with scheduler.lock:
        write_header = not LOG_FILE.exists()
        with open(LOG_FILE, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=LOG_COLS)
            if write_header:
                writer.writeheader()
            writer.writerow(row)
# ─────────────────────────────────────────────────────────────────────
# CSS: makes the easter egg popover button invisible until hovered
# ─────────────────────────────────────────────────────────────────────
st.markdown("""
<style>
.easter-egg-col div[data-testid="stPopover"] button {
    opacity: 0.15;
    transition: opacity 0.3s ease;
    font-size: 28px;
    background: transparent;
    border: none;
    padding: 0;
    line-height: 1;
}
.easter-egg-col div[data-testid="stPopover"] button:hover {
    opacity: 0.85;
}
.easter-egg-col div[data-testid="stPopover"] button p {
    font-size: 28px !important;
}
</style>
""", unsafe_allow_html=True)
# ─────────────────────────────────────────────────────────────────────
# Header row: title left, easter egg right
# ─────────────────────────────────────────────────────────────────────
title_col, egg_col = st.columns([11, 1])
with title_col:
    st.title("Algospeak Classifier")
    st.caption("Dual BERTweet model · type a social media post to classify it.")
with egg_col:
    st.markdown('<div class="easter-egg-col">', unsafe_allow_html=True)
    with st.popover("💬"):
        st.markdown(ABOUT_MD)
    st.markdown('</div>', unsafe_allow_html=True)
# ─────────────────────────────────────────────────────────────────────
# Main UI
# ─────────────────────────────────────────────────────────────────────
text = st.text_area("Post text", height=120, placeholder="Type something here...")
if st.button("Classify", type="primary") and text.strip():
    encoder, prototypes, tokenizer, cfg, device = load_model()
    result = classify_text(text, encoder, prototypes, tokenizer, cfg["max_length"], device, cfg["temperature"])
    label = result["predicted_label"]
    color = CLASS_COLORS[label]
    st.markdown(f"## :{color}[{label}]")
    st.divider()
    st.write("**Similarity scores:**")
    for name, score in sorted(result["scores"].items(), key=lambda x: -x[1]):
        st.progress(float(score), text=name)
    log_prediction(text, result)