Spaces:
Running
Running
| """ | |
| app.py β Algospeak Classifier demo | |
| Streamlit UI for the dual BERTweet model. | |
| Type a social media post and see the predicted class + confidence scores. | |
| Predictions are logged to a private HF dataset repo via CommitScheduler. | |
| """ | |
| import sys | |
| from pathlib import Path | |
| sys.path.insert(0, str(Path(__file__).parent / "poc" / "src")) | |
| import csv | |
| import yaml | |
| import torch | |
| import numpy as np | |
| import emoji | |
| import streamlit as st | |
| from datetime import datetime | |
| from transformers import AutoTokenizer | |
| from huggingface_hub import hf_hub_download, CommitScheduler | |
| from inference import load_unsupervised_encoder, classify_text | |
| BASE_DIR = Path(__file__).parent | |
| MODEL_REPO = "timagonch/algospeak-classifier-model" | |
| LOG_REPO = "timagonch/algospeak-logs" | |
| LOG_DIR = BASE_DIR / "logs" | |
| LOG_FILE = LOG_DIR / "predictions.csv" | |
| LOG_COLS = ["text", "predicted_label", "score_allowed", "score_obscene", "score_mature", "score_algospeak", "timestamp"] | |
| CLASS_COLORS = { | |
| "Allowed": "green", | |
| "Obscene Language": "red", | |
| "Mature Content": "orange", | |
| "Algospeak": "violet", | |
| } | |
| ABOUT_MD = """ | |
| ## Algospeak Classifier β Project Overview | |
| This tool is the result of a semester-long research project exploring **algospeak detection** as part of a content moderation pipeline for social media. The goal was to classify posts not just by whether they contain harmful content, but by *how* that content is expressed β including coded language specifically designed to evade automated filters. | |
| --- | |
| ### What is Algospeak? | |
| Algospeak is a form of linguistic camouflage that emerged organically on platforms like TikTok, Bluesky, and Twitter/X. When users learn that certain words trigger automated takedowns, they develop workarounds β substitutions that carry the same meaning but bypass keyword filters: | |
| - **"unalive"** instead of suicide or self-harm | |
| - **"corn"** for explicit sexual content | |
| - **"k!ll", "k1ll", "k.i.l.l"** for violence | |
| - Phonetic swaps (e.g. "seggs"), emoji substitutions, abbreviations, repurposed innocent words | |
| The challenge is that these substitutions evolve constantly, vary by community, and are nearly impossible to keep up with using hand-crafted rules. The only durable solution is a model that understands *intent* from context. | |
| --- | |
| ### Architecture | |
| The model is a **Dual BERTweet** network β two separate BERTweet encoders (vinai/bertweet-base, 270M parameters each) trained jointly with a contrastive learning objective called Supervised InfoNCE: | |
| - **Supervised encoder** β receives label-prefixed text during training (e.g. `"Algospeak: gonna unalive myself"`). Acts as a teacher by injecting class identity directly into the text. | |
| - **Unsupervised encoder** β receives raw text only, and is trained to match the supervised encoder's embedding space via the InfoNCE loss. | |
| After training, the supervised encoder is discarded entirely. At inference, the unsupervised encoder embeds an incoming post and compares it via cosine similarity against four **class prototypes** β the average embedding per class computed from the training set. The nearest prototype wins. The algospeak prototype uses inverse deny-term frequency weighting so rarer coded forms aren't drowned out by common ones. | |
| This approach was chosen specifically because it requires no rulesets, no exemplar lookup, and no deny list at inference time β just a single forward pass and a dot product. | |
| --- | |
| ### Data Collection & Manual Reclassification | |
| The dataset was built from Bluesky social media posts collected by the team. Raw posts came in with initial labels, but those labels were noisy β so a careful manual re-review pass was done across the dataset. | |
| To improve consistency on the class 1 and 2 boundary, **two deny lists** were built: | |
| - `deny_list_class1.txt` β 115 terms covering slurs and hate speech | |
| - `deny_list_class2.txt` β 521 terms covering explicit sexual content, drugs, and violence | |
| A reclassification script applied deny-list hit logic: if a post contained a term from a list and had been labeled in the wrong class, it was overridden. This pass changed ~25,000 labels across the dataset, producing a cleaner `reclassified_final.csv` as the new source of truth. | |
| --- | |
| ### Synthetic Algospeak Generation | |
| Class 3 (Algospeak) was by far the hardest class to collect naturally. Real algospeak examples are sparse and inconsistently labeled. To address this, a **GPT-4-turbo generation pipeline** was built that takes class 1 and 2 posts and transforms them into algospeak equivalents. | |
| The pipeline used a 7-technique taxonomy grounded in documented community behavior: | |
| character substitution, phonetic swaps, pictorial (emoji), abbreviation, repurposing of innocent words, paraphrase, and known community-specific terms. Each term was assigned a technique only if there was a documented example in a hints file β preventing the model from hallucinating plausible-but-wrong substitutions. A deny-term inflection detector ensured that forms like "stabbing" (not just "stab") were correctly passed to the generator. | |
| This produced **13,264 algospeak pairs** (original + transformed), with the original post always kept in the same split as its algospeak counterpart to prevent leakage. | |
| --- | |
| ### Training Progression | |
| The model went through several iterations as the dataset and architecture evolved: | |
| **~10k/class β first dual BERTweet run (Apr 6)** | |
| The 414-rule exemplar system was abandoned and replaced with the dual BERTweet architecture. The first full run used ~10,000 posts per class from the cleaned dataset, with a simple random split. Result: **test accuracy 79.9%**. | |
| **~13k/class β group-aware split added (Apr 12)** | |
| The dataset grew to ~13,300 posts per class using the full synthetic pairs. Critically, a **group-aware split** was introduced: original posts and their algospeak counterparts are always assigned to the same split. Without this, the model can train on a post and be evaluated on a near-identical transformed version β inflating results. With it: **test accuracy 85.9%**. | |
| **~13k/class β weighted prototype + fix (Apr 13)** | |
| The algospeak class prototype was upgraded to use inverse deny-term frequency weighting, giving rarer substitution forms more influence on the prototype center. A data loader fix was also applied. Result: **test accuracy 89.4%** β the best result on the full dataset. | |
| **LLM audit & reclassification (Apr 16)** | |
| A GPT-4o-mini audit reclassified ~39,000 posts from the existing splits. The LLM had stricter criteria for class 2 (Mature Content), which collapsed many borderline posts into class 0. This reduced class 2 to ~3,300 posts β a sharp drop from 13k β and the new splits had to be rebalanced much smaller. Result: **test accuracy 76.5%**. The bottleneck had shifted to class 2. | |
| **3-class experiment (Apr 16)** | |
| As a parallel track, classes 1 and 2 were merged into a single "Harmful Content" class, reducing the problem to 3 classes. With fewer boundaries to learn, the model performed strongly: **test accuracy 89.2%, Algospeak F1 = 93.8%**. This confirmed the architecture works well β the difficulty is class 1 vs 2 separation. | |
| --- | |
| ### Four-Class Controlled Experiment (This Model) | |
| With the full dataset constrained by class 2 data scarcity, a focused experiment was run using a cleaner, smaller, more carefully curated subset of ~874 posts per class. The synthetic generation pipeline was rerun with stricter controls, producing 429 new algospeak examples. Two deny lists were merged into a single experiment-local list to avoid cross-contamination between class 1 and 2 deny terms. | |
| #### Temperature Ablation | |
| Temperature (Ο) controls the sharpness of the contrastive loss gradient. Lower Ο forces tighter clusters β which risks overfitting on small datasets. Higher Ο acts as regularization. Four runs were compared: | |
| | Run | Ο | Test Acc | Macro F1 | Algospeak F1 | Mean AUC | | |
| |-----|------|----------|----------|--------------|----------| | |
| | 1 | 0.10 | 0.7918 | 0.7957 | 0.9032 | 0.9452 | | |
| | 2 | 0.07 | 0.7214 | 0.7256 | 0.8138 | 0.8979 | | |
| | **3 β** | **0.15** | **0.8065** | **0.8083** | **0.9045** | 0.9351 | | |
| | 4 | 0.20 | 0.8240 | 0.8252 | 0.9161 | 0.9345 | | |
| Run 4 (Ο=0.20) had the best aggregate numbers β but misclassified *"gonna unalive myself fr fr cant take this anymore"* as **Allowed**. That is one of the most well-documented suicide-related algospeak phrases in existence. A false negative on a phrase like that represents a worse failure than a 1.7% drop in overall accuracy, so **Ο=0.15 was chosen as the final model**. | |
| --- | |
| ### Final Model β Ο = 0.15 | |
| | Metric | Val | Test | | |
| |---|---|---| | |
| | Accuracy | 0.8642 | 0.8065 | | |
| | Macro F1 | 0.8648 | 0.8083 | | |
| | Mean AUC | 0.9600 | 0.9351 | | |
| **Per-class test performance:** | |
| | Class | Precision | Recall | F1 | | |
| |---|---|---|---| | |
| | Allowed | 0.8065 | 0.8621 | 0.8333 | | |
| | Obscene Language | 0.7363 | 0.7701 | 0.7528 | | |
| | Mature Content | 0.7750 | 0.7126 | 0.7425 | | |
| | Algospeak | 0.9221 | 0.8875 | **0.9045** | | |
| Algospeak is the strongest class β which is the point. The remaining error is concentrated at the Obscene Language / Mature Content boundary, where surface vocabulary overlaps significantly (words like "rape" or "shoot" appear in both) and only broader context separates them. | |
| --- | |
| *Built with BERTweet (VinAI), PyTorch, and Streamlit. Spring 2026.* | |
| """ | |
| def load_model(): | |
| with open(BASE_DIR / "poc" / "config.yaml") as f: | |
| cfg = yaml.safe_load(f) | |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") | |
| checkpoint_path = hf_hub_download(repo_id=MODEL_REPO, filename="best_model.pt") | |
| prototypes_path = hf_hub_download(repo_id=MODEL_REPO, filename="prototypes.npy") | |
| encoder = load_unsupervised_encoder(checkpoint_path, cfg, device) | |
| prototypes = np.load(prototypes_path) | |
| tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"], use_fast=False) | |
| return encoder, prototypes, tokenizer, cfg, device | |
| def get_scheduler(): | |
| import shutil | |
| LOG_DIR.mkdir(exist_ok=True) | |
| try: | |
| existing = hf_hub_download( | |
| repo_id=LOG_REPO, | |
| filename="logs/predictions.csv", | |
| repo_type="dataset", | |
| ) | |
| shutil.copy(existing, LOG_FILE) | |
| except Exception: | |
| pass | |
| return CommitScheduler( | |
| repo_id=LOG_REPO, | |
| repo_type="dataset", | |
| folder_path=LOG_DIR, | |
| path_in_repo="logs", | |
| every=5, | |
| ) | |
| def log_prediction(text, result): | |
| scheduler = get_scheduler() | |
| scores = result["scores"] | |
| row = { | |
| "text": text, | |
| "predicted_label": result["predicted_label"], | |
| "score_allowed": round(scores["Allowed"], 4), | |
| "score_obscene": round(scores["Obscene Language"], 4), | |
| "score_mature": round(scores["Mature Content"], 4), | |
| "score_algospeak": round(scores["Algospeak"], 4), | |
| "timestamp": datetime.utcnow().isoformat(), | |
| } | |
| with scheduler.lock: | |
| write_header = not LOG_FILE.exists() | |
| with open(LOG_FILE, "a", newline="", encoding="utf-8") as f: | |
| writer = csv.DictWriter(f, fieldnames=LOG_COLS) | |
| if write_header: | |
| writer.writeheader() | |
| writer.writerow(row) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # CSS β makes the easter egg popover button invisible until hovered | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown(""" | |
| <style> | |
| .easter-egg-col div[data-testid="stPopover"] button { | |
| opacity: 0.15; | |
| transition: opacity 0.3s ease; | |
| font-size: 28px; | |
| background: transparent; | |
| border: none; | |
| padding: 0; | |
| line-height: 1; | |
| } | |
| .easter-egg-col div[data-testid="stPopover"] button:hover { | |
| opacity: 0.85; | |
| } | |
| .easter-egg-col div[data-testid="stPopover"] button p { | |
| font-size: 28px !important; | |
| } | |
| </style> | |
| """, unsafe_allow_html=True) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Header row β title left, easter egg right | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| title_col, egg_col = st.columns([11, 1]) | |
| with title_col: | |
| st.title("Algospeak Classifier") | |
| st.caption("Dual BERTweet model Β· type a social media post to classify it.") | |
| with egg_col: | |
| st.markdown('<div class="easter-egg-col">', unsafe_allow_html=True) | |
| with st.popover("π¬"): | |
| st.markdown(ABOUT_MD) | |
| st.markdown('</div>', unsafe_allow_html=True) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Main UI | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| text = st.text_area("Post text", height=120, placeholder="Type something here...") | |
| if st.button("Classify", type="primary") and text.strip(): | |
| encoder, prototypes, tokenizer, cfg, device = load_model() | |
| result = classify_text(text, encoder, prototypes, tokenizer, cfg["max_length"], device, cfg["temperature"]) | |
| label = result["predicted_label"] | |
| color = CLASS_COLORS[label] | |
| st.markdown(f"## :{color}[{label}]") | |
| st.divider() | |
| st.write("**Similarity scores:**") | |
| for name, score in sorted(result["scores"].items(), key=lambda x: -x[1]): | |
| st.progress(float(score), text=name) | |
| log_prediction(text, result) | |