πŸ›‘οΈ AlgoShield : Cross-Platform Algospeak & Toxicity Detection

License: MIT · Python 3.9+

Fine-tuned DistilBERT for robust detection of evasive toxic language across decentralized social media platforms. Trained on Reddit + Koo and evaluated on entirely unseen platforms (Bluesky + Voat), achieving a +107% improvement in Recall over the untuned baseline.


🧠 Model Description

Standard toxicity classifiers fail when users exploit Algospeak, intentional obfuscation techniques designed to evade automated moderation:

  • πŸ”€ Leet encoding: replacing letters with numbers/symbols (e.g., n1gg3r)
  • πŸ”Š Phonetic distortion: stretched or altered spellings (e.g., gheyyy, wh*re)
  • πŸ“Š Statistical framing: disguising hate as factual claims
  • 😢 Implicit toxicity: hostile intent with no surface profanity
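The first two evasion patterns can be illustrated with a toy rule-based normalizer. This is a sketch for intuition only: the character map and helpers below are assumptions, not part of AlgoShield, which learns such patterns end-to-end from data rather than from hand-written rules.

```python
import re

# Illustrative leet-speak character map (an assumption, not the model's logic).
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize_leet(text: str) -> str:
    """Map common leet characters back to letters."""
    return text.lower().translate(LEET_MAP)

def collapse_stretching(text: str) -> str:
    """Collapse runs of 3+ repeated characters down to 2 (e.g. 'gheyyy' -> 'gheyy')."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)
```

Rule-based normalization like this is trivial to evade with new substitutions, which is precisely why a learned classifier is needed.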

AlgoShield addresses this through domain-adaptive fine-tuning of martin-ha/toxic-comment-model, using a Toxicity-Balanced Stratified Sampling strategy that ensures uniform coverage across 10 fine-grained toxicity intensity bins and 2 training platforms.


πŸ“Š Performance

Out-of-Domain Test Set (Bluesky + Voat, 98,455 samples never seen during training)

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline (martin-ha/toxic-comment-model) | 59.0% | 70.3% | 33.2% | 45.1% |
| AlgoShield (ours) | 62.8% | 61.2% | 73.2% | 66.7% |

πŸ“Œ In-domain validation (Reddit + Koo): Acc=67.5%, Prec=64.8%, Rec=76.6%, F1=70.2%

Ablation β€” Effect of Length Balancing

| Sampling Strategy | Val F1 | Test F1 | Valβ†’Test Gap |
|---|---|---|---|
| βœ… Tox-balanced only (Exp 1, this model) | 70.2% | 66.7% | 3.5 pt |
| Tox + Length balanced (Exp 2) | 66.3% | 66.0% | 0.3 pt |

πŸ—‚οΈ Training Data

Dataset: MADOC (Multi-Platform Aggregated Dataset of Online Communities)

| Split | Platforms | Samples | Role |
|---|---|---|---|
| Train | Reddit + Koo | 90,000 | In-domain training |
| Validation | Reddit + Koo | 10,000 | In-domain evaluation |
| Test | Bluesky + Voat | 98,455 | ⚠️ Out-of-domain evaluation |

Sampling Strategy: Toxicity-Balanced Stratified Sampling

Raw social media data is heavily skewed toward benign content. A naive sample would give the model almost no exposure to high-toxicity posts. To fix this:

  1. Toxicity scores (0.0–1.0) are discretized into 10 equal-width bins
  2. An equal number of samples is drawn from each bin Γ— platform combination
  3. This ensures the model sees the full spectrum of toxicity intensity, from borderline posts (bins 1–2) to extreme content (bins 9–10)
10 bins Γ— 2 platforms Γ— 4,500 train samples = 90,000 train
10 bins Γ— 2 platforms Γ— 500 val samples     = 10,000 validation
Both splits: 50/50 toxic vs. non-toxic
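The procedure above can be sketched in a few lines of pandas. This is a minimal sketch assuming a DataFrame with `toxicity` and `platform` columns; the column names and the helper itself are illustrative, not the project's actual code.

```python
import pandas as pd

def toxicity_balanced_sample(df: pd.DataFrame, per_cell: int,
                             n_bins: int = 10, seed: int = 42) -> pd.DataFrame:
    """Draw `per_cell` rows from each (platform x toxicity-bin) cell."""
    # Step 1: discretize toxicity scores into equal-width bins (0..n_bins-1).
    binned = df.assign(
        tox_bin=pd.cut(df["toxicity"], bins=n_bins, labels=False, include_lowest=True)
    )
    # Step 2: equal-sized draw from every platform x bin cell.
    return (binned.groupby(["platform", "tox_bin"])
                  .sample(n=per_cell, random_state=seed)
                  .reset_index(drop=True))
```

With `per_cell=4,500` over 10 bins Γ— 2 platforms this reproduces the 90,000-sample training split, and `per_cell=500` the 10,000-sample validation split.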

Platform Characteristics

| Platform | Type | Moderation | Toxicity Profile |
|---|---|---|---|
| 🟠 Reddit | Forum | Moderate | Diverse, community-dependent |
| πŸ”΅ Koo | Microblog | Moderate | Mixed, multilingual |
| 🌊 Bluesky | Microblog | Minimal | Short posts, decentralized |
| ⚫ Voat | Forum | None | High toxicity, explicit hate speech |

⚠️ Bluesky and Voat were never seen during training; they serve purely as out-of-domain test platforms.

Preprocessing

  • URL removal
  • Emoji stripping
  • Minimum length filtering (β‰₯ 10 characters)
  • Tokenization: DistilBERT WordPiece, max length 512 tokens

βš™οΈ Training Details

| Parameter | Value |
|---|---|
| πŸ€— Base model | martin-ha/toxic-comment-model |
| πŸ—οΈ Architecture | DistilBERT (6 layers, 768 hidden dim, 12 heads) |
| πŸ“š Training samples | 90,000 (toxicity-balanced) |
| πŸ“ Learning rate | 2e-5 with linear warmup (ratio = 0.06) |
| πŸ“¦ Batch size | 8 per GPU Γ— gradient accumulation 2 = effective 16 |
| ⏱️ Max epochs | 10 with early stopping (patience = 3) |
| πŸ† Best checkpoint | Epoch 4 (checkpoint-22500) |
| πŸ›‘ Training stopped | Epoch 7 (no improvement for 3 consecutive epochs) |
| ⏳ Training time | ~4.8 hours on GPU |
| 🌱 Seed | 42 |
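The hyperparameters above map onto a standard πŸ€— Trainer setup roughly as follows. This is a configuration sketch, not the project's actual training script: dataset preparation is omitted, `train_ds`/`val_ds` stand in for tokenized datasets built elsewhere, and a `compute_metrics` function returning an `"f1"` key is assumed.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "martin-ha/toxic-comment-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="algoshield",
    learning_rate=2e-5,
    warmup_ratio=0.06,               # linear warmup over 6% of steps
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    num_train_epochs=10,
    eval_strategy="epoch",           # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # needs a compute_metrics fn returning "f1"
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # placeholder: tokenized training split
    eval_dataset=val_ds,             # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```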

πŸ” Qualitative Analysis Highlights

Fine-tuning resolved 93 false negatives that the baseline missed entirely:

| 🏷️ Type | Example (abridged) | Baseline | AlgoShield |
|---|---|---|---|
| Algospeak (phonetic) | "...pedo daycare... gheyyy" | 0.17 | 0.95 |
| Leet-encoded slur | "pass as a n[---]er" | 0.09 | 0.89 |
| Body-shaming framing | "Fat Americans are liars..." | 0.27 | 0.89 |
| Predatory content | "children are the sex toy payments..." | 0.06 | 0.84 |
| Implicit threat | "deserves curbstomping" | 0.09 | 0.71 |
| Implicit attack | "insufferable douchebag? Drive Rivian" | 0.01 | 0.73 |

πŸ’» Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="odeliyach/AlgoShield-Algospeak-Detection"
)

# Single text
result = classifier("This is a test sentence")
print(result)
# [{'label': 'toxic', 'score': 0.87}]

# Batch inference
texts = [
    "You need to pass as a n[---]er",
    "Have a great day!",
    "deserves curbstomping"
]
results = classifier(texts)
```
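In a moderation pipeline you typically want binary decisions rather than raw scores. A small hypothetical helper for thresholding batch predictions (the `'toxic'` label string follows the example output above; verify it against the model's `config.id2label` before relying on it):

```python
def flag_toxic(classifier, texts, threshold=0.5):
    """Return the texts labeled toxic with at least `threshold` confidence."""
    predictions = classifier(texts)
    return [text for text, pred in zip(texts, predictions)
            if pred["label"] == "toxic" and pred["score"] >= threshold]
```

Raising the threshold trades recall for precision, which can partially offset the recall-oriented tuning noted in the limitations below.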

⚠️ Limitations

  • Emoji-encoded Algospeak: emojis were stripped during preprocessing, so the model may not detect emoji-based evasion patterns
  • Temporal drift: Algospeak evolves rapidly; performance may degrade on newly coined evasion terms not present in MADOC
  • Platform bias: trained on Reddit/Koo norms; may require further fine-tuning for platforms with very different linguistic conventions
  • Precision trade-off: the model is optimized for Recall (catching toxic content); expect more false positives than the baseline

πŸ“Ž Citation

@misc{algoshield2026,
  title   = {AlgoShield: Cross-Platform Algospeak Detection via Domain-Adapted DistilBERT},
  author  = {Charitonova, Odeliya and Loshevsky, Alin and Pernik, Lior},
  year    = {2026},
  note    = {NLP Final Project, Tel Aviv University},
  url     = {https://github.com/odeliyach/AlgoShield-Algospeak-Detection}
}

πŸ”— Links

  • πŸ“¦ Code & Results: GitHub (odeliyach/AlgoShield-Algospeak-Detection)
  • πŸ—ƒοΈ MADOC Dataset: Zenodo
  • πŸ€— Base Model: martin-ha/toxic-comment-model
  • πŸ“„ Full Paper: (link TBD after submission)