# AlgoShield: Cross-Platform Algospeak & Toxicity Detection
Fine-tuned DistilBERT for robust detection of evasive toxic language across decentralized social media platforms. Trained on Reddit + Koo and evaluated on entirely unseen platforms (Bluesky + Voat), achieving a +107% improvement in recall over the untuned baseline.
## Model Description
Standard toxicity classifiers fail when users exploit **Algospeak**: intentional obfuscation techniques designed to evade automated moderation:

- **Leet encoding**: replacing letters with numbers/symbols (e.g., `n1gg3r`)
- **Phonetic distortion**: stretched spellings (e.g., `gheyyy`, `wh*re`)
- **Statistical framing**: disguising hate as factual claims
- **Implicit toxicity**: hostile intent with no surface profanity
AlgoShield addresses this through domain-adaptive fine-tuning of `martin-ha/toxic-comment-model` using a **Toxicity-Balanced Stratified Sampling** strategy, ensuring uniform coverage across 10 fine-grained toxicity intensity bins and 2 training platforms.
## Performance
**Out-of-Domain Test Set** (Bluesky + Voat; 98,455 samples, never seen during training)
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline (`martin-ha/toxic-comment-model`) | 59.0% | 70.3% | 33.2% | 45.1% |
| **AlgoShield (ours)** | 62.8% | 61.2% | **73.2%** | **66.7%** |
In-domain validation (Reddit + Koo): Acc = 67.5%, Prec = 64.8%, Rec = 76.6%, F1 = 70.2%
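For reference, the four metrics reported above can be computed from binary predictions as follows. This is a minimal sketch with toy labels, not the evaluation code used for the model; the positive class is toxic (label 1).

```python
# Minimal sketch of the four reported metrics (1 = toxic, 0 = non-toxic).
def toxicity_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, not the MADOC evaluation data
print(toxicity_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```

The recall-heavy profile of AlgoShield (73.2% vs. the baseline's 33.2%) comes at a precision cost, as the table shows.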
### Ablation: Effect of Length Balancing
| Sampling Strategy | Val F1 | Test F1 | Val–Test Gap |
|---|---|---|---|
| Tox-balanced only (Exp 1, this model) | 70.2% | 66.7% | 3.5 pt |
| Tox + Length balanced (Exp 2) | 66.3% | 66.0% | 0.3 pt |
## Training Data
**Dataset:** MADOC (Multi-Platform Aggregated Dataset of Online Communities)
| Split | Platforms | Samples | Role |
|---|---|---|---|
| Train | Reddit + Koo | 90,000 | In-domain training |
| Validation | Reddit + Koo | 10,000 | In-domain evaluation |
| Test | Bluesky + Voat | 98,455 | Out-of-domain evaluation |
### Sampling Strategy: Toxicity-Balanced Stratified Sampling
Raw social media data is heavily skewed toward benign content. A naive sample would give the model almost no exposure to high-toxicity posts. To fix this:
- Toxicity scores (0.0β1.0) are discretized into 10 equal-width bins
- An equal number of samples is drawn from each bin Γ each platform
- This ensures the model sees the full spectrum of toxicity intensity, from borderline posts (bins 1–2) to extreme content (bins 9–10)
```
10 bins × 2 platforms × 4,500 train samples = 90,000 train
10 bins × 2 platforms ×   500 val samples   = 10,000 validation
```

Both splits are balanced 50/50 toxic vs. non-toxic.
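The sampling scheme above can be sketched as follows. The field names `toxicity` and `platform` are illustrative placeholders, not the actual MADOC schema:

```python
import random
from collections import defaultdict

def toxicity_balanced_sample(rows, n_per_cell, n_bins=10, seed=42):
    """Draw up to n_per_cell rows from each (toxicity bin x platform) cell.

    rows: iterable of dicts with a float 'toxicity' in [0, 1] and a
    'platform' key (field names are illustrative).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row in rows:
        # Discretize into equal-width bins; a score of 1.0 lands in the top bin.
        b = min(int(row["toxicity"] * n_bins), n_bins - 1)
        cells[(b, row["platform"])].append(row)
    sample = []
    for cell_rows in cells.values():
        sample.extend(rng.sample(cell_rows, min(n_per_cell, len(cell_rows))))
    return sample
```

With 10 bins, 2 platforms, and 4,500 samples per cell, this yields the 90,000-row training split described above.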
### Platform Characteristics
| Platform | Type | Moderation | Toxicity Profile |
|---|---|---|---|
| Reddit | Forum | Moderate | Diverse, community-dependent |
| Koo | Microblog | Moderate | Mixed, multilingual |
| Bluesky | Microblog | Minimal | Short posts, decentralized |
| Voat | Forum | None | High toxicity, explicit hate speech |
Note: Bluesky and Voat were never seen during training; they serve purely as out-of-domain test platforms.
### Preprocessing
- URL removal
- Emoji stripping
- Minimum length filtering (≥ 10 characters)
- Tokenization: DistilBERT WordPiece, max length 512 tokens
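A sketch of the text-cleaning steps before tokenization. The regexes here are illustrative stand-ins, not the project's actual patterns; the emoji character ranges in particular are a rough subset:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges for illustration; a production pipeline would use a fuller set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]")

def preprocess(text, min_len=10):
    """URL removal, emoji stripping, then minimum-length filtering."""
    text = URL_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Posts shorter than min_len characters are dropped
    return text if len(text) >= min_len else None
```

Surviving texts are then tokenized with the DistilBERT WordPiece tokenizer (max 512 tokens).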
## Training Details
| Parameter | Value |
|---|---|
| Base model | `martin-ha/toxic-comment-model` |
| Architecture | DistilBERT (6 layers, 768 hidden dim, 12 heads) |
| Training samples | 90,000 (toxicity-balanced) |
| Learning rate | 2e-5 with linear warmup (ratio = 0.06) |
| Batch size | 8 per GPU × gradient accumulation 2 = effective 16 |
| Max epochs | 10 with early stopping (patience = 3) |
| Best checkpoint | Epoch 4 (checkpoint-22500) |
| Training stopped | Epoch 7 (no improvement for 3 consecutive epochs) |
| Training time | ~4.8 hours on GPU |
| Seed | 42 |
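The hyperparameters in the table map onto Hugging Face `TrainingArguments` roughly as follows. This is a config sketch, not the project's training script: `output_dir` is a placeholder, and the `Trainer`/dataset wiring (tokenized MADOC splits, per-epoch evaluation, best-checkpoint loading) is omitted.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters from the table above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="algoshield",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size 16
    num_train_epochs=10,
    seed=42,
)

# Early stopping with patience 3, as in the table; requires per-epoch
# evaluation and best-model loading when wired into a Trainer.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```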
## Qualitative Analysis Highlights
Fine-tuning resolved 93 false negatives that the baseline missed entirely (columns show toxicity scores):
| Type | Example (Abridged) | Baseline | AlgoShield |
|---|---|---|---|
| Algospeak (phonetic) | "...pedo daycare... gheyyy" | 0.17 | 0.95 |
| Leet-encoded slur | "pass as a n[---]er" | 0.09 | 0.89 |
| Body-shaming framing | "Fat Americans are liars..." | 0.27 | 0.89 |
| Predatory content | "children are the sex toy payments..." | 0.06 | 0.84 |
| Implicit threat | "deserves curbstomping" | 0.09 | 0.71 |
| Implicit attack | "insufferable douchebag? Drive Rivian" | 0.01 | 0.73 |
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="odeliyach/AlgoShield-Algospeak-Detection",
)

# Single text
result = classifier("This is a test sentence")
print(result)
# [{'label': 'toxic', 'score': 0.87}]

# Batch inference
texts = [
    "You need to pass as a n[---]er",
    "Have a great day!",
    "deserves curbstomping",
]
results = classifier(texts)
```
## Limitations
- **Emoji-encoded Algospeak**: emojis were stripped during preprocessing; the model may not detect emoji-based evasion patterns
- **Temporal drift**: Algospeak evolves rapidly; performance may degrade on newly coined evasion terms not present in MADOC
- **Platform bias**: trained on Reddit/Koo norms; may require further fine-tuning for platforms with very different linguistic conventions
- **Precision trade-off**: the model is optimized for recall (catching toxic content); expect more false positives than the baseline
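If false positives are costly in your deployment, the recall-oriented default can be traded back toward precision by raising the decision threshold on the toxic-class score. A minimal sketch; the 0.7 threshold is illustrative, not a tuned operating point:

```python
def is_toxic(pred, threshold=0.7):
    """pred: one pipeline output dict, e.g. {'label': 'toxic', 'score': 0.87}.

    Raising `threshold` above the default 0.5 decision point trades
    recall back for precision. 0.7 here is illustrative, not tuned.
    """
    return pred["label"] == "toxic" and pred["score"] >= threshold

print(is_toxic({"label": "toxic", "score": 0.87}))  # True
print(is_toxic({"label": "toxic", "score": 0.55}))  # False: filtered out at 0.7
```

A threshold sweep on a labeled validation sample from your own platform is the simplest way to pick an operating point.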
## Citation
```bibtex
@misc{algoshield2026,
  title  = {AlgoShield: Cross-Platform Algospeak Detection via Domain-Adapted DistilBERT},
  author = {Charitonova, Odeliya and Loshevsky, Alin and Pernik, Lior},
  year   = {2026},
  note   = {NLP Final Project, Tel Aviv University},
  url    = {https://github.com/odeliyach/AlgoShield-Algospeak-Detection}
}
```
## Links
| Resource | Location |
|---|---|
| Code & Results | GitHub: odeliyach/AlgoShield-Algospeak-Detection |
| MADOC Dataset | Zenodo |
| Base Model | martin-ha/toxic-comment-model |
| Full Paper | (link TBD after submission) |