# AlgoShield: Cross-Platform Algospeak & Toxicity Detection
Fine-tuned DistilBERT for robust detection of evasive toxic language across decentralized social media platforms. Trained on Reddit + Koo and evaluated on entirely unseen platforms (Bluesky + Voat), achieving a +107% improvement in recall over the untuned baseline.
## Model Description
Standard toxicity classifiers fail when users exploit **Algospeak**: intentional obfuscation techniques designed to evade automated moderation:

- **Leet encoding**: replacing letters with numbers/symbols (e.g., `n1gg3r`)
- **Phonetic distortion**: stretched spellings (e.g., `gheyyy`, `wh*re`)
- **Statistical framing**: disguising hate as factual claims
- **Implicit toxicity**: hostile intent with no surface profanity
AlgoShield addresses this through domain-adaptive fine-tuning of `martin-ha/toxic-comment-model` using a **Toxicity-Balanced Stratified Sampling** strategy, ensuring uniform coverage across 10 fine-grained toxicity intensity bins and 2 training platforms.
## Performance
**Out-of-Domain Test Set** (Bluesky + Voat; 98,455 samples, never seen during training)
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline (`martin-ha/toxic-comment-model`) | 59.0% | 70.3% | 33.2% | 45.1% |
| **AlgoShield (ours)** | 62.8% | 61.2% | **73.2%** | **66.7%** |
In-domain validation (Reddit + Koo): Acc = 67.5%, Prec = 64.8%, Rec = 76.6%, F1 = 70.2%
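For reference, the four metrics reported above can be computed from binary predictions as follows. This is a minimal sketch with toy labels, not the evaluation code used for the model; the positive class is toxic (label 1).

```python
# Minimal sketch of the four reported metrics (1 = toxic, 0 = non-toxic).
def toxicity_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels, not the MADOC evaluation data
print(toxicity_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```

The recall-heavy profile of AlgoShield (73.2% vs. the baseline's 33.2%) comes at a precision cost, as the table shows.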
### Ablation: Effect of Length Balancing
| Sampling Strategy | Val F1 | Test F1 | Val–Test Gap |
|---|---|---|---|
| Tox-balanced only (Exp 1, this model) | 70.2% | 66.7% | 3.5 pt |
| Tox + Length balanced (Exp 2) | 66.3% | 66.0% | 0.3 pt |
## Training Data
**Dataset:** MADOC (Multi-Platform Aggregated Dataset of Online Communities)
| Split | Platforms | Samples | Role |
|---|---|---|---|
| Train | Reddit + Koo | 90,000 | In-domain training |
| Validation | Reddit + Koo | 10,000 | In-domain evaluation |
| Test | Bluesky + Voat | 98,455 | Out-of-domain evaluation |
### Sampling Strategy: Toxicity-Balanced Stratified Sampling
Raw social media data is heavily skewed toward benign content. A naive sample would give the model almost no exposure to high-toxicity posts. To fix this:
- Toxicity scores (0.0β1.0) are discretized into 10 equal-width bins
- An equal number of samples is drawn from each bin Γ each platform
- This ensures the model sees the full spectrum of toxicity intensity, from borderline posts (bins 1–2) to extreme content (bins 9–10)
```
10 bins × 2 platforms × 4,500 train samples = 90,000 train
10 bins × 2 platforms ×   500 val samples   = 10,000 validation
```

Both splits are balanced 50/50 toxic vs. non-toxic.
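The sampling scheme above can be sketched as follows. The field names `toxicity` and `platform` are illustrative placeholders, not the actual MADOC schema:

```python
import random
from collections import defaultdict

def toxicity_balanced_sample(rows, n_per_cell, n_bins=10, seed=42):
    """Draw up to n_per_cell rows from each (toxicity bin x platform) cell.

    rows: iterable of dicts with a float 'toxicity' in [0, 1] and a
    'platform' key (field names are illustrative).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row in rows:
        # Discretize into equal-width bins; a score of 1.0 lands in the top bin.
        b = min(int(row["toxicity"] * n_bins), n_bins - 1)
        cells[(b, row["platform"])].append(row)
    sample = []
    for cell_rows in cells.values():
        sample.extend(rng.sample(cell_rows, min(n_per_cell, len(cell_rows))))
    return sample
```

With 10 bins, 2 platforms, and 4,500 samples per cell, this yields the 90,000-row training split described above.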
### Platform Characteristics
| Platform | Type | Moderation | Toxicity Profile |
|---|---|---|---|
| Reddit | Forum | Moderate | Diverse, community-dependent |
| Koo | Microblog | Moderate | Mixed, multilingual |
| Bluesky | Microblog | Minimal | Short posts, decentralized |
| Voat | Forum | None | High toxicity, explicit hate speech |
Note: Bluesky and Voat were never seen during training; they serve purely as out-of-domain test platforms.
### Preprocessing
- URL removal
- Emoji stripping
- Minimum length filtering (≥ 10 characters)
- Tokenization: DistilBERT WordPiece, max length 512 tokens
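A sketch of the text-cleaning steps before tokenization. The regexes here are illustrative stand-ins, not the project's actual patterns; the emoji character ranges in particular are a rough subset:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges for illustration; a production pipeline would use a fuller set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]")

def preprocess(text, min_len=10):
    """URL removal, emoji stripping, then minimum-length filtering."""
    text = URL_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Posts shorter than min_len characters are dropped
    return text if len(text) >= min_len else None
```

Surviving texts are then tokenized with the DistilBERT WordPiece tokenizer (max 512 tokens).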
## Training Details
| Parameter | Value |
|---|---|
| Base model | `martin-ha/toxic-comment-model` |
| Architecture | DistilBERT (6 layers, 768 hidden dim, 12 heads) |
| Training samples | 90,000 (toxicity-balanced) |
| Learning rate | 2e-5 with linear warmup (ratio = 0.06) |
| Batch size | 8 per GPU × gradient accumulation 2 = effective 16 |
| Max epochs | 10 with early stopping (patience = 3) |
| Best checkpoint | Epoch 4 (checkpoint-22500) |
| Training stopped | Epoch 7 (no improvement for 3 consecutive epochs) |
| Training time | ~4.8 hours on GPU |
| Seed | 42 |
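The hyperparameters in the table map onto Hugging Face `TrainingArguments` roughly as follows. This is a config sketch, not the project's training script: `output_dir` is a placeholder, and the `Trainer`/dataset wiring (tokenized MADOC splits, per-epoch evaluation, best-checkpoint loading) is omitted.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters from the table above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="algoshield",
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size 16
    num_train_epochs=10,
    seed=42,
)

# Early stopping with patience 3, as in the table; requires per-epoch
# evaluation and best-model loading when wired into a Trainer.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```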
## Qualitative Analysis Highlights
Fine-tuning resolved 93 false negatives that the baseline missed entirely (columns show toxicity scores):
| Type | Example (Abridged) | Baseline | AlgoShield |
|---|---|---|---|
| Algospeak (phonetic) | "...pedo daycare... gheyyy" | 0.17 | 0.95 |
| Leet-encoded slur | "pass as a n[---]er" | 0.09 | 0.89 |
| Body-shaming framing | "Fat Americans are liars..." | 0.27 | 0.89 |
| Predatory content | "children are the sex toy payments..." | 0.06 | 0.84 |
| Implicit threat | "deserves curbstomping" | 0.09 | 0.71 |
| Implicit attack | "insufferable douchebag? Drive Rivian" | 0.01 | 0.73 |
## Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="odeliyach/AlgoShield-Algospeak-Detection",
)

# Single text
result = classifier("This is a test sentence")
print(result)
# [{'label': 'toxic', 'score': 0.87}]

# Batch inference
texts = [
    "You need to pass as a n[---]er",
    "Have a great day!",
    "deserves curbstomping",
]
results = classifier(texts)
```
## Limitations
- **Emoji-encoded Algospeak**: emojis were stripped during preprocessing; the model may not detect emoji-based evasion patterns
- **Temporal drift**: Algospeak evolves rapidly; performance may degrade on newly coined evasion terms not present in MADOC
- **Platform bias**: trained on Reddit/Koo norms; may require further fine-tuning for platforms with very different linguistic conventions
- **Precision trade-off**: the model is optimized for recall (catching toxic content); expect more false positives than the baseline
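If false positives are costly in your deployment, the recall-oriented default can be traded back toward precision by raising the decision threshold on the toxic-class score. A minimal sketch; the 0.7 threshold is illustrative, not a tuned operating point:

```python
def is_toxic(pred, threshold=0.7):
    """pred: one pipeline output dict, e.g. {'label': 'toxic', 'score': 0.87}.

    Raising `threshold` above the default 0.5 decision point trades
    recall back for precision. 0.7 here is illustrative, not tuned.
    """
    return pred["label"] == "toxic" and pred["score"] >= threshold

print(is_toxic({"label": "toxic", "score": 0.87}))  # True
print(is_toxic({"label": "toxic", "score": 0.55}))  # False: filtered out at 0.7
```

A threshold sweep on a labeled validation sample from your own platform is the simplest way to pick an operating point.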
## Citation
```bibtex
@misc{algoshield2026,
  title  = {AlgoShield: Cross-Platform Algospeak Detection via Domain-Adapted DistilBERT},
  author = {Charitonova, Odeliya and Loshevsky, Alin and Pernik, Lior},
  year   = {2026},
  note   = {NLP Final Project, Tel Aviv University},
  url    = {https://github.com/odeliyach/AlgoShield-Algospeak-Detection}
}
```
## Links
| Resource | Location |
|---|---|
| Code & Results | GitHub: odeliyach/AlgoShield-Algospeak-Detection |
| MADOC Dataset | Zenodo |
| Base Model | martin-ha/toxic-comment-model |
| Full Paper | (link TBD after submission) |