SafeShield AI: Multilingual Hate Speech Detection
Model Description
SafeShield AI is a fine-tuned MuRIL (Multilingual Representations for Indian Languages) model for detecting hate speech and offensive content in Hinglish (Hindi-English code-mixed) text.
Most hate speech classifiers are trained on monolingual English, but a significant volume of online abuse in India occurs in Hinglish, which standard tokenizers and models handle poorly. This model addresses that gap.
Model Architecture
Input Text (Hinglish/English)
↓
MuRIL Tokenizer (vocab: 197k tokens, covers Devanagari + Latin)
↓
MuRIL-base-cased (12 layers, 768 hidden, 12 heads, 237M params)
↓
[CLS] token embedding (768-dim)
↓
Dropout (0.1) + Linear(768 → 2)
↓
Softmax → [P(NOT), P(HOF)]
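The last stage of the stack turns the two logits produced by the linear layer into class probabilities. The following is a minimal, dependency-free sketch of that softmax step only; the logit values are invented for illustration and the `softmax` helper is not taken from the project code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, ordered [NOT, HOF].
logits = [0.3, 1.1]
p_not, p_hof = softmax(logits)
print(f"P(NOT)={p_not:.3f}, P(HOF)={p_hof:.3f}")
```

The two probabilities always sum to 1, so the predicted class is simply whichever one exceeds 0.5.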
Labels
| Label | ID | Meaning |
|---|---|---|
| NOT | 0 | Non-hate, non-offensive content |
| HOF | 1 | Hate speech or Offensive content |
Training Details
Dataset
- Source: Code-Mixed Hinglish Hate Speech Detection Dataset
- Size: 29,533 samples (after cleaning)
- Split: 70% train / 10% val / 20% test
- Class balance: NOT=53.7% / HOF=46.3% (nearly balanced, 1.16x ratio)
- Languages: Hinglish, English, mixed
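A 70/10/20 split that preserves the class balance in every partition can be sketched as below. This assumes a stratified split with a fixed seed, which the card does not state; `stratified_split` and the toy labels are illustrative, not the project's loader.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Toy labels with roughly the reported class balance (53.7% NOT, 46.3% HOF).
labels = [0] * 537 + [1] * 463
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```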
Preprocessing Pipeline
- URL removal
- @mention removal
- Hashtag normalization (#BJP → BJP)
- Emoji demojization (😂 → :face_with_tears_of_joy:)
- Repeated character normalization (sooo → soo)
- Repeated punctuation normalization (!!! → !)
- Lowercasing
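The steps above could be chained roughly as follows. This is a sketch, not the contents of `preprocessor.py`: the regular expressions, the step order, and the final whitespace cleanup are assumptions, and demojization is applied only if the third-party `emoji` package is installed.

```python
import re

try:
    import emoji  # optional third-party package, used only for demojization
except ImportError:
    emoji = None

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URL removal
    text = re.sub(r"@\w+", " ", text)                     # @mention removal
    text = re.sub(r"#(\w+)", r"\1", text)                 # #BJP -> BJP
    if emoji is not None:
        text = emoji.demojize(text)                       # emoji -> :name:
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)     # sooo -> soo (letters only)
    text = re.sub(r"([!?.,])\1+", r"\1", text)            # !!! -> !
    text = text.lower()                                   # lowercasing
    return re.sub(r"\s+", " ", text).strip()              # tidy whitespace (extra step)

print(preprocess("Sooo funny!!! @user check https://t.co/abc #BJP"))
```

Restricting the repeated-character rule to letters keeps numbers such as 1000 intact.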
Hyperparameters
| Parameter | Value |
|---|---|
| Base model | google/muril-base-cased |
| Max sequence length | 128 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Epochs | 4 max (early stopping at epoch 3) |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Scheduler | Linear warmup + decay |
| Loss | CrossEntropyLoss (balanced weights) |
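Two rows of this table can be made concrete with a little arithmetic. The sketch below assumes that "balanced weights" means the common N / (K * n_c) heuristic and that the batch size of 32 is the global batch; neither is confirmed by the card, and both helper functions are illustrative.

```python
import math

def balanced_class_weights(class_counts):
    """weight_c = N / (K * n_c); rarer classes get larger weights."""
    total = sum(class_counts)
    k = len(class_counts)
    return [total / (k * n) for n in class_counts]

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a given optimizer step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Class proportions from the dataset section, ordered [NOT, HOF].
weights = balanced_class_weights([0.537, 0.463])

# Step counts derived from the card: 70% of 29,533 samples, batch size 32, 4 epochs.
train_size = round(0.70 * 29533)
steps_per_epoch = math.ceil(train_size / 32)
total_steps = steps_per_epoch * 4
warmup_steps = int(0.1 * total_steps)

print([round(w, 3) for w in weights], total_steps, warmup_steps)
```

Under these assumptions the HOF class is weighted slightly above 1 and the NOT class slightly below, and the learning rate peaks at 2e-5 after the first tenth of the planned steps. Early stopping means fewer steps were actually run than the planned total.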
Training Infrastructure
- Platform: Kaggle (free tier)
- GPU: Tesla T4 x2
- Training time: ~25 minutes
Evaluation Results
Test Set Performance (5,907 samples)
| Metric | NOT | HOF | Macro Avg |
|---|---|---|---|
| Precision | 0.7351 | 0.7444 | 0.7397 |
| Recall | 0.8015 | 0.6668 | 0.7342 |
| F1-Score | 0.7669 | 0.7035 | 0.7352 |
| Support | 3,164 | 2,743 | 5,907 |
Overall Accuracy: 73.90%
Comparison with Baseline
| Model | Macro F1 | Notes |
|---|---|---|
| TF-IDF + Logistic Regression | 0.7138 | Baseline |
| MuRIL Fine-tuned (ours) | 0.7352 | +3.0% relative gain |
Confusion Matrix
| | Predicted NOT | Predicted HOF |
|---|---|---|
| Actual NOT | 2,513 | 651 |
| Actual HOF | 926 | 1,817 |
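Per-class metrics can be recomputed directly from the four counts in the confusion matrix above. The sketch below uses the standard definitions; `per_class_metrics` is an illustrative helper. The values it derives are close to but not identical with the reported test-set table (for example, accuracy of about 0.733 versus 73.90%), which may mean the matrix and the table come from different evaluation runs; that explanation is a guess.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the confusion matrix (rows = actual, columns = predicted).
tn, fp, fn, tp = 2513, 651, 926, 1817

hof = per_class_metrics(tp=tp, fp=fp, fn=fn)
# For the NOT class the roles swap: a correct NOT counts as its "true positive".
not_ = per_class_metrics(tp=tn, fp=fn, fn=fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (hof[2] + not_[2]) / 2

print(f"accuracy={accuracy:.4f}, macro_f1={macro_f1:.4f}")
```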
Error Analysis & Known Limitations
Error Breakdown
| Type | Count | Rate | Description |
|---|---|---|---|
| True Positive | 1,817 | - | Correctly caught hate |
| True Negative | 2,513 | - | Correctly passed clean |
| False Negative | 926 | 33.76% of actual HOF | Missed hate speech (the most dangerous error) |
| False Positive | 651 | 20.58% of actual NOT | Over-flagged clean content |
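The two rates in the table follow from the counts using the usual definitions, with each error type divided by the number of samples in its true class:

```latex
\mathrm{FNR} = \frac{FN}{FN + TP} = \frac{926}{926 + 1817} \approx 33.76\%,
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{651}{651 + 2513} \approx 20.58\%
```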
Known Failure Modes
1. Implicit/Subtle Hate Speech: The model struggles with hate speech that avoids explicit slurs or offensive vocabulary; sarcasm, dog whistles, and context-dependent toxicity are frequently missed.
2. Very Short Texts: Texts under 50 characters have a 41.8% false positive rate. With minimal context, the model over-flags based on surface patterns.
3. Non-Hinglish Languages: Croatian, Serbian, and other non-Hinglish texts are sometimes misclassified as hate speech. MuRIL's multilingual pretraining covers Indian languages and English, so these inputs are out of distribution.
4. Medium-Length Hate Speech (101-150 chars): The worst false negative rate (48.7%) occurs in the 101-150 character range, where nearly half of hate speech is missed.
5. Confidence Calibration
- Below 0.60 confidence: only 56-66% accurate (borderline territory)
- Above 0.65 confidence: 87-96% accurate (reliable zone)
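Accuracy figures per confidence band like those above can be produced by binning predictions on the winning-class probability. The sketch below is one way to do that; the bin edges and the sample records are made up for illustration and are not the model's real outputs.

```python
def accuracy_by_confidence(records, edges=(0.5, 0.6, 0.65, 0.8, 1.0)):
    """records: list of (confidence, is_correct). Returns {(lo, hi): (n, accuracy)}."""
    bins = {}
    for lo, hi in zip(edges, edges[1:]):
        last = hi == edges[-1]           # the top bin also includes its upper edge
        hits = [ok for conf, ok in records
                if lo <= conf < hi or (last and conf == hi)]
        if hits:
            bins[(lo, hi)] = (len(hits), sum(hits) / len(hits))
    return bins

# Invented (confidence, is_correct) pairs, for illustration only.
records = [(0.55, True), (0.58, False), (0.62, True),
           (0.70, True), (0.97, True), (1.0, True)]
for (lo, hi), (n, acc) in accuracy_by_confidence(records).items():
    print(f"{lo:.2f}-{hi:.2f}: n={n}, accuracy={acc:.2f}")
```

Running this over held-out predictions is a quick check that a chosen threshold still separates reliable from borderline outputs after retraining.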
Recommended Production Thresholds
if confidence >= 0.65:
    action = "auto_classify"           # reliable (87-96% accurate)
elif 0.60 <= confidence < 0.65:
    action = "auto_classify_and_log"   # auto-classify with logging
else:  # confidence < 0.60
    action = "human_review"            # route to human review queue
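The same rule can be wrapped in a function whose thresholds are parameters, which makes them easier to tune per deployment. This is a sketch: `route` and its return strings are hypothetical names, and the input is assumed to be a dict with a `confidence` key like the one returned by `predict` in the usage section.

```python
def route(prediction, auto_threshold=0.65, review_threshold=0.60):
    """Map a prediction dict (with a 'confidence' key) to a handling decision."""
    confidence = prediction["confidence"]
    if confidence >= auto_threshold:
        return "auto"                  # reliable zone
    if confidence >= review_threshold:
        return "auto_with_logging"     # borderline: act, but keep a record
    return "human_review"              # low confidence: send to a moderator

# Hypothetical predictions for illustration.
for pred in ({"label": "HOF", "confidence": 0.91},
             {"label": "HOF", "confidence": 0.62},
             {"label": "NOT", "confidence": 0.54}):
    print(pred["label"], pred["confidence"], route(pred))
```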
How to Use
Direct Inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("sourabh5500/hate-speech-muril")
model = AutoModelForSequenceClassification.from_pretrained(
    "sourabh5500/hate-speech-muril"
)
model.eval()  # disable dropout for inference

def predict(text: str) -> dict:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the two logits; [0] selects the single input in the batch
    probs = torch.softmax(outputs.logits, dim=1)[0]
    label_id = probs.argmax().item()
    return {
        "label": model.config.id2label[label_id],
        "confidence": probs[label_id].item(),
        "scores": {
            "NOT": probs[0].item(),
            "HOF": probs[1].item()
        }
    }

# Test
print(predict("yaar tu bahut bura insaan hai"))
print(predict("aaj ka din bahut acha tha"))
Using the API
curl -X POST "https://your-api-url/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "yaar tu bahut bura insaan hai"}'
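The same request can be issued from Python using only the standard library. This is a sketch: the URL is the placeholder from the curl example, `build_request` and `classify` are illustrative helpers, and the shape of the JSON response is an assumption because the card does not document it.

```python
import json
import urllib.request

API_URL = "https://your-api-url/predict"  # placeholder, replace with the real endpoint

def build_request(text, url=API_URL):
    """Build the POST request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(text, url=API_URL, timeout=10):
    """Send the request and decode the JSON reply (response schema is assumed)."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_request("yaar tu bahut bura insaan hai")
print(req.get_method(), req.full_url)
```

Calling `classify(...)` only works once `API_URL` points at a running deployment.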
Bias & Fairness Considerations
- Language bias: Trained primarily on Hinglish/English. Performance degrades on pure Hindi (Devanagari script) or other Indian languages.
- Topic bias: Dataset sourced from Twitter/social media, so it may not generalize to other platforms (WhatsApp, YouTube comments).
- Annotator bias: Human annotations inherently reflect annotator demographics and perspectives. Hate speech labeling is subjective.
- Recency bias: Training data has a cutoff date. New slang and emerging hate speech patterns won't be captured.
- Class bias: Despite near-balanced dataset (1.16x), HOF recall (0.667) is lower than NOT recall (0.802).
Intended Use
Appropriate uses:
- Content moderation assistance (human-in-the-loop)
- Research on Hinglish hate speech
- First-pass filtering with human review for borderline cases
- Academic study of multilingual toxicity detection
Inappropriate uses:
- Fully automated moderation without human oversight
- Legal or punitive decisions based solely on model output
- Languages other than Hinglish/English
- Real-time high-stakes moderation without threshold tuning
Project Structure
SafeShield AI/
├── src/
│   ├── data/
│   │   ├── preprocessor.py
│   │   └── loader.py
│   └── models/
│       └── baseline.py
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_baseline.ipynb
│   ├── 04_muril_finetune.ipynb
│   ├── 05_muril_improved.ipynb
│   └── 06_error_analysis.ipynb
├── api/
├── app/
├── results/
└── MODEL_CARD.md
Citation
@misc{saxena2025safeshield,
author = {Sourabh Saxena},
title = {SafeShield AI: Multilingual Hate Speech Detection for Hinglish Text},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/sourabh5500/hate-speech-muril}
}
Author
Sourabh Saxena
B.Tech CSE (AI/ML), IMS Engineering College
GitHub: sourabh-550
HuggingFace: sourabh5500
This model is intended as a research tool. Always use human oversight for content moderation decisions. The author is not responsible for misuse of this model.