Indian Language Profanity Detector
A fine-tuned transformer model for detecting profanity and abusive content in Indian languages and code-mixed text such as Hinglish.
This model is intended for safety guardrails, moderation pipelines, and NLU microservices.
Model Details
- Developed by: tester3792005
- Task: text-classification
- Model type: Text Classification
- Architecture: BERT / MuRIL-based, depending on the configuration in config.json
- Languages: Hindi, English, Hinglish, and other Indic languages
- Labels: CLEAN / TOXIC, or your custom label set
Overview
This repository contains a Hugging Face-compatible text classification model for profanity and abuse detection in Indian languages.
It is designed to work in offline environments as long as all model files are present locally. The model card metadata in this README declares the task for the Hugging Face Hub; when loading from a local path without internet access, pass the task name to pipeline() explicitly, because the task cannot be looked up on the Hub while offline.
Features
- Supports Hindi, English, Hinglish, and other Indic languages
- Handles code-mixed text
- Suitable for content moderation and safety filtering
- Works with Hugging Face pipeline
- Can be used offline if model files are stored locally
Installation
pip install transformers torch
Usage
Using Hugging Face pipeline
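from transformers import pipeline

# device=-1 runs inference on the CPU; pass a GPU index such as 0 to use CUDA
classifier = pipeline(
    "text-classification",
    model="path/to/your/model",
    tokenizer="path/to/your/model",
    device=-1,
)
text = "tu bahut bekar insaan hai"
result = classifier(text)
print(result)  # a list with one dict holding "label" and "score"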
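In a moderation pipeline, the raw output usually has to be turned into an allow/block decision. The following is a minimal sketch, assuming the output has the usual text-classification shape (a list of dicts with `label` and `score`) and that the abusive class is named `TOXIC`. The helper name `should_block` and the 0.8 threshold are hypothetical choices for illustration, not part of this model.

```python
def should_block(result, toxic_label="TOXIC", threshold=0.8):
    """Return True when the top prediction is the toxic label with enough confidence.

    `result` is expected to look like the pipeline output for one input,
    for example [{"label": "TOXIC", "score": 0.97}].
    """
    top = result[0]
    return top["label"] == toxic_label and top["score"] >= threshold


# Hand-written sample outputs (not real model predictions) to show the logic.
samples = [
    [{"label": "TOXIC", "score": 0.97}],  # confident toxic prediction
    [{"label": "TOXIC", "score": 0.55}],  # toxic, but below the threshold
    [{"label": "CLEAN", "score": 0.99}],  # clean prediction
]
for sample in samples:
    print(sample[0], "->", "block" if should_block(sample) else "allow")
```

A suitable threshold depends on how costly false positives are for the application, so it is worth tuning on held-out data rather than reusing the value above.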
Using AutoTokenizer and AutoModelForSequenceClassification
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

text = "yeh bahut kharab hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
# Disable gradient tracking, which is not needed for inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
prediction = torch.argmax(logits, dim=1)
# Prints the class index; model.config.id2label maps it to a label name
print(prediction.item())
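The class index printed above can be mapped to a label name and a confidence score. The following is a minimal sketch in plain Python, assuming the logits for one input have been converted to a list (for example with `outputs.logits[0].tolist()`). The mapping `{0: "CLEAN", 1: "TOXIC"}` is an assumed example; the actual mapping is whatever `id2label` in config.json defines.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Assumed label mapping and made-up logits, for illustration only.
id2label = {0: "CLEAN", 1: "TOXIC"}
label, prob = decode([-1.2, 2.3], id2label)
print(label, round(prob, 3))
```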
Offline Compatibility
To ensure the model loads properly in restricted or offline networks, make sure the repository contains:
- config.json
- pytorch_model.bin or model.safetensors
- tokenizer files such as tokenizer.json, vocab.txt, merges.txt, or similar
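- this README.md file with its model card metadata (the YAML front matter block at the top of the file)
The pipeline_tag: text-classification entry in that metadata tells the Hugging Face Hub which task the model serves. When loading from a local path, pass the task name to pipeline() explicitly, as in the Usage section, because the task cannot be inferred from the Hub while offline.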
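Before deploying to a restricted network, it can help to verify that the local directory is complete. The following is a minimal sketch using only the standard library. The function name `missing_model_files` is hypothetical, and the tokenizer check assumes a BERT-style tokenizer that ships either tokenizer.json or vocab.txt; adjust the file names to match your model.

```python
from pathlib import Path

WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")
TOKENIZER_FILES = ("tokenizer.json", "vocab.txt")  # assumed BERT-style tokenizer


def missing_model_files(model_dir):
    """Return a list describing required files that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    # Either weight format is enough
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append("weights (" + " or ".join(WEIGHT_FILES) + ")")
    # At least one tokenizer vocabulary file must exist
    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        missing.append("tokenizer (" + " or ".join(TOKENIZER_FILES) + ")")
    return missing


problems = missing_model_files("path/to/your/model")
print("OK" if not problems else "Missing: " + ", ".join(problems))
```

Setting the environment variable `HF_HUB_OFFLINE=1` before importing transformers additionally prevents the library from attempting Hub requests.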
Example Inputs
| Input Text | Expected Output |
|---|---|
| tu pagal hai | TOXIC |
| aap bahut achhe ho | CLEAN |
| yeh bakwaas hai bro | TOXIC |
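
The table can double as a small smoke test after deployment. The following is a minimal sketch, assuming `classify` is any callable that maps a string to a label. The `fake_classify` stub is a hypothetical keyword matcher included only so that the example runs without model files; it does not reflect how the real model behaves.

```python
EXPECTED = [
    ("tu pagal hai", "TOXIC"),
    ("aap bahut achhe ho", "CLEAN"),
    ("yeh bakwaas hai bro", "TOXIC"),
]


def run_smoke_test(classify, cases=EXPECTED):
    """Return the cases where classify(text) differs from the expected label."""
    failures = []
    for text, expected in cases:
        predicted = classify(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


def fake_classify(text):
    """Hypothetical keyword stub standing in for the real model."""
    return "TOXIC" if any(word in text for word in ("pagal", "bakwaas")) else "CLEAN"


failures = run_smoke_test(fake_classify)
print(f"{len(EXPECTED) - len(failures)}/{len(EXPECTED)} cases passed")
```

With the real model, `classify` could be `lambda t: classifier(t)[0]["label"]`, where `classifier` is the pipeline from the Usage section.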
Use Cases
- Content moderation
- Chat safety filtering
- Social media text screening
- Chatbot guardrails
- NLU safety microservices
Limitations
- Performance may vary across dialects and slang
- Sarcasm and context-dependent abuse may not always be detected correctly
- Accuracy depends on the quality and diversity of the training data
Future Improvements
- Add more Indic languages
- Improve context-aware toxicity detection
- Support multi-label classification
- Add quantized inference support for edge devices
License
Apache-2.0
Acknowledgements
- Hugging Face Transformers
- MuRIL and other Indic language transformer models