Indian Language Profanity Detector

A fine-tuned transformer model for detecting profanity and abusive content in Indian languages and code-mixed text such as Hinglish.

This model is intended for safety guardrails, moderation pipelines, and NLU microservices.

Model Details

Developed by: tester3792005
Task: text-classification
Model type: Text Classification
Architecture: BERT / MuRIL-based, depending on the configuration in config.json
Languages: Hindi, English, Hinglish, and other Indic languages
Labels: CLEAN / TOXIC, or your custom label set

Overview

This repository contains a Hugging Face-compatible text classification model for profanity and abuse detection in Indian languages.

It is designed to work in offline environments, provided all model files and metadata are present locally. The Hugging Face model card metadata in this README declares the task (text-classification), which helps tooling pick the right pipeline without an internet lookup.

Features

Supports Hindi, English, Hinglish, and other Indic languages
Handles code-mixed text
Suitable for content moderation and safety filtering
Works with Hugging Face pipeline
Can be used offline if model files are stored locally

Installation

pip install transformers torch

Usage

Using Hugging Face pipeline

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/your/model",
    tokenizer="path/to/your/model",
    device=-1,  # -1 selects the CPU
)

text = "tu bahut bekar insaan hai"
result = classifier(text)

print(result)

Using AutoTokenizer and AutoModelForSequenceClassification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "path/to/your/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

text = "yeh bahut kharab hai"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
prediction = torch.argmax(logits, dim=1)

# Prints the predicted class index; model.config.id2label maps it to a label name
print(prediction.item())
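
The index printed above is only meaningful together with the label mapping. The following is a minimal sketch of turning raw logits into a label name and a probability. It assumes the checkpoint's config.json defines id2label (for example 0 for CLEAN and 1 for TOXIC; the actual mapping of this checkpoint is not stated here). The helper names softmax, decode_logits, and classify are illustrative and not part of Transformers.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(text, model_path="path/to/your/model"):
    """Run the model on one string and decode the result.

    The heavy imports are local so the helpers above work without torch.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # model.config.id2label maps integer class ids to label strings
    return decode_logits(logits, model.config.id2label)
```

Calling classify("yeh bahut kharab hai") with a valid local model path would return a (label, probability) pair instead of a bare index.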

Offline Compatibility

To ensure the model loads properly in restricted or offline networks, make sure the repository contains:

config.json
pytorch_model.bin or model.safetensors
tokenizer files such as tokenizer.json, vocab.txt, merges.txt, or similar
this README.md file, including its model card metadata block

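The pipeline_tag: text-classification entry in that metadata tells Hugging Face tooling which task the model serves, so the task does not have to be inferred through a network lookup. Passing the task explicitly to pipeline(), as in the Usage section, avoids task inference altogether.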
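
The checklist above can be verified before loading. The following is a minimal sketch, assuming a local model directory. The function name missing_model_files and the exact tokenizer file names checked (tokenizer.json or vocab.txt) are assumptions for illustration; a checkpoint with a different tokenizer may ship other files. HF_HUB_OFFLINE is an environment variable honored by huggingface_hub and has to be set before the library is imported.

```python
import os
from pathlib import Path

# Ask huggingface_hub / transformers not to attempt network calls.
# Must be set before those libraries are imported.
os.environ.setdefault("HF_HUB_OFFLINE", "1")

WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")
TOKENIZER_FILES = ("tokenizer.json", "vocab.txt")  # assumed names, adjust as needed


def missing_model_files(model_dir):
    """Return a list of missing items; an empty list means the directory looks complete."""
    root = Path(model_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("weights (" + " or ".join(WEIGHT_FILES) + ")")
    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        problems.append("tokenizer (" + " or ".join(TOKENIZER_FILES) + ")")
    return problems
```

Passing local_files_only=True to from_pretrained is another way to keep Transformers from reaching the network when loading from a local path.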

Example Inputs

Input Text	Expected Output
tu pagal hai	TOXIC
aap bahut achhe ho	CLEAN
yeh bakwaas hai bro	TOXIC
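
The examples above can double as a quick smoke test. The following is a minimal sketch, assuming a classifier callable that behaves like the pipeline in the Usage section (one {"label", "score"} dict in a list per input string). The names check_examples and stub_classifier are hypothetical; the stub only imitates the expected labels so the sketch runs without model files and is not the model itself.

```python
EXAMPLES = [
    ("tu pagal hai", "TOXIC"),
    ("aap bahut achhe ho", "CLEAN"),
    ("yeh bakwaas hai bro", "TOXIC"),
]


def check_examples(classifier, examples=EXAMPLES):
    """Return (text, expected, predicted) for every example the classifier gets wrong."""
    failures = []
    for text, expected in examples:
        predicted = classifier(text)[0]["label"]
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


def stub_classifier(text):
    """Stand-in for the real pipeline; flags two hard-coded words as TOXIC."""
    toxic_words = {"pagal", "bakwaas"}
    label = "TOXIC" if toxic_words & set(text.split()) else "CLEAN"
    return [{"label": label, "score": 1.0}]
```

With the real model, pass the pipeline object in place of stub_classifier; an empty list from check_examples means every example matched its expected label.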

Use Cases

Content moderation
Chat safety filtering
Social media text screening
Chatbot guardrails
NLU safety microservices
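
For the moderation and guardrail use cases above, one common pattern is to block a message only when the TOXIC score passes a threshold. The following is a minimal sketch, assuming pipeline-style prediction dicts with label and score keys and the CLEAN / TOXIC label names. The function names should_block and filter_messages and the 0.8 threshold are illustrative choices, not values taken from this model's evaluation.

```python
def should_block(prediction, toxic_label="TOXIC", threshold=0.8):
    """Decide whether to block a message from one pipeline-style prediction dict."""
    return prediction["label"] == toxic_label and prediction["score"] >= threshold


def filter_messages(messages, classifier, threshold=0.8):
    """Split messages into (allowed, blocked) using the classifier's top prediction."""
    allowed, blocked = [], []
    for message in messages:
        prediction = classifier(message)[0]
        target = blocked if should_block(prediction, threshold=threshold) else allowed
        target.append(message)
    return allowed, blocked
```

A lower threshold blocks more aggressively at the cost of more false positives; the right value depends on the deployment and should be tuned on representative data.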

Limitations

Performance may vary across dialects and slang
Sarcasm and context-dependent abuse may not always be detected correctly
Accuracy depends on the quality and diversity of the training data

Future Improvements

Add more Indic languages
Improve context-aware toxicity detection
Support multi-label classification
Add quantized inference support for edge devices

License

Apache-2.0

Acknowledgements

Hugging Face Transformers
MuRIL and other Indic language transformer models

Safetensors

Model size: 0.2B params
Tensor type: F32