---
library_name: transformers
language:
- en
- hi
---

# Hindi-English Code-mixed Hate Detection

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. The model performs hate detection on Hindi-English code-mixed text.

- **Developed by:** Debajyoti Mazumder, Aakash Kumar
- **Model type:** Text Classification
- **Language(s):** Hindi-English code-mixed
- **Parent Model:** See the [BERT multilingual base model (cased)](https://huggingface.co/google-bert/bert-base-multilingual-cased) for more information about the model.
- **Paper:** [https://dl.acm.org/doi/full/10.1145/3726866](https://dl.acm.org/doi/full/10.1145/3726866)

## How to Get Started with the Model

**Details of usage**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("debajyotimaz/codemix_hate")
model = AutoModelForSequenceClassification.from_pretrained("debajyotimaz/codemix_hate")

inputs = tokenizer("Mai tumse hate karta hun", return_tensors="pt")
with torch.no_grad():
    prediction = model(**inputs)
print(prediction.logits)
```

## Citation

```bibtex
@article{10.1145/3726866,
  author = {Mazumder, Debajyoti and Kumar, Aakash and Patro, Jasabanta},
  title = {Improving Code-Mixed Hate Detection by Native Sample Mixing: A Case Study for Hindi-English Code-Mixed Scenario},
  year = {2025},
  issue_date = {May 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {24},
  number = {5},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3726866},
  doi = {10.1145/3726866},
  abstract = {Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see much less work on code-mixed hate as large-scale annotated hate corpora are unavailable for the study. To overcome this bottleneck, we propose using native language hate samples (native language samples/ native samples hereafter). We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This article attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone observed to be detecting code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn't help much to detect code-mixed hate. We have released the data and code repository to reproduce the reported results.},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = apr,
  articleno = {47},
  numpages = {21},
  keywords = {Code-mixed hate detection, cross-lingual learning, native sample mixing}
}
```
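
## Interpreting the Output

The quickstart above prints raw logits. The sketch below converts them into a readable label by applying a softmax and taking the argmax, reading the label names from the model's own `config.id2label`. This is a minimal example under the assumption of a standard two-class (non-hate/hate) head; the example sentences and the batched-inference pattern (`padding=True`, `truncation=True`) are illustrative, not part of the original model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("debajyotimaz/codemix_hate")
model = AutoModelForSequenceClassification.from_pretrained("debajyotimaz/codemix_hate")
model.eval()

# Illustrative code-mixed inputs (batched)
texts = ["Mai tumse hate karta hun", "Aaj ka din bahut accha tha"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the class dimension, then pick the highest-probability class
probs = torch.softmax(logits, dim=-1)
pred_ids = probs.argmax(dim=-1)

for text, pid, p in zip(texts, pred_ids, probs):
    # Fall back to the raw index if the config carries no label names
    label = model.config.id2label.get(int(pid), str(int(pid)))
    print(f"{text!r} -> {label} (confidence {p[pid]:.2f})")
```

Note that the printed label names depend on how `id2label` was set when the checkpoint was saved; if it only contains generic entries such as `LABEL_0`/`LABEL_1`, consult the paper for the class mapping.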