---
library_name: transformers
language:
- en
- hi
---

# Hindi-English Code-mixed Hate Detection

<!-- Provide a quick summary of what the model is/does. -->
<p align="left">
<img src="hate_logo.png" alt="Project Logo" width="300"/>
</p>

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card for a 🤗 Transformers model hosted on the Hub. The model performs hate detection on Hindi-English code-mixed text.

- **Developed by:** Debajyoti Mazumder, Aakash Kumar
- **Model type:** Text Classification
- **Language(s):** Hindi-English code-mixed
- **Parent Model:** See [BERT multilingual base model (cased)](https://huggingface.co/google-bert/bert-base-multilingual-cased) for more information about the parent model.
- **Paper:** [https://dl.acm.org/doi/full/10.1145/3726866](https://dl.acm.org/doi/full/10.1145/3726866)

## How to Get Started with the Model

**Details of usage:** load the tokenizer and model from the Hub, then print the raw logits for a sample sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("debajyotimaz/codemix_hate")
model = AutoModelForSequenceClassification.from_pretrained("debajyotimaz/codemix_hate")

# Tokenize a code-mixed sentence and run a forward pass without gradients
inputs = tokenizer("Mai tumse hate karta hun", return_tensors="pt")
with torch.no_grad():
    prediction = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(prediction.logits)  # raw, unnormalised class scores
```
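
The usage snippet prints raw logits, which are not probabilities. As a rough sketch of one way to post-process them, the dependency-free example below applies a softmax and picks the highest-scoring class. The helper names `softmax` and `top_label`, the example logits, and the `LABEL_0`/`LABEL_1` names are illustrative assumptions, not values taken from this model. With the real model, the logits would presumably come from `prediction.logits[0].tolist()` and the label names from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical values for illustration only; with the real model use
# prediction.logits[0].tolist() and model.config.id2label instead.
example_logits = [-1.2, 2.3]
example_id2label = {0: "LABEL_0", 1: "LABEL_1"}

label, prob = top_label(example_logits, example_id2label)
print(f"{label}: {prob:.3f}")
```

Which index corresponds to the hate class is not stated in this card, so it is worth checking `model.config.id2label` before interpreting the result.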

## Citation

```bibtex
@article{10.1145/3726866,
  author = {Mazumder, Debajyoti and Kumar, Aakash and Patro, Jasabanta},
  title = {Improving Code-Mixed Hate Detection by Native Sample Mixing: A Case Study for Hindi-English Code-Mixed Scenario},
  year = {2025},
  issue_date = {May 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {24},
  number = {5},
  issn = {2375-4699},
  url = {https://doi.org/10.1145/3726866},
  doi = {10.1145/3726866},
  abstract = {Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see much less work on code-mixed hate as large-scale annotated hate corpora are unavailable for the study. To overcome this bottleneck, we propose using native language hate samples (native language samples/ native samples hereafter). We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This article attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone observed to be detecting code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn’t help much to detect code-mixed hate. We have released the data and code repository to reproduce the reported results.},
  journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month = apr,
  articleno = {47},
  numpages = {21},
  keywords = {Code-mixed hate detection, cross-lingual learning, native sample mixing}
}
```