	---
	library_name: transformers
	language:
	- en
	- hi
	---

	# Hindi-English Code-mixed Hate Detection

	<p align="left">
	<img src="hate_logo.png" alt="Project Logo" width="300"/>
	</p>


	## Model Details

	### Model Description

	This is the model card for a 🤗 Transformers model hosted on the Hub. The model detects hate in Hindi-English code-mixed text.

	- Developed by: Debajyoti Mazumder, Aakash Kumar
	- Model type: Text Classification
	- Language(s): Hindi-English code-mixed
	- Parent Model: [BERT multilingual base model (cased)](https://huggingface.co/google-bert/bert-base-multilingual-cased); see its model card for more information.
	- Paper: [https://dl.acm.org/doi/full/10.1145/3726866](https://dl.acm.org/doi/full/10.1145/3726866)

	## How to Get Started with the Model

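	The example below loads the tokenizer and model from the Hub and prints the raw classification logits for a sample code-mixed sentence.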


	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load the tokenizer and classifier from the Hub
	tokenizer = AutoTokenizer.from_pretrained("debajyotimaz/codemix_hate")
	model = AutoModelForSequenceClassification.from_pretrained("debajyotimaz/codemix_hate")

	# Tokenize one code-mixed sentence and run a forward pass without gradient tracking
	inputs = tokenizer("Mai tumse hate karta hun", return_tensors="pt")
	with torch.no_grad():
	    prediction = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
	print(prediction.logits)  # raw, unnormalized class scores
	```
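
The logits printed above are unnormalized scores, so they usually need a softmax before they can be read as class probabilities. The following is a minimal sketch of that step, not part of the original card. The helper name `logits_to_prediction`, the two-class logits, and the label names `"non-hate"` and `"hate"` are hypothetical and used only for illustration; the real mapping should be read from `model.config.id2label`.

```python
import math

def logits_to_prediction(logits, id2label=None):
    """Convert one row of raw logits into (label, probabilities) via softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the most probable class
    label = id2label[best] if id2label else best
    return label, probs

# With the model loaded as in the snippet above, the call would look like:
# label, probs = logits_to_prediction(prediction.logits[0].tolist(), model.config.id2label)

# Stand-alone illustration: made-up logits and HYPOTHETICAL label names
label, probs = logits_to_prediction([-1.2, 2.3], {0: "non-hate", 1: "hate"})
print(label, [round(p, 3) for p in probs])
```

Passing a plain list keeps the helper independent of the tensor library; with PyTorch already imported, `torch.softmax(prediction.logits, dim=-1)` performs the same normalization directly on the tensor.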

	## Citation

	```bibtex
	@article{10.1145/3726866,
	author = {Mazumder, Debajyoti and Kumar, Aakash and Patro, Jasabanta},
	title = {Improving Code-Mixed Hate Detection by Native Sample Mixing: A Case Study for Hindi-English Code-Mixed Scenario},
	year = {2025},
	issue_date = {May 2025},
	publisher = {Association for Computing Machinery},
	address = {New York, NY, USA},
	volume = {24},
	number = {5},
	issn = {2375-4699},
	url = {https://doi.org/10.1145/3726866},
	doi = {10.1145/3726866},
	abstract = {Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see much less work on code-mixed hate as large-scale annotated hate corpora are unavailable for the study. To overcome this bottleneck, we propose using native language hate samples (native language samples/ native samples hereafter). We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This article attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone observed to be detecting code-mixed hate to a large extent, (iii) the visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn’t help much to detect code-mixed hate. We have released the data and code repository to reproduce the reported results.},
	journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
	month = apr,
	articleno = {47},
	numpages = {21},
	keywords = {Code-mixed hate detection, cross-lingual learning, native sample mixing}
	}
	```