--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
base_model: xlm-roberta-base |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
model-index: |
|
|
- name: bengali-code-mix-sentiment |
|
|
results: [] |
|
|
datasets: |
|
|
- Swarnadeep-28/bn_code_mix_sentiment_dataset |
|
|
metrics: |
|
|
- accuracy |
|
|
- f1 |
|
|
- precision |
|
|
- recall |
|
|
--- |
|
|
# Bengali-English Code-Mixed Sentiment Model |
|
|
|
|
|
## Model Summary |
|
|
This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets). |
|
|
|
|
|
- **Task**: Text Classification (Sentiment Analysis) |
|
|
- **Languages**: Bengali (Romanized) + English |
|
|
- **Classes**: 4 sentiment labels (`0`, `1`, `2`, `3`), following the dataset's integer label scheme
|
|
- **Fine-tuning method**: Full fine-tuning |
|
|
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset) |
|
|
|
|
|
This model provides strong baseline performance for code-mixed sentiment classification and can be directly applied to social media analysis and low-resource NLP research. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Inference Example |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
model_id = "Swarnadeep-28/bengali-code-mix-sentiment" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_id) |
|
|
|
|
|
text = "Aaj match ta khub bhalo chilo! Loved it." |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
|
|
|
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits |
|
|
pred = torch.argmax(logits, dim=-1).item() |
|
|
labels = ["0", "1", "2", "3"]  # integer class names as defined in the dataset
|
|
print("Predicted label:", labels[pred]) |
|
|
``` |
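
If class probabilities are needed rather than a single label, the logits can be converted with a softmax. The sketch below uses a dummy logits tensor so it runs without downloading the model; with the real model, `logits = model(**inputs).logits` from the example above drops in directly.

```python
import torch
import torch.nn.functional as F

# Stand-in for `model(**inputs).logits` from the example above
# (shape: batch_size x num_labels, here 1 x 4).
logits = torch.tensor([[1.2, -0.3, 2.1, 0.4]])

probs = F.softmax(logits, dim=-1)          # per-class probabilities
pred = torch.argmax(probs, dim=-1).item()  # index of the most likely class

print("Probabilities:", probs.squeeze().tolist())
print("Predicted label:", pred)
```

The probability vector is useful for thresholding low-confidence predictions before acting on them.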
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base model**: `xlm-roberta-base` |
|
|
- **Method**: Full fine-tuning (all parameters updated) |
|
|
- **Optimizer**: AdamW |
|
|
- **Learning Rate**: 2e-5 |
|
|
- **Epochs**: 3 |
|
|
- **Batch Size**: 16 (train), 32 (eval) |
|
|
- **Hardware**: Trained on a single GPU (Colab T4 / equivalent) |
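
As a minimal illustration of what "full fine-tuning" means here (every parameter receives gradients and is updated by AdamW at the stated learning rate), the sketch below runs one optimization step on a stand-in linear classifier; the real setup updates all of `xlm-roberta-base` plus the 4-way classification head in the same way.

```python
import torch

# Stand-in for the full model; in practice this is xlm-roberta-base
# plus a 4-way classification head, with all parameters trainable.
model = torch.nn.Linear(768, 4)
assert all(p.requires_grad for p in model.parameters())  # nothing is frozen

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the list above

# One step on a dummy batch of 16 (the training batch size).
x = torch.randn(16, 768)
y = torch.randint(0, 4, (16,))

before = model.weight.detach().clone()
loss = torch.nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()  # every parameter moves, since none are frozen
```

This contrasts with parameter-efficient methods, where most base-model weights would stay frozen.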
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Classification Report |
|
|
| Label | Precision | Recall | F1-Score | Support | |
|
|
|-------|-----------|--------|----------|---------| |
|
|
| 0 | 0.80 | 0.73 | 0.77 | 528 | |
|
|
| 1 | 0.73 | 0.73 | 0.73 | 617 | |
|
|
| 2 | 0.69 | 0.76 | 0.72 | 675 | |
|
|
| 3 | 0.67 | 0.57 | 0.62 | 182 | |
|
|
|
|
|
### Overall Metrics |
|
|
- **Accuracy**: 0.73 |
|
|
- **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71 |
|
|
- **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73 |
|
|
- **Total Samples**: 2002 |
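
The macro and weighted averages follow directly from the per-class table: the macro average is the unweighted mean over the four classes, while the weighted average weights each class by its support. A quick check in Python, with values copied from the table above:

```python
# Per-class F1 scores and supports from the classification report.
f1 = [0.77, 0.73, 0.72, 0.62]
support = [528, 617, 675, 182]

macro_f1 = sum(f1) / len(f1)                                      # unweighted mean
weighted_f1 = sum(s * v for s, v in zip(support, f1)) / sum(support)  # support-weighted

print(round(macro_f1, 2))     # 0.71
print(round(weighted_f1, 2))  # 0.73
```

The gap between the two reflects the small, weaker class `3` (support 182), which pulls the macro average down more than the weighted one.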
|
|
|
|
|
--- |
|
|
|
|
|
## Applications |
|
|
- Sentiment classification of Bengali-English social media text |
|
|
- Research in **code-mixed NLP for Indic languages** |
|
|
- A full fine-tuning baseline for comparison with parameter-efficient approaches (e.g., a LoRA-tuned variant)
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
- Accuracy may degrade on heavily Romanized or slang-heavy Bengali
|
|
- Trained primarily on short-form text (tweets, comments, reviews) |
|
|
- Not designed for abusive/toxic content moderation or safety-critical use cases |
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
- The training data reflects biases naturally present in its social media sources
|
|
- Misclassifications are more likely on sarcastic or offensive text
|
|
- Should not be the sole basis for critical decision-making |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{das2025_bn_code_mix_sentiment,
|
|
author = {Swarnadeep Das}, |
|
|
title = {Bengali-English Code-Mixed Sentiment Model}, |
|
|
year = {2025}, |
|
|
url = {https://huggingface.co/Swarnadeep-28/bengali-code-mix-sentiment} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements |
|
|
- **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/Swarnadeep-28/bn_code_mix_sentiment_dataset) |
|
|
- **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) |
|
|
- **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets) |