Swarnadeep-28 committed on
Commit
0c71423
·
verified ·
1 Parent(s): a5d6240

Update README.md

Files changed (1)
  1. README.md +97 -26
README.md CHANGED
@@ -7,47 +7,118 @@ tags:
7
  model-index:
8
  - name: bengali-code-mix-sentiment
9
  results: []
10
  ---
 
11
 
12
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
- should probably proofread and complete it, then remove this comment. -->
14
 
15
- # bengali-code-mix-sentiment
16
 
17
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the None dataset.
18
 
19
- ## Model description
20
 
21
- More information needed
22
 
23
- ## Intended uses & limitations
24
 
25
- More information needed
26
 
27
- ## Training and evaluation data
28
 
29
- More information needed
30
 
31
- ## Training procedure
32
 
33
- ### Training hyperparameters
34
 
35
- The following hyperparameters were used during training:
36
- - learning_rate: 2e-05
37
- - train_batch_size: 16
38
- - eval_batch_size: 32
39
- - seed: 42
40
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
41
- - lr_scheduler_type: linear
42
- - num_epochs: 3
43
 
44
- ### Training results
45
 
 
46
 
 
 
 
 
47
 
48
- ### Framework versions
49
 
50
- - Transformers 4.56.1
51
- - Pytorch 2.8.0+cu126
52
- - Datasets 4.0.0
53
- - Tokenizers 0.22.0
 
7
  model-index:
8
  - name: bengali-code-mix-sentiment
9
  results: []
10
+ datasets:
11
+ - Swarnadeep-28/bn_code_mix_sentiment_dataset
12
+ metrics:
13
+ - accuracy
14
+ - f1
15
+ - precision
16
+ - recall
17
  ---
18
+ # Bengali-English Code-Mixed Sentiment Model
19
 
20
+ ## Model Summary
21
+ This model is a **fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)** for **sentiment analysis** on **Bengali–English code-mixed text** (social media posts, comments, and tweets).
22
 
23
+ - **Task**: Text Classification (Sentiment Analysis)
24
+ - **Languages**: Bengali (Romanized) + English
25
+ - **Classes**: `0`, `1`, `2`, `3`
26
+ - **Fine-tuning method**: Full fine-tuning
27
+ - **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/jojocoder28/bn_code_mix_sentiment_dataset)
28
 
29
+ This model reaches 0.73 accuracy and 0.71 macro F1 on a 2,002-sample evaluation set, which makes it a usable baseline for code-mixed sentiment classification in social media analysis and low-resource NLP research.
30
 
31
+ ---
32
+
33
+ ## How to Use
34
+
35
+ ### Inference Example
36
+ ```python
37
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
38
+ import torch
39
+
40
+ model_id = "jojocoder28/bengali-code-mix-sentiment"
41
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
42
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
43
+
44
+ text = "Aaj match ta khub bhalo chilo! Loved it."
45
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
46
+
47
+ with torch.no_grad():
48
+     logits = model(**inputs).logits
49
+ pred = torch.argmax(logits, dim=-1).item()
50
+ labels = ["0", "1", "2", "3"]
51
+ print("Predicted label:", labels[pred])
52
+ ```
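
The inference example above keeps only the arg-max label. When a confidence score is also useful, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python; the logit values are hypothetical placeholders standing in for `model(**inputs).logits`, and the `softmax` helper is written here for illustration rather than taken from the model card.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single input (assumption, not real model output).
example_logits = [0.2, 2.1, -0.5, -1.3]
labels = ["0", "1", "2", "3"]  # class ids as listed in the model summary

probs = softmax(example_logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(f"Predicted label: {labels[pred]} (confidence {probs[pred]:.2f})")
```

With real model output, `torch.softmax(logits, dim=-1)` performs the same computation on the tensor directly.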
53
+
54
+ ---
55
+
56
+ ## Training Details
57
 
58
+ - **Base model**: `xlm-roberta-base`
59
+ - **Method**: Full fine-tuning (all parameters updated)
60
+ - **Optimizer**: AdamW
61
+ - **Learning Rate**: 2e-5
62
+ - **Epochs**: 3
63
+ - **Batch Size**: 16 (train), 32 (eval)
64
+ - **Hardware**: Trained on a single GPU (Colab T4 / equivalent)
65
 
66
+ ---
67
 
68
+ ## Evaluation
69
 
70
+ ### Classification Report
71
+ | Label | Precision | Recall | F1-Score | Support |
72
+ |-------|-----------|--------|----------|---------|
73
+ | 0 | 0.80 | 0.73 | 0.77 | 528 |
74
+ | 1 | 0.73 | 0.73 | 0.73 | 617 |
75
+ | 2 | 0.69 | 0.76 | 0.72 | 675 |
76
+ | 3 | 0.67 | 0.57 | 0.62 | 182 |
77
 
78
+ ### Overall Metrics
79
+ - **Accuracy**: 0.73
80
+ - **Macro Avg**: Precision = 0.72, Recall = 0.70, F1 = 0.71
81
+ - **Weighted Avg**: Precision = 0.73, Recall = 0.73, F1 = 0.73
82
+ - **Total Samples**: 2002
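
The overall metrics can be cross-checked against the per-class rows of the classification report. The sketch below copies the table values by hand and recomputes the macro and support-weighted averages. It works from the rounded two-decimal figures, so it only confirms consistency to two decimals and is not a re-evaluation of the model.

```python
# Per-class rows copied from the classification report:
# label -> (precision, recall, f1, support)
report = {
    "0": (0.80, 0.73, 0.77, 528),
    "1": (0.73, 0.73, 0.73, 617),
    "2": (0.69, 0.76, 0.72, 675),
    "3": (0.67, 0.57, 0.62, 182),
}

total = sum(row[3] for row in report.values())


def macro_avg(col):
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    """Mean of one metric column weighted by class support."""
    return sum(row[col] * row[3] for row in report.values()) / total


macro = [round(macro_avg(c), 2) for c in range(3)]
weighted = [round(weighted_avg(c), 2) for c in range(3)]

print("Total samples:", total)
print("Macro avg (P, R, F1):", macro)
print("Weighted avg (P, R, F1):", weighted)
```

For single-label classification, the support-weighted recall equals overall accuracy, which is why both are reported as 0.73.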
83
 
84
+ ---
85
 
86
+ ## Applications
87
+ - Sentiment classification of Bengali-English social media text
88
+ - Research in **code-mixed NLP for Indic languages**
89
+ - Full fine-tuning baseline against which parameter-efficient fine-tuning can be benchmarked (compare with LoRA model)
90
 
91
+ ---
92
 
93
+ ## Limitations
94
+ - Heavily Romanized or slang-heavy Bengali may reduce accuracy
95
+ - Trained primarily on short-form text (tweets, comments, reviews)
96
+ - Not designed for abusive/toxic content moderation or safety-critical use cases
97
 
98
+ ---
99
 
100
+ ## Ethical Considerations
101
+ - Data reflects natural biases from social media sources
102
+ - Misclassifications may occur on sarcastic or offensive text
103
+ - Should not be the sole basis for critical decision-making
104
 
105
+ ---
106
+
107
+ ## Citation
108
+ If you use this model, please cite:
109
+
110
+ ```bibtex
111
+ @misc{das2025_bn_code_mix_sentiment,
112
+ author = {Swarnadeep Das},
113
+ title = {Bengali-English Code-Mixed Sentiment Model},
114
+ year = {2025},
115
+ url = {https://huggingface.co/jojocoder28/bengali-code-mix-sentiment}
116
+ }
117
+ ```
118
+
119
+ ---
120
 
121
+ ## Acknowledgements
122
+ - **Dataset**: [Bengali-English Code-Mixed Sentiment Dataset](https://huggingface.co/datasets/jojocoder28/bn_code_mix_sentiment_dataset)
123
+ - **Base model**: [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
124
+ - **Frameworks**: [Transformers](https://huggingface.co/docs/transformers), [Datasets](https://huggingface.co/docs/datasets)