Committed by codealchemist01
Commit 62f41ce · verified · 1 Parent(s): 49e2de9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +175 -0
README.md ADDED
@@ -0,0 +1,175 @@
---
language: tr
tags:
- sentiment-analysis
- turkish
- bert
- text-classification
- fine-tuned
license: apache-2.0
base_model: codealchemist01/turkish-sentiment-analysis
datasets:
- winvoker/turkish-sentiment-analysis-dataset
- WhiteAngelss/Turkce-Duygu-Analizi-Dataset
- maydogan/Turkish_SentimentAnalysis_TRSAv1
- turkish-nlp-suite/MusteriYorumlari
- W4nkel/turkish-sentiment-dataset
metrics:
- accuracy
- f1
- precision
- recall
---

# Turkish Sentiment Analysis Model (Fine-tuned)

A fine-tuned version of the [codealchemist01/turkish-sentiment-analysis](https://huggingface.co/codealchemist01/turkish-sentiment-analysis) model, trained further on additional, more balanced data to improve performance on the neutral and negative classes.

## Model Details

- **Base Model:** [codealchemist01/turkish-sentiment-analysis](https://huggingface.co/codealchemist01/turkish-sentiment-analysis)
- **Task:** Text Classification (Sentiment Analysis)
- **Language:** Turkish
- **Labels:** positive, negative, neutral
- **Fine-tuning Type:** Continued fine-tuning on a balanced dataset

## Training Data

This model was fine-tuned on a balanced combination of the original dataset and additional Turkish sentiment datasets:

### Original Dataset (from base model):
- `winvoker/turkish-sentiment-analysis-dataset` (440,641 samples)
- `WhiteAngelss/Turkce-Duygu-Analizi-Dataset` (440,641 samples)

### Additional Datasets for Fine-tuning:
- `maydogan/Turkish_SentimentAnalysis_TRSAv1` (150,000 samples)
- `turkish-nlp-suite/MusteriYorumlari` (73,920 samples)
- `W4nkel/turkish-sentiment-dataset` (4,800 samples)
- `mustfkeskin/turkish-movie-sentiment-analysis-dataset` (Kaggle, 83,227 samples)

### Final Balanced Dataset:
- **Total:** 556,888 samples
- **Positive:** 237,966 (42.7%)
- **Neutral:** 209,668 (37.6%)
- **Negative:** 109,254 (19.6%)

**Split Distribution:**
- **Training:** 445,510 samples (80%)
- **Validation:** 55,689 samples (10%)
- **Test:** 55,689 samples (10%)

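
The reported counts can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers listed in this section; the variable and function names are illustrative and are not taken from the author's training code.

```python
# Class and split sizes as reported in this README.
class_counts = {"positive": 237_966, "neutral": 209_668, "negative": 109_254}
split_sizes = {"train": 445_510, "validation": 55_689, "test": 55_689}


def shares(counts):
    """Return each entry's percentage of the total, rounded to one decimal."""
    whole = sum(counts.values())
    return {name: round(100 * n / whole, 1) for name, n in counts.items()}


total = sum(class_counts.values())
class_shares = shares(class_counts)
split_shares = shares(split_sizes)

print(f"Total samples: {total:,}")
print(f"Class shares (%): {class_shares}")
print(f"Split shares (%): {split_shares}")
```

The class counts and the split sizes both add up to the same total, and the splits correspond to an 80/10/10 partition.
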
## Training

### Fine-tuning Parameters:
- **Base Model:** codealchemist01/turkish-sentiment-analysis
- **Epochs:** 2
- **Learning Rate:** 1e-5 (lower than in the base model's initial training)
- **Batch Size:** 12 (per device)
- **Gradient Accumulation:** 2 (effective batch size: 24)
- **Max Length:** 128 tokens
- **Optimizer:** AdamW
- **Mixed Precision (FP16):** Enabled

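
As a rough illustration of what these settings imply, the sketch below derives the effective batch size and an approximate number of optimizer steps. It assumes a single device, which the README does not state. The dictionary keys follow the parameter names of `transformers.TrainingArguments`, but whether the author used the `Trainer` API at all is also an assumption.

```python
import math

# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names (assumed setup).
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 12,
    "gradient_accumulation_steps": 2,
    "fp16": True,
}

train_samples = 445_510  # training split size reported in this README
num_devices = 1          # assumption: single-GPU training

# Samples consumed per optimizer update.
effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_devices
)

# Approximate optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

With more than one device the effective batch size grows proportionally and the step counts shrink accordingly.
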
## Performance

### Test Set Results (55,689 samples):

**Overall Metrics:**
- **Accuracy:** 91.96%
- **Weighted F1:** 91.93%
- **Weighted Precision:** 91.93%
- **Weighted Recall:** 91.96%

### Per-Class Performance:

| Class    | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Negative | 90.65%    | 86.79% | 88.68%   | 10,926  |
| Neutral  | 90.91%    | 90.24% | 90.57%   | 20,967  |
| Positive | 93.41%    | 95.84% | 94.61%   | 23,796  |

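
The weighted metrics are support-weighted averages of the per-class values, and each F1-score is the harmonic mean of precision and recall. The following sketch recomputes them from the table as a consistency check; the helper names are illustrative, and the inputs are the rounded table values, so results agree only to about two decimals.

```python
# (precision %, recall %, F1 %, support) per class, copied from the table above.
per_class = {
    "negative": (90.65, 86.79, 88.68, 10_926),
    "neutral": (90.91, 90.24, 90.57, 20_967),
    "positive": (93.41, 95.84, 94.61, 23_796),
}

total_support = sum(row[3] for row in per_class.values())


def weighted_average(metric_index):
    """Average one metric column, weighting each class by its support."""
    weighted_sum = sum(row[metric_index] * row[3] for row in per_class.values())
    return weighted_sum / total_support


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


weighted_precision = weighted_average(0)
weighted_recall = weighted_average(1)
weighted_f1 = weighted_average(2)
recomputed_f1 = {name: f1_score(row[0], row[1]) for name, row in per_class.items()}

print(f"Weighted precision: {weighted_precision:.2f}%")
print(f"Weighted recall:    {weighted_recall:.2f}%")
print(f"Weighted F1:        {weighted_f1:.2f}%")
```

Support-weighted recall is mathematically identical to overall accuracy, which is why both are reported as 91.96%.
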
## Improvements Over Base Model

### Key Improvements:
1. **Neutral Class Performance:**
   - Better recognition of neutral expressions
   - Improved handling of ambiguous texts
   - Neutral F1-score: **90.57%** (improved from base model's test performance)

2. **Better Class Balance:**
   - More balanced dataset (reduced class imbalance)
   - Negative class improved with more training examples
   - Neutral class significantly enhanced

3. **General Performance:**
   - Maintained high accuracy (91.96%)
   - Improved F1-scores across all classes
   - Better generalization on diverse Turkish texts

### Test Results Comparison (15-sample test):
- **Base Model Accuracy:** 66.7% (10/15)
- **Fine-tuned Model Accuracy:** 86.7% (13/15)
- **Improvement:** +20.0 percentage points

### Per-Class Test Results (5 samples per class):
- **Neutral:** 0% → 80% (0/5 → 4/5, +80 percentage points)
- **Negative:** 100% → 80% (5/5 → 4/5, a 20-point decrease, but predictions are more balanced across classes)
- **Positive:** 100% → 100% (5/5, maintained)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "codealchemist01/turkish-sentiment-analysis-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text ("This product is ordinary, as I expected. Nothing special.")
text = "Bu ürün normal, beklediğim gibi. Özel bir şey yok."

# Tokenize, truncating to the 128-token length used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label_id = predictions.argmax().item()

# Map the predicted class id to its label
id2label = {0: "negative", 1: "neutral", 2: "positive"}
predicted_label = id2label[predicted_label_id]
confidence = predictions[0][predicted_label_id].item()

print(f"Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```

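
To make the prediction step concrete, here is a dependency-free sketch of what the softmax and argmax do to a single row of logits. The logits are made-up values for illustration, not real model output, and the id-to-label mapping is the one hard-coded in the usage snippet; it may be worth confirming it against `model.config.id2label` on the loaded model.

```python
import math

# Mapping copied from the usage snippet in this README.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def decode(logits):
    """Return (label, confidence) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits in the order negative, neutral, positive.
example_logits = [-1.2, 2.3, 0.4]
label, confidence = decode(example_logits)

print(f"Label: {label}")
print(f"Confidence: {confidence:.4f}")
```

The confidence is simply the softmax probability of the winning class, so it reflects how far that logit stands above the other two rather than a calibrated likelihood of being correct.
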
## Limitations

- The model may not perform well on very short texts (fewer than 3 words)
- Performance may vary across domains (social media, news, reviews)
- Some ambiguous neutral expressions may still be misclassified
- Negative class performance may vary across text types

## Citation

If you use this model, please cite:

```bibtex
@misc{turkish-sentiment-analysis-finetuned,
  title={Turkish Sentiment Analysis Model (Fine-tuned)},
  author={codealchemist01},
  year={2024},
  base_model={codealchemist01/turkish-sentiment-analysis},
  howpublished={\url{https://huggingface.co/codealchemist01/turkish-sentiment-analysis-finetuned}}
}
```

## License

Apache 2.0