prem79 committed on
Commit
3d78b63
·
verified ·
1 Parent(s): dfb2f37

Update README.md

Files changed (1)
  1. README.md +107 -122
README.md CHANGED
@@ -33,55 +33,53 @@ model-index:
33
 
34
  # sentrix_roberta_V2
35
 
36
- A fine-tuned RoBERTa model for binary sentiment classification on social media text. Trained on a balanced Twitter sentiment dataset with 88.2% accuracy on a held-out test set of 40,000 samples.
37
 
38
  ---
39
 
40
- ## Model Summary
41
 
42
  | Property | Value |
43
  |---|---|
44
  | Base model | `cardiffnlp/twitter-roberta-base-sentiment-latest` |
45
- | Architecture | RoBERTa-base |
46
  | Task | Binary Sentiment Classification |
47
- | Labels | `NEGATIVE` (0), `POSITIVE` (1) |
48
- | Test Accuracy | **88.21%** |
49
- | Test F1 | **88.21%** |
50
  | Training samples | ~80,000 |
51
- | Test samples | 40,000 (balanced) |
52
  | Max sequence length | 128 tokens |
 
53
  | Framework | PyTorch + HuggingFace Transformers |
54
 
55
  ---
56
 
57
- ## Intended Use
58
-
59
- This model is designed to classify the sentiment of short-form social media text — primarily tweets and product reviews — as either positive or negative.
60
-
61
- **Suitable for:**
62
- - Customer review sentiment classification
63
- - Social media monitoring
64
- - Product feedback analysis
65
- - Multilingual sentiment detection (EN, FR, ES, DE, PT)
66
 
67
- **Not suitable for:**
68
- - Long-form documents (truncated at 128 tokens)
69
- - Fine-grained emotion classification (joy, anger, fear, etc.)
70
- - Neutral/mixed sentiment detection (binary output only)
71
 
72
  ---
73
 
74
- ## Training Details
75
 
76
- ### Base Model
77
 
78
- Fine-tuned from `cardiffnlp/twitter-roberta-base-sentiment-latest`, which was itself pre-trained on 58M tweets. This domain-specific pretraining gives the model strong priors for informal language, slang, abbreviations, and emoji context.
79
 
80
- ### Dataset
81
-
82
- A balanced Twitter sentiment dataset sourced from Kaggle, split as follows:
83
-
84
- | Split | Samples | NEGATIVE | POSITIVE |
85
  |---|---|---|---|
86
  | Train | ~80,000 | 50% | 50% |
87
  | Validation | 20,000 | 50% | 50% |
@@ -89,60 +87,44 @@ A balanced Twitter sentiment dataset sourced from Kaggle, split as follows:
89
 
90
  ### Preprocessing
91
 
92
- Standard RoBERTa tweet preprocessing was applied:
93
-
94
- - URLs replaced with the token `http`
95
- - User mentions replaced with the token `@user`
96
- - Text truncated to 128 tokens maximum
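
The two substitutions listed above can be sketched as a standalone helper. This is a minimal illustration using the same regular expressions as the inference example in this README; the function name `normalize_tweet` and the pattern constants are hypothetical, and truncation to 128 tokens is left to the tokenizer.

```python
import re

# Patterns mirror the substitutions described above (names are illustrative).
URL_PATTERN = re.compile(r"http\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace URLs with 'http' and user mentions with '@user'."""
    text = URL_PATTERN.sub("http", text)
    return MENTION_PATTERN.sub("@user", text)


print(normalize_tweet("@alice check https://example.com/deal now"))
# @user check http now
```

Note that the mention pattern matches any `@` followed by word characters, so an email address would also be rewritten; whether that matters depends on the input data.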
97
 
98
- ### Hyperparameters
 
99
 
100
- | Parameter | Value |
101
- |---|---|
102
- | Optimizer | AdamW |
103
- | Learning rate | Default Trainer schedule |
104
- | Batch size | Default HuggingFace Trainer |
105
- | Max epochs | 10 |
106
- | Early stopping | Best checkpoint saved on validation loss |
107
- | Evaluation strategy | Per 500 steps |
108
- | Metric for best model | Accuracy + F1 |
109
- | Training platform | Kaggle (GPU) |
110
 
111
- ### Training Progress
112
 
113
- The model was evaluated every 500 steps. Training loss fell steadily across the first three epochs, while validation loss dropped quickly at first and then fluctuated between roughly 0.79 and 0.82:
114
 
115
  | Step | Train Loss | Val Loss | Accuracy | F1 |
116
  |---|---|---|---|---|
117
  | 500 | 0.8806 | 0.8685 | 85.00% | 85.00% |
118
  | 1000 | 0.8451 | 0.8348 | 86.25% | 86.25% |
 
119
  | 2000 | 0.8291 | 0.8075 | 86.84% | 86.83% |
 
120
  | 3000 | 0.7788 | 0.7987 | 87.32% | 87.31% |
 
121
  | 4000 | 0.7754 | 0.8005 | 87.53% | 87.53% |
 
122
  | 5000 | 0.7676 | 0.8098 | 87.59% | 87.58% |
 
123
  | 6000 | 0.7356 | 0.7944 | 87.72% | 87.72% |
 
124
  | 7000 | 0.7310 | 0.7979 | 87.68% | 87.68% |
 
125
  | 8000 | 0.6885 | 0.8235 | 87.74% | 87.74% |
126
  | 8500 | 0.6905 | 0.8104 | 87.72% | 87.72% |
127
 
128
- The best checkpoint was saved and used for final evaluation.
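
Because the table logs both validation loss and accuracy, the two criteria need not select the same checkpoint. The sketch below shows this using only the rows listed above; the tuple layout and variable names are illustrative and are not taken from the training script.

```python
# (step, train_loss, val_loss, accuracy_percent), copied from the table above
rows = [
    (500, 0.8806, 0.8685, 85.00),
    (1000, 0.8451, 0.8348, 86.25),
    (2000, 0.8291, 0.8075, 86.84),
    (3000, 0.7788, 0.7987, 87.32),
    (4000, 0.7754, 0.8005, 87.53),
    (5000, 0.7676, 0.8098, 87.59),
    (6000, 0.7356, 0.7944, 87.72),
    (7000, 0.7310, 0.7979, 87.68),
    (8000, 0.6885, 0.8235, 87.74),
    (8500, 0.6905, 0.8104, 87.72),
]

# Pick the row with the lowest validation loss and the row with the highest accuracy.
best_by_loss = min(rows, key=lambda row: row[2])
best_by_accuracy = max(rows, key=lambda row: row[3])

print("lowest val loss at step", best_by_loss[0])       # lowest val loss at step 6000
print("highest accuracy at step", best_by_accuracy[0])  # highest accuracy at step 8000
```

Over these rows the two criteria disagree, so which checkpoint counts as "best" depends on the metric the run was configured to track.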
129
 
130
  ---
131
 
132
- ## Evaluation Results
133
-
134
- Evaluated on the held-out test set of 40,000 samples (20,000 per class).
135
-
136
- ### Test Set Metrics
137
-
138
- | Metric | Value |
139
- |---|---|
140
- | Accuracy | **0.8821** |
141
- | F1 (macro) | **0.8821** |
142
- | Eval loss | 0.8102 |
143
- | Samples/second | 287.63 |
144
 
145
- ### Classification Report
146
 
147
  ```
148
  precision recall f1-score support
@@ -155,13 +137,20 @@ Evaluated on the held-out test set of 40,000 samples (20,000 per class).
155
  weighted avg 0.88 0.88 0.88 40,000
156
  ```
157
 
158
- The model achieves symmetric performance across both classes, indicating no label bias from the balanced training set.
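
To see why accuracy and macro F1 coincide when errors are symmetric on a balanced test set, here is a small worked sketch. The confusion-matrix counts are hypothetical, chosen only to reproduce the reported 0.8821; the actual per-class counts are not given in this README.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy and macro-averaged F1 from binary confusion-matrix counts."""

    def f1(hit: int, false_alarm: int, miss: int) -> float:
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + miss)
        return 2 * precision * recall / (precision + recall)

    f1_positive = f1(tp, fp, fn)
    # For the negative class the roles flip: true negatives are the hits,
    # false negatives are its false alarms, false positives are its misses.
    f1_negative = f1(tn, fn, fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "macro_f1": (f1_positive + f1_negative) / 2,
    }


# Hypothetical counts: 20,000 samples per class, same error count on each side.
metrics = binary_metrics(tp=17_642, fp=2_358, fn=2_358, tn=17_642)
print({name: round(value, 4) for name, value in metrics.items()})
# {'accuracy': 0.8821, 'macro_f1': 0.8821}
```

With unequal error counts per class the two numbers would drift apart, which is why matching accuracy and F1 is consistent with balanced behavior.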
159
 
160
  ---
161
 
162
- ## Usage
163
 
164
- ### Direct Inference with Pipeline
165
 
166
  ```python
167
  from transformers import pipeline
@@ -171,104 +160,111 @@ classifier = pipeline(
171
  model="prem79/sentrix_roberta_V2"
172
  )
173
 
174
- result = classifier("The camera quality on this phone is absolutely stunning")
175
- print(result)
176
  # [{'label': 'POSITIVE', 'score': 0.9505}]
 
 
 
177
  ```
178
 
179
- ### Manual Inference
180
 
181
  ```python
182
  import torch
183
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
184
  import torch.nn.functional as F
 
 
185
 
186
- model_id = "prem79/sentrix_roberta_V2"
187
-
188
- tokenizer = AutoTokenizer.from_pretrained(model_id)
189
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
190
  model.eval()
191
 
192
  def predict(text):
193
- # Preprocess (standard RoBERTa tweet normalization)
194
- import re
195
  text = re.sub(r'http\S+', 'http', text)
196
  text = re.sub(r'@\w+', '@user', text)
197
 
198
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
199
  with torch.no_grad():
200
- logits = model(**inputs).logits
201
- probs = F.softmax(logits, dim=-1)[0]
202
 
203
- labels = ["NEGATIVE", "POSITIVE"]
204
- sentiment = labels[probs.argmax().item()]
205
  return {
206
- "sentiment": sentiment,
207
- "negative": round(probs[0].item() * 100, 2),
208
- "positive": round(probs[1].item() * 100, 2),
209
  }
210
 
211
- # Examples
212
- print(predict("The new phone camera is absolutely stunning at night"))
213
  # {'sentiment': 'POSITIVE', 'negative': 4.95, 'positive': 95.05}
214
 
215
- print(predict("Battery is terrible, drains in 2 hours, not worth the price"))
216
  # {'sentiment': 'NEGATIVE', 'negative': 94.72, 'positive': 5.28}
217
 
218
- print(predict("Ce produit est incroyable! Très satisfait de la qualité."))
219
  # {'sentiment': 'POSITIVE', 'negative': 7.18, 'positive': 92.82}
 
220
  ```
221
 
222
- ### Batch Inference
223
 
224
  ```python
225
  texts = [
226
- "Absolutely love this product!",
227
- "Worst experience I have ever had",
228
- "This product is okay I guess, nothing special",
229
  ]
230
 
231
- inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
 
 
 
232
  with torch.no_grad():
233
- logits = model(**inputs).logits
234
- probs = F.softmax(logits, dim=-1)
235
 
236
- for text, prob in zip(texts, probs):
237
- label = "POSITIVE" if prob[1] > prob[0] else "NEGATIVE"
238
- print(f"{label} ({prob[1].item():.2%} pos) | {text}")
239
  ```
240
 
241
  ---
242
 
243
- ## Live Demo
244
 
245
- This model powers the SENTRIX sentiment analysis web application:
 
 
 
 
 
 
 
 
 
 
 
 
246
 
247
  - Frontend: https://prem-479.github.io/sentrix_ML_IA/
248
- - Source: https://github.com/prem-479/sentrix_ML_IA
249
 
250
- The application demonstrates:
251
- - Real-time sentiment classification
252
- - Aspect extraction from product reviews
253
- - Multilingual input handling (EN, FR, ES, DE, PT)
254
- - Emoji signal detection
255
- - Confidence score visualization
256
 
257
  ---
258
 
259
- ## Limitations
260
 
261
- - **Binary only** - outputs NEGATIVE or POSITIVE. Sarcasm and neutral/mixed sentiment are classified as one or the other based on the dominant signal.
262
- - **Short text optimized** — trained on tweets (short text). Performance may degrade on long documents due to the 128-token truncation limit.
263
- - **Sarcasm** — the model does not detect sarcasm. "Oh great, another broken product" will likely be classified as POSITIVE.
264
- - **Multilingual** — the base model has some cross-lingual capability from Twitter pretraining, but was fine-tuned primarily on English data. Non-English accuracy is lower than English accuracy.
265
- - **Domain shift** — trained on Twitter/product review data. Performance on other domains (news, medical, legal) has not been evaluated.
 
266
 
267
  ---
268
 
269
  ## Citation
270
 
271
- If you use this model, please cite the base model:
272
 
273
  ```bibtex
274
  @inproceedings{barbieri-etal-2020-tweeteval,
@@ -282,15 +278,4 @@ If you use this model, please cite the base model:
282
 
283
  ---
284
 
285
- ## Model Files
286
-
287
- | File | Description |
288
- |---|---|
289
- | `config.json` | Model architecture and label mapping |
290
- | `model.safetensors` | Model weights (499 MB) |
291
- | `tokenizer.json` | Tokenizer vocabulary |
292
- | `tokenizer_config.json` | Tokenizer configuration |
293
-
294
- ---
295
-
296
- *Fine-tuned on Kaggle using GPU acceleration. Trained with HuggingFace Transformers and PyTorch.*
 
33
 
34
  # sentrix_roberta_V2
35
 
36
+ Fine-tuned RoBERTa for binary sentiment classification on social media text. 88.21% accuracy on a held-out test set of 40,000 balanced samples, trained on Kaggle with GPU acceleration.
37
+
38
+ Runs locally via the HuggingFace Transformers library. The weights download once on first use and are cached for subsequent runs. No cloud subscription required.
39
+
40
+ ---
41
+
42
+ ## What this model does
43
+
44
+ It reads text, returns a POSITIVE or NEGATIVE label, and provides per-class confidence scores. Straightforward by design.
45
+
46
+ Labels: `NEGATIVE` (0) and `POSITIVE` (1). Binary output only. See the Limitations section if you need neutral classification.
47
+
48
+ Test accuracy: **88.21%**. Symmetric across both classes, meaning it is not secretly biased toward one label because the training set was balanced from the start.
49
 
50
  ---
51
 
52
+ ## Model details
53
 
54
  | Property | Value |
55
  |---|---|
56
  | Base model | `cardiffnlp/twitter-roberta-base-sentiment-latest` |
57
+ | Architecture | RoBERTa-base (125M parameters) |
58
  | Task | Binary Sentiment Classification |
59
+ | Labels | NEGATIVE (0), POSITIVE (1) |
60
+ | Test Accuracy | 88.21% |
61
+ | Test F1 | 88.21% |
62
  | Training samples | ~80,000 |
63
+ | Test samples | 40,000 (perfectly balanced) |
64
  | Max sequence length | 128 tokens |
65
+ | Training platform | Kaggle (GPU) |
66
  | Framework | PyTorch + HuggingFace Transformers |
67
 
68
  ---
69
 
70
+ ## Why this base model
 
 
 
 
 
 
 
 
71
 
72
+ Cardiff NLP's `twitter-roberta-base-sentiment-latest` was pretrained on 58 million tweets before it ever saw the fine-tuning data. That means it already understands how people actually write online - abbreviations, slang, run-on sentences, missing punctuation, words that autocorrect clearly did not help with. Starting from that checkpoint instead of vanilla RoBERTa meant the model came in with real-world social media knowledge rather than learning it from scratch during fine-tuning.
 
 
 
73
 
74
  ---
75
 
76
+ ## Training
77
 
78
+ ### Data
79
 
80
+ Balanced Twitter sentiment dataset from Kaggle. Equal number of positive and negative samples so the model cannot cheat by defaulting to the majority class.
81
 
82
+ | Split | Samples | Negative | Positive |
 
 
 
 
83
  |---|---|---|---|
84
  | Train | ~80,000 | 50% | 50% |
85
  | Validation | 20,000 | 50% | 50% |
 
87
 
88
  ### Preprocessing
89
 
90
+ Two substitutions applied before tokenization, matching the convention the base model was pretrained with:
 
 
 
 
91
 
92
+ - URLs replaced with `http`
93
+ - User mentions replaced with `@user`
94
 
95
+ Skip these and you will see a small but consistent accuracy drop on anything with links or @mentions. The model expects those specific tokens.
 
 
 
 
 
 
 
 
 
96
 
97
+ ### Training run
98
 
99
+ Trained with the HuggingFace `Trainer` API, evaluated every 500 steps. Best checkpoint saved on highest validation accuracy. Training was stopped at step 8500 (epoch 3.4 of 10 max) because the validation metrics had plateaued and the best checkpoint had already been captured.
100
 
101
  | Step | Train Loss | Val Loss | Accuracy | F1 |
102
  |---|---|---|---|---|
103
  | 500 | 0.8806 | 0.8685 | 85.00% | 85.00% |
104
  | 1000 | 0.8451 | 0.8348 | 86.25% | 86.25% |
105
+ | 1500 | 0.8336 | 0.8187 | 86.48% | 86.48% |
106
  | 2000 | 0.8291 | 0.8075 | 86.84% | 86.83% |
107
+ | 2500 | 0.8155 | 0.8062 | 87.26% | 87.26% |
108
  | 3000 | 0.7788 | 0.7987 | 87.32% | 87.31% |
109
+ | 3500 | 0.7690 | 0.7931 | 87.35% | 87.34% |
110
  | 4000 | 0.7754 | 0.8005 | 87.53% | 87.53% |
111
+ | 4500 | 0.7661 | 0.7966 | 87.61% | 87.61% |
112
  | 5000 | 0.7676 | 0.8098 | 87.59% | 87.58% |
113
+ | 5500 | 0.7407 | 0.8080 | 87.56% | 87.56% |
114
  | 6000 | 0.7356 | 0.7944 | 87.72% | 87.72% |
115
+ | 6500 | 0.7205 | 0.7986 | 87.72% | 87.72% |
116
  | 7000 | 0.7310 | 0.7979 | 87.68% | 87.68% |
117
+ | 7500 | 0.7232 | 0.7959 | 87.69% | 87.68% |
118
  | 8000 | 0.6885 | 0.8235 | 87.74% | 87.74% |
119
  | 8500 | 0.6905 | 0.8104 | 87.72% | 87.72% |
120
 
121
+ Training loss went from 0.88 to 0.69. Validation loss reached its lowest logged value (0.7931) at step 3500, then fluctuated between roughly 0.79 and 0.82 while accuracy plateaued between 87.5% and 87.7% - a sign that further training was unlikely to produce a better checkpoint.
122
 
123
  ---
124
 
125
+ ## Results
 
 
 
 
 
 
 
 
 
 
 
126
 
127
+ Evaluated on the held-out test set of 40,000 samples, never seen during training or validation.
128
 
129
  ```
130
  precision recall f1-score support
 
137
  weighted avg 0.88 0.88 0.88 40,000
138
  ```
139
 
140
+ Precision and recall are identical for both classes. The model is not sacrificing recall for precision or the other way around - it is genuinely balanced. That is what a properly balanced training set gets you.
141
+
142
+ | Metric | Value |
143
+ |---|---|
144
+ | Accuracy | 0.8821 |
145
+ | F1 (macro) | 0.8821 |
146
+ | Eval loss | 0.8102 |
147
+ | Throughput | 287.6 samples/second |
148
 
149
  ---
150
 
151
+ ## How to use it
152
 
153
+ ### Quickest way - pipeline
154
 
155
  ```python
156
  from transformers import pipeline
 
160
  model="prem79/sentrix_roberta_V2"
161
  )
162
 
163
+ print(classifier("The camera quality on this phone is absolutely stunning"))
 
164
  # [{'label': 'POSITIVE', 'score': 0.9505}]
165
+
166
+ print(classifier("Battery is terrible, drains in 2 hours, not worth the price"))
167
+ # [{'label': 'NEGATIVE', 'score': 0.9472}]
168
  ```
169
 
170
+ ### Full manual inference
171
 
172
  ```python
173
  import torch
 
174
  import torch.nn.functional as F
175
+ import re
176
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
177
 
178
+ tokenizer = AutoTokenizer.from_pretrained("prem79/sentrix_roberta_V2")
179
+ model = AutoModelForSequenceClassification.from_pretrained("prem79/sentrix_roberta_V2")
 
 
180
  model.eval()
181
 
182
  def predict(text):
183
+ # preprocess - do not skip this, the model expects these tokens
 
184
  text = re.sub(r'http\S+', 'http', text)
185
  text = re.sub(r'@\w+', '@user', text)
186
 
187
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
188
  with torch.no_grad():
189
+ probs = F.softmax(model(**inputs).logits, dim=-1)[0]
 
190
 
 
 
191
  return {
192
+ "sentiment": "POSITIVE" if probs[1] > probs[0] else "NEGATIVE",
193
+ "negative": round(probs[0].item() * 100, 2),
194
+ "positive": round(probs[1].item() * 100, 2),
195
  }
196
 
197
+ predict("The new phone camera is absolutely stunning at night")
 
198
  # {'sentiment': 'POSITIVE', 'negative': 4.95, 'positive': 95.05}
199
 
200
+ predict("Battery is terrible, drains in 2 hours, not worth the price")
201
  # {'sentiment': 'NEGATIVE', 'negative': 94.72, 'positive': 5.28}
202
 
203
+ predict("Ce produit est incroyable! Très satisfait de la qualité.")
204
  # {'sentiment': 'POSITIVE', 'negative': 7.18, 'positive': 92.82}
205
+ # cross-lingual capability from base model pretraining
206
  ```
207
 
208
+ ### Batch inference
209
 
210
  ```python
211
  texts = [
212
+ "Absolutely love this, best purchase this year",
213
+ "Returned it on day two, complete waste of money",
214
+ "It is okay I guess, nothing to write home about",
215
  ]
216
 
217
+ inputs = tokenizer(
218
+ texts, padding=True, truncation=True,
219
+ max_length=128, return_tensors="pt"
220
+ )
221
  with torch.no_grad():
222
+ probs = F.softmax(model(**inputs).logits, dim=-1)
 
223
 
224
+ for text, p in zip(texts, probs):
225
+ label = "POSITIVE" if p[1] > p[0] else "NEGATIVE"
226
+ print(f"{label} ({p[1].item():.1%} pos) | {text}")
227
  ```
228
 
229
  ---
230
 
231
+ ## What it cannot do
232
 
233
+ Known constraints and failure modes:
234
+
235
+ - **Neutral sentiment** - binary output only. Text that is neither positive nor negative gets pushed into whichever class the token distribution leans toward. If you need three-way classification, this is not your model.
236
+ - **Sarcasm** - "oh great, another product that broke on day one, absolutely love it" will likely be classified as POSITIVE. The model sees "great," "love," and decides accordingly. Sarcasm detection is a different and significantly harder problem.
237
+ - **Long documents** - hard truncation at 128 tokens. Anything longer gets cut off. The first 128 tokens determine the output. If the important negative content is at the end of a long review, the model might miss it.
238
+ - **Domain shift** - trained on tweets and product reviews. Performance on news articles, legal documents, medical text, or academic writing has not been tested and will probably be worse.
239
+ - **Non-English accuracy** - the base model has cross-lingual capability from Twitter pretraining but the fine-tuning data was primarily English. French, Spanish, German, and Portuguese work but at lower confidence than English.
240
+
241
+ ---
242
+
243
+ ## Live demo
244
+
245
+ This model powers the SENTRIX web application:
246
 
247
  - Frontend: https://prem-479.github.io/sentrix_ML_IA/
248
+ - Source code: https://github.com/prem-479/sentrix_ML_IA
249
 
250
+ The app runs the model locally on your machine. The frontend just sends text to your Flask server and displays the results. No cloud inference. No data leaving your device.
 
 
 
 
 
251
 
252
  ---
253
 
254
+ ## Files in this repository
255
 
256
+ | File | Size | What it is |
257
+ |---|---|---|
258
+ | `config.json` | 886 B | Model architecture config and label mapping |
259
+ | `model.safetensors` | 499 MB | The actual weights. This is the big one. |
260
+ | `tokenizer.json` | 3.56 MB | Tokenizer vocabulary |
261
+ | `tokenizer_config.json` | 387 B | Tokenizer settings |
262
 
263
  ---
264
 
265
  ## Citation
266
 
267
+ This model fine-tunes Cardiff NLP's RoBERTa checkpoint. If you use this in something academic:
268
 
269
  ```bibtex
270
  @inproceedings{barbieri-etal-2020-tweeteval,
 
278
 
279
  ---
280
 
281
+ *Trained on Kaggle with GPU acceleration. Fine-tuned from cardiffnlp/twitter-roberta-base-sentiment-latest.*