---
license: mit
language:
  - en
library_name: sklearn
tags:
  - text-classification
  - emotion-detection
  - sklearn
  - skops
datasets:
  - custom
metrics:
  - accuracy
pipeline_tag: text-classification
---

# 6 Emotions Text Classification Model

A logistic regression model for classifying text into 6 emotion categories.

## Model Description

- **Model type:** Logistic Regression with TF-IDF features
- **Language:** English
- **Task:** Multi-class text classification
- **Labels:** anger, fear, joy, love, sadness, surprise

## Training Data

This model was trained on a merged dataset from two sources:

1. **GoEmotions** (Google): A corpus of 58k Reddit comments labeled with 27 emotion categories plus neutral
   - Source: [Kaggle](https://www.kaggle.com/datasets/shivamb/go-emotions-google-emotions-dataset)
   - Paper: [arXiv:2005.00547](https://arxiv.org/abs/2005.00547)

2. **Emotion Dataset**: Text samples labeled with basic emotions
   - Source: [Kaggle](https://www.kaggle.com/datasets/parulpandey/emotion-dataset/data)
   - Paper: [EMNLP 2018](https://www.aclweb.org/anthology/D18-1404)

Labels were mapped to 6 core emotion categories for this model.
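
The label mapping step might look roughly like the sketch below. The mapping table is hypothetical and partial: this card does not list which source labels were assigned to which category, so `GOEMOTIONS_TO_CORE`, `map_label`, and the choice to drop unmapped labels are illustrative assumptions only.

```python
from typing import Optional

# Hypothetical, partial mapping from GoEmotions labels to the 6 target labels.
# The actual mapping used for this model is not documented in this card.
GOEMOTIONS_TO_CORE = {
    "anger": "anger",
    "annoyance": "anger",
    "fear": "fear",
    "nervousness": "fear",
    "joy": "joy",
    "amusement": "joy",
    "love": "love",
    "sadness": "sadness",
    "grief": "sadness",
    "surprise": "surprise",
}


def map_label(label: str) -> Optional[str]:
    """Return the core emotion for a source label, or None if it is unmapped."""
    return GOEMOTIONS_TO_CORE.get(label)


# Toy rows of (text, source_label); unmapped labels such as "neutral" are dropped here.
rows = [("That made my day", "amusement"), ("ok", "neutral")]
mapped = [(text, map_label(label)) for text, label in rows if map_label(label) is not None]
print(mapped)
```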

## Features

The model uses a combination of:
- **Word-level TF-IDF:** unigrams to trigrams (max 20,000 features)
- **Character-level TF-IDF:** 3-5 character n-grams (max 15,000 features)
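
A minimal sketch of how the two feature sets above could be combined with scikit-learn's `FeatureUnion`. The n-gram ranges and feature caps come from the list above; the step names, the `build_features` helper, and the use of `analyzer="char"` (rather than `"char_wb"`) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


def build_features() -> FeatureUnion:
    """Concatenate word-level and character-level TF-IDF features."""
    # Word unigrams to trigrams, capped at 20,000 features.
    word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=20_000)
    # Character 3- to 5-grams, capped at 15,000 features (analyzer choice is an assumption).
    char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=15_000)
    return FeatureUnion([("word", word_tfidf), ("char", char_tfidf)])


features = build_features()
X = features.fit_transform(
    ["I am so happy today", "This is terrifying", "I miss you so much"]
)
print(X.shape)  # (number of texts, word features + char features)
```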

## Training

- **Framework:** scikit-learn
- **Hyperparameter tuning:** GridSearchCV with 3-fold cross-validation
- **Class balancing:** `class_weight='balanced'`
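
The training setup above can be sketched as follows. Only `GridSearchCV`, the 3 folds, and `class_weight='balanced'` are taken from this card; the single-vectorizer pipeline, the grid over `C`, `max_iter=1000`, and the toy data are assumptions made to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only; the real model was trained on 41,974 samples.
texts = [
    "I am so happy today", "What a wonderful surprise party", "This made me smile",
    "I love sunny mornings", "Best day of my life", "Feeling great and cheerful",
    "I feel so alone tonight", "This is heartbreaking news", "I cannot stop crying",
    "Everything feels hopeless", "I miss my old friends", "Such a gloomy sad day",
]
labels = ["joy"] * 6 + ["sadness"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Hypothetical search space; the grid actually used is not documented here.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```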

## Performance

### Model Metrics
- **Cross-Validation Accuracy:** 0.7163
- **Test Accuracy:** 0.70
- **Training Set Size:** 41,974 samples
- **Test Set Size:** 6,067 samples

### Confusion Matrix
![Confusion Matrix](figures/confusionMaxtrixNormalized.png)
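
A normalized confusion matrix like the one in the figure can be computed along these lines. This sketch assumes row-wise normalization (each row sums to 1 over the true class), which this card does not state explicitly, and uses made-up predictions rather than the model's real outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# Made-up labels for illustration; replace with real test labels and predictions.
y_true = ["anger", "fear", "joy", "joy", "love", "sadness", "surprise", "sadness"]
y_pred = ["anger", "fear", "joy", "love", "love", "sadness", "joy", "fear"]

# normalize="true" divides each row by the number of samples in that true class.
cm = confusion_matrix(y_true, y_pred, labels=LABELS, normalize="true")
print(np.round(cm, 2))
```

With this normalization, the diagonal entries are the per-class recall values.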

## Limitations
- Trained on English text; performance on other languages is not guaranteed.
- May not generalize well to formal or technical text, since the training data includes informal sources such as Reddit comments.
- Single-label classification (no multi-emotion detection).
- Potential biases from training data sources.

## Usage

```python
import skops.io as sio

# Load model (review untrusted types before loading)
trusted_types = [
    "sklearn.pipeline.Pipeline",
    "sklearn.linear_model._logistic.LogisticRegression",
    "sklearn.feature_extraction.text.TfidfVectorizer",
    "numpy.ndarray",
    "numpy.dtype"
]

model = sio.load("6emotions_model.skops", trusted=trusted_types)

# Predict
text = "I'm so happy today!"
prediction = model.predict([text])
print(prediction)  # ['joy']
```