Sentiment Analysis Model
This model is designed for sentiment analysis of English text. It predicts the sentiment of a given text as one of three classes: positive, neutral, or negative. The model was trained on a combination of datasets from Kaggle and Sentiment140.
Model Description
Two approaches were trained and compared:
- Baseline Model: A classical machine learning pipeline using TF-IDF vectorization and Logistic Regression.
- CNN Model: A lightweight Convolutional Neural Network (CNN) implemented in Keras.
The best-performing model (based on validation macro-F1 score) is selected for inference.
Baseline Model
- Vectorizer: TF-IDF (word + character n-grams)
- Classifier: Logistic Regression
- Features: 200,000 max features, n-gram range (1, 2)
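The baseline configuration above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the card's actual training code: the function name `build_baseline`, the character n-gram range (3, 5), the `char_wb` analyzer, `max_iter=1000`, and the choice to apply the 200,000-feature cap to each vectorizer separately are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline(max_features=200_000):
    """TF-IDF (word + character n-grams) feeding a Logistic Regression classifier."""
    features = FeatureUnion([
        # Word unigrams and bigrams, i.e. the n-gram range (1, 2) listed in the card.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 max_features=max_features)),
        # Character n-grams; the (3, 5) range is an assumption, not from the card.
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                 max_features=max_features)),
    ])
    return Pipeline([
        ("tfidf", features),
        # max_iter is raised from the default to help convergence (assumption).
        ("clf", LogisticRegression(max_iter=1000)),
    ])
```

Calling `build_baseline().fit(texts, labels)` on lists of strings and then `.predict(new_texts)` returns one of the label strings seen during training for each input.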
CNN Model
- Tokenizer: Keras Tokenizer
- Architecture: Embedding layer -> 1D Convolution -> Global Max Pooling -> Dense layers
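The architecture listed above might look like the following in Keras. This is a sketch under stated assumptions: every hyperparameter (vocabulary size, embedding dimension, filter count, kernel size, dense width), the optimizer, and the loss are illustrative choices that the card does not specify, and `build_cnn` is a hypothetical name. Inputs are assumed to be already tokenized and padded to integer sequences.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(vocab_size=20_000, embed_dim=64, num_classes=3):
    """Embedding -> 1D Convolution -> Global Max Pooling -> Dense layers."""
    model = keras.Sequential([
        # Maps each token id to a dense vector.
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        # Slides filters over windows of 5 consecutive tokens (assumed size).
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        # Keeps the strongest activation of each filter across the sequence.
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        # One probability per class: negative, neutral, positive.
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer class labels are assumed, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Global max pooling makes the output size independent of the sequence length, so the same network handles short and long texts as long as each input is at least as long as the convolution kernel.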
Training Data
The model was trained on a combination of datasets:
- Kaggle Train: 27,477 samples
- Sentiment140 Train: 300,000 balanced samples
- Sentiment140 Manual Test: 516 samples
The datasets were cleaned and unified into a common schema with text and sentiment columns.
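As an illustration of the unification step, the sketch below maps rows from differently shaped sources onto the common text and sentiment columns. The helper `unify` and the source column names passed to it are hypothetical. The label map reflects Sentiment140's numeric polarity encoding (0 = negative, 2 = neutral, 4 = positive). The cleaning shown here (trimming whitespace, dropping empty texts and unknown labels) is an assumption about what "cleaned" involved.

```python
# Sentiment140 encodes polarity as 0 (negative), 2 (neutral), 4 (positive).
SENTIMENT140_LABELS = {"0": "negative", "2": "neutral", "4": "positive"}
VALID_LABELS = {"positive", "neutral", "negative"}


def unify(rows, text_key, label_key, label_map=None):
    """Map source rows (dicts) onto the common {text, sentiment} schema."""
    unified = []
    for row in rows:
        text = (row.get(text_key) or "").strip()
        label = str(row.get(label_key, "")).strip().lower()
        if label_map is not None:
            # Translate source-specific codes; unknown codes become None.
            label = label_map.get(label)
        # Drop rows with empty text or a label outside the three classes.
        if text and label in VALID_LABELS:
            unified.append({"text": text, "sentiment": label})
    return unified
```

For example, `unify(rows, "text", "target", SENTIMENT140_LABELS)` would convert rows carrying a numeric `target` field, while sources that already use string labels can omit the map.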
Evaluation
Both models were evaluated on a stratified validation split (15% of the training data), and the one with the higher macro-F1 score was selected as the best model.
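The evaluation procedure can be sketched in plain Python as below. This is an illustrative reimplementation, not the card's code: the function names, the random seed, and the per-class rounding rule are assumptions. Macro-F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of its size.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=42):
    """Return (train_idx, val_idx) with each class split at roughly val_fraction."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_best(val_scores):
    """Pick the model name with the highest validation macro-F1."""
    return max(val_scores, key=val_scores.get)
```

With a dictionary of validation scores keyed by model name, `select_best` returns the key with the highest score; ties resolve to the first entry encountered.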