Sentiment Analysis Model

This model performs sentiment analysis on English text, classifying each input as one of three classes: positive, neutral, or negative. It was trained on a combination of datasets from Kaggle and Sentiment140.

Model Description

The model card describes two approaches:

Baseline Model: A classical machine learning pipeline using TF-IDF vectorization and Logistic Regression.
CNN Model: A lightweight Convolutional Neural Network (CNN) implemented in Keras.

Whichever model achieves the higher macro-F1 score on the validation split is selected for inference.

Baseline Model

Vectorizer: TF-IDF (word + character n-grams)
Classifier: Logistic Regression
Features: 200,000 max features, n-gram range (1, 2)
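
A minimal sketch of such a pipeline in scikit-learn, assuming word-level TF-IDF over unigrams and bigrams only. How the character n-grams are combined with the word features is not stated here, so they are left out. The toy texts and the max_iter value are illustrative assumptions, not the settings used for this model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy examples for illustration only; not taken from the training data.
texts = [
    "I love this phone", "what a great day", "absolutely wonderful service",
    "I hate this phone", "what a terrible day", "absolutely awful service",
    "the phone is on the table", "the meeting is at noon", "it is a day of the week",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

baseline = Pipeline([
    # Settings from the card: up to 200,000 features, n-gram range (1, 2).
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    # max_iter is an assumed value, raised so the solver has room to converge.
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)

# predict() returns one of the three class labels per input text.
predictions = baseline.predict(["I love this wonderful day"])
```

Keeping the vectorizer and classifier in one Pipeline means the same fitted vocabulary is applied at inference time, so raw strings can be passed straight to predict().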

CNN Model

Tokenizer: Keras Tokenizer
Architecture: Embedding layer -> 1D Convolution -> Global Max Pooling -> Dense layers
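
The architecture above could be sketched in Keras roughly as follows. Every size here (vocabulary, sequence length, embedding width, filter count, kernel size, dense units) is an assumption chosen for illustration; the card does not state the actual values. The model expects padded sequences of integer token ids, as a tokenizer would produce.

```python
from tensorflow import keras
from tensorflow.keras import layers

# All sizes below are illustrative assumptions, not the card's settings.
VOCAB_SIZE = 20_000   # tokenizer vocabulary size
MAX_LEN = 64          # padded sequence length
NUM_CLASSES = 3       # positive, neutral, negative

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    # Map each token id to a dense vector.
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    # Slide filters over windows of 5 consecutive tokens.
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    # Keep the strongest response of each filter across the sequence.
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    # One probability per sentiment class.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```

The sparse categorical loss assumes the labels are encoded as integers 0 to 2; one-hot labels would call for categorical_crossentropy instead.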

Training Data

The following datasets were combined:

Kaggle Train: 27,477 samples
Sentiment140 Train: 300,000 balanced samples
Sentiment140 Manual Test: 516 samples

The datasets were cleaned and unified into a common schema with text and sentiment columns.
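
As an illustration of that unification step, here is a small pandas sketch. The raw rows, the Kaggle column names, and the helper to_common_schema are hypothetical. The 0/2/4 polarity encoding is the one used by Sentiment140. The actual cleaning rules are not described here, so the sketch only strips whitespace and drops unusable rows.

```python
import pandas as pd

# Hypothetical raw rows; the real files have more columns and rows.
kaggle_raw = pd.DataFrame({
    "textID": ["a1", "a2"],
    "text": [" Great service! ", "It is Monday"],
    "sentiment": ["positive", "neutral"],
})
s140_raw = pd.DataFrame({
    "target": [0, 4],
    "text": ["worst flight ever", "loving the new album"],
})

# Sentiment140 encodes polarity as 0 = negative, 2 = neutral, 4 = positive.
S140_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def to_common_schema(df, text_col, label_col, label_map=None):
    """Return a frame with only `text` and `sentiment` columns."""
    out = pd.DataFrame({
        "text": df[text_col].astype(str).str.strip(),
        "sentiment": df[label_col],
    })
    if label_map is not None:
        # Translate source-specific codes into the shared label names.
        out["sentiment"] = out["sentiment"].map(label_map)
    # Drop rows with a missing label or empty text.
    out = out.dropna(subset=["sentiment"])
    return out[out["text"] != ""].reset_index(drop=True)


unified = pd.concat(
    [
        to_common_schema(kaggle_raw, "text", "sentiment"),
        to_common_schema(s140_raw, "text", "target", S140_LABELS),
    ],
    ignore_index=True,
)
```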

Evaluation

Both models were evaluated on a stratified validation split (15% of the training data), and the one with the higher macro-F1 score was selected.
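
The split and selection procedure can be sketched with scikit-learn as below. The toy data, the random_state, and in particular the two sets of "predictions" are made-up placeholders so the example runs on its own; they are not results from this model.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy labelled data: 20 examples per class (illustrative only).
labels = ["positive"] * 20 + ["neutral"] * 20 + ["negative"] * 20
texts = [f"example {i}" for i in range(len(labels))]

# Hold out 15% for validation, preserving class proportions.
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)

# Placeholder validation predictions standing in for the two candidates.
val_preds = {
    "baseline": list(val_y),            # pretend every prediction is correct
    "cnn": ["positive"] * len(val_y),   # pretend it always says "positive"
}

# Macro-F1 averages the per-class F1 scores, weighting each class equally.
scores = {
    name: f1_score(val_y, preds, average="macro", zero_division=0)
    for name, preds in val_preds.items()
}
best_name = max(scores, key=scores.get)
```

Macro averaging matters here because a model that ignores a class, as the second placeholder does, is penalized even when that class is rare, which plain accuracy would not capture.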