# Sentiment Analysis Model

This model is designed for sentiment analysis of English text. It predicts the sentiment of a given text as one of three classes: `positive`, `neutral`, or `negative`. The model was trained on a combination of datasets from Kaggle and Sentiment140.

## Model Description

Two approaches were trained and compared:
1. **Baseline Model**: A classical machine learning pipeline using TF-IDF vectorization and Logistic Regression.
2. **CNN Model**: A lightweight Convolutional Neural Network (CNN) implemented in Keras.

The best-performing model (based on validation macro-F1 score) is selected for inference.
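As a rough illustration of that selection rule, the sketch below computes macro-F1 (the unweighted mean of per-class F1) in plain Python and picks the higher-scoring candidate. The function names and the toy predictions are made up for this example and are not taken from the training code.

```python
LABELS = ("negative", "neutral", "positive")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_best(y_true, predictions_by_model):
    """Return the name of the model with the highest validation macro-F1, plus all scores."""
    scores = {name: macro_f1(y_true, preds) for name, preds in predictions_by_model.items()}
    return max(scores, key=scores.get), scores


# Toy validation labels and predictions (illustrative only).
y_val = ["positive", "neutral", "negative", "positive"]
candidates = {
    "baseline": ["positive", "neutral", "negative", "negative"],
    "cnn": ["positive", "positive", "negative", "positive"],
}
best_name, scores = select_best(y_val, candidates)
```

Macro averaging is a sensible choice here because a model that ignores one of the three classes is penalized regardless of how rare that class is.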

### Baseline Model
- **Vectorizer**: TF-IDF (word + character n-grams)
- **Classifier**: Logistic Regression
- **Features**: 200,000 max features, n-gram range (1, 2)
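One way such a pipeline might be assembled with scikit-learn is sketched below. The card does not say how the word and character vectorizers are combined, so several details are assumptions: the use of `FeatureUnion`, applying the (1, 2) range to word n-grams, the (3, 5) character range, applying the 200,000 cap to each vectorizer separately, and `max_iter=1000`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline():
    """TF-IDF (word + character n-grams) feeding a Logistic Regression classifier."""
    features = FeatureUnion([
        # Word unigrams and bigrams; the (1, 2) range is assumed to apply here.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=200_000)),
        # Character n-grams within word boundaries; the (3, 5) range is an assumption.
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=200_000)),
    ])
    return Pipeline([
        ("tfidf", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Tiny made-up corpus, only to show the fit/predict calls.
texts = [
    "i love this", "great phone",
    "it is a phone", "arrived on monday",
    "i hate this", "awful service",
]
labels = ["positive", "positive", "neutral", "neutral", "negative", "negative"]

model = build_baseline()
model.fit(texts, labels)
preds = model.predict(["love it", "awful"])
```

Character n-grams tend to help with misspellings and elongated words that are common in tweets, which is a plausible reason for pairing them with word n-grams.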

### CNN Model
- **Tokenizer**: Keras Tokenizer
- **Architecture**: Embedding layer -> 1D Convolution -> Global Max Pooling -> Dense layers
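The layer stack above could look roughly like the following in Keras. All hyperparameters (vocabulary size, sequence length, embedding width, filter count, kernel size, hidden units) are placeholders, since the card does not state them. The sketch assumes the text has already been converted to padded integer sequences, so the tokenizer step is not shown.

```python
import numpy as np
from tensorflow import keras


def build_cnn(vocab_size=20_000, max_len=64, embed_dim=64, num_classes=3):
    """Embedding -> 1D Convolution -> Global Max Pooling -> Dense layers."""
    return keras.Sequential([
        keras.Input(shape=(max_len,), dtype="int32"),
        keras.layers.Embedding(vocab_size, embed_dim),
        keras.layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        # Keep the strongest response of each filter across the sequence.
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(64, activation="relu"),
        # One probability per class: negative, neutral, positive.
        keras.layers.Dense(num_classes, activation="softmax"),
    ])


cnn = build_cnn()
cnn.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# A dummy batch of two padded sequences, just to check the output shape.
dummy_batch = np.zeros((2, 64), dtype="int32")
probs = cnn.predict(dummy_batch, verbose=0)
```

Global max pooling makes the output independent of where in the sequence a pattern occurs, which keeps the network small.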

## Training Data

The combined dataset was built from three sources:
- **Kaggle Train**: 27,477 samples
- **Sentiment140 Train**: 300,000 balanced samples
- **Sentiment140 Manual Test**: 516 samples

The datasets were cleaned and unified into a common schema with `text` and `sentiment` columns.
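A minimal sketch of that unification step follows. It relies on the public Sentiment140 CSV layout (polarity, id, date, query, user, text, with polarity 0, 2, or 4). The column names assumed for the Kaggle file, the helper names, and the whitespace-only cleaning are illustrative assumptions, not the actual preprocessing.

```python
import csv
import io

# Sentiment140 encodes polarity as 0 (negative), 2 (neutral), 4 (positive).
S140_LABELS = {"0": "negative", "2": "neutral", "4": "positive"}
VALID_LABELS = {"negative", "neutral", "positive"}


def clean_text(text):
    """Collapse runs of whitespace and trim; a stand-in for the real cleaning."""
    return " ".join(text.split())


def unify_sentiment140(rows):
    """Map (polarity, id, date, query, user, text) rows to the common schema."""
    out = []
    for row in rows:
        label = S140_LABELS.get(row[0])
        text = clean_text(row[5])
        if label and text:
            out.append({"text": text, "sentiment": label})
    return out


def unify_labeled(rows, text_col="text", label_col="sentiment"):
    """Map dict rows with string labels to the common schema (column names assumed)."""
    out = []
    for row in rows:
        label = (row.get(label_col) or "").strip().lower()
        text = clean_text(row.get(text_col) or "")
        if label in VALID_LABELS and text:
            out.append({"text": text, "sentiment": label})
    return out


# Inline toy data standing in for the real files.
s140_csv = (
    '"4","1","Mon May 11","NO_QUERY","user_a","  Loving   this phone "\n'
    '"0","2","Mon May 11","NO_QUERY","user_b","worst day ever"\n'
)
kaggle_csv = "text,sentiment\njust landed,neutral\n,positive\n"

unified = (
    unify_sentiment140(csv.reader(io.StringIO(s140_csv)))
    + unify_labeled(csv.DictReader(io.StringIO(kaggle_csv)))
)
```

Rows with empty text or an unrecognized label are dropped here; whether the original cleaning does the same is not stated in the card.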

## Evaluation

Both models were evaluated on a stratified validation split (15% of the training data), and the one with the higher macro-F1 score on that split was selected.
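To make "stratified" concrete, the sketch below holds out 15% of the indices within each class, so the validation set keeps the same class proportions as the full data. The function name, the seed, and the toy label counts are assumptions for illustration. In practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=42):
    """Return (train_indices, val_indices), sampling val_fraction from each class."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy label distribution (illustrative only).
labels = ["positive"] * 40 + ["neutral"] * 20 + ["negative"] * 40
train_idx, val_idx = stratified_split(labels)
```

With this toy distribution the validation set receives 6 positive, 3 neutral, and 6 negative examples, mirroring the 40/20/40 ratio of the full set.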