# Sentiment Analysis Model

This model performs sentiment analysis on English text. It classifies a given text into one of three classes: `positive`, `neutral`, or `negative`. It was trained on a combination of datasets from Kaggle and Sentiment140.

## Model Description

Two approaches were trained and compared:

1. **Baseline Model**: A classical machine learning pipeline using TF-IDF vectorization and Logistic Regression.
2. **CNN Model**: A lightweight Convolutional Neural Network (CNN) implemented in Keras.

The model with the higher validation macro-F1 score is selected for inference.

### Baseline Model

- **Vectorizer**: TF-IDF (word + character n-grams)
- **Classifier**: Logistic Regression
- **Features**: 200,000 max features, n-gram range (1, 2)
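
The baseline configuration can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the card's actual training code. It assumes that the (1, 2) range applies to the word vectorizer and that the 200,000 cap applies to each vectorizer separately. The character n-gram range, `max_iter`, and the `build_baseline` helper are likewise assumptions, and the toy sentences are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline(max_features=200_000):
    """TF-IDF (word + character n-grams) feeding Logistic Regression."""
    features = FeatureUnion([
        # Word unigrams and bigrams; assumed to be what "(1, 2)" refers to.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 max_features=max_features)),
        # Character n-gram range is an assumption; the card does not state it.
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                 max_features=max_features)),
    ])
    return Pipeline([
        ("tfidf", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data for illustration only; not drawn from the training sets.
texts = ["I love this", "this is awful", "it is a phone",
         "love it so much", "awful, just awful", "the phone is on the table"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

model = build_baseline()
model.fit(texts, labels)
prediction = model.predict(["I love it"])[0]
```

Combining both vectorizers in a `FeatureUnion` keeps the word and character vocabularies independent, so the classifier sees them as one concatenated sparse feature matrix.
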

### CNN Model

- **Tokenizer**: Keras Tokenizer
- **Architecture**: Embedding layer -> 1D Convolution -> Global Max Pooling -> Dense layers
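
The layer sequence above might look like this in Keras. The card gives only the layer order, so every hyperparameter here (vocabulary size, embedding width, filter count, kernel size, dense width, optimizer, loss) is an assumption, as is the `build_cnn` helper name. The sketch also assumes the text has already been converted to padded integer sequences by the tokenizer and that labels are encoded as integers 0 to 2.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(vocab_size=20_000, embed_dim=128, num_classes=3):
    """Embedding -> Conv1D -> GlobalMaxPooling1D -> Dense layers."""
    model = keras.Sequential([
        # Maps each token id to a dense vector.
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        # Slides 5-token filters over the sequence (sizes are assumptions).
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        # Keeps the strongest activation of each filter across the sequence.
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        # One probability per class: negative, neutral, positive.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Global max pooling makes the output size independent of the input length, which is one reason this architecture stays lightweight.
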

## Training Data

The training data combines three sources:

- **Kaggle Train**: 27,477 samples
- **Sentiment140 Train**: 300,000 balanced samples
- **Sentiment140 Manual Test**: 516 samples

All three were cleaned and unified into a common schema with two columns, `text` and `sentiment`.
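
As a rough sketch of the unification step, the snippet below maps two differently shaped frames onto the shared `text` / `sentiment` schema with pandas. The card does not describe the cleaning rules, so the ones shown (dropping missing rows, trimming whitespace, removing duplicate texts) are assumptions, as are the helper names and the Kaggle column names. The 0/2/4 target encoding is the one used by the public Sentiment140 files.

```python
import pandas as pd

# Sentiment140 encodes polarity as 0 (negative), 2 (neutral), 4 (positive).
SENTIMENT140_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def unify_kaggle(df):
    # Assumes the Kaggle file already has `text` and `sentiment` columns.
    out = df[["text", "sentiment"]].copy()
    out["sentiment"] = out["sentiment"].str.lower()
    return out


def unify_sentiment140(df):
    # Assumes columns named `target` and `text`.
    return pd.DataFrame({
        "text": df["text"],
        "sentiment": df["target"].map(SENTIMENT140_LABELS),
    })


def clean(df):
    # Hypothetical cleaning rules; the card does not list the actual ones.
    df = df.dropna(subset=["text", "sentiment"]).copy()
    df["text"] = df["text"].astype(str).str.strip()
    df = df[df["text"] != ""]
    return df.drop_duplicates(subset="text").reset_index(drop=True)


# Tiny invented frames standing in for the real files.
kaggle = pd.DataFrame({"textID": ["a", "b"],
                       "text": [" good day ", None],
                       "sentiment": ["positive", "neutral"]})
s140 = pd.DataFrame({"target": [0, 4], "text": ["bad day", "great"]})

unified = clean(pd.concat([unify_kaggle(kaggle), unify_sentiment140(s140)],
                          ignore_index=True))
```
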

## Evaluation

Both models were evaluated on a stratified validation split (15% of the training data), and the one with the higher macro-F1 score was selected.
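
The split and the selection rule can be illustrated with scikit-learn. This is a sketch under assumptions: the random seed, the `select_best` helper, the synthetic labels, and the candidate predictions are all made up for the example and do not reflect the actual models' results.

```python
from collections import Counter

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic labels, only to show that stratification preserves class ratios.
labels = ["positive"] * 40 + ["negative"] * 40 + ["neutral"] * 20
texts = [f"sample {i}" for i in range(len(labels))]

# Hold out 15% for validation, keeping the class proportions of `labels`.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42)
val_counts = Counter(y_val)


def select_best(y_true, predictions_by_model):
    """Return the model name with the highest macro-F1, plus all scores."""
    scores = {name: f1_score(y_true, y_pred, average="macro")
              for name, y_pred in predictions_by_model.items()}
    return max(scores, key=scores.get), scores


# Invented predictions for two hypothetical candidates.
y_true = ["positive", "positive", "neutral",
          "negative", "negative", "negative"]
candidates = {
    "candidate_a": ["positive", "neutral", "neutral",
                    "negative", "negative", "positive"],
    "candidate_b": ["positive", "positive", "neutral",
                    "negative", "negative", "neutral"],
}
best_name, scores = select_best(y_true, candidates)
```

Macro-F1 averages the per-class F1 scores with equal weight, so the smaller `neutral` class counts as much as the two larger classes when the models are compared.
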