# Spam Classification Models

## Overview

This repository contains two models for detecting spam in SMS messages, both trained on the `mltrev23/spam-classify` dataset:

1. **Spam Classifier**: A machine learning model trained to classify SMS messages as either spam or ham (non-spam).
2. **Count Vectorizer**: A fitted vectorizer that transforms SMS text into numerical feature vectors suitable for classification.

## Models

### 1. Spam Classifier

- **Filename**: `spam_classifier.pkl`
- **Type**: Multinomial Naive Bayes (confirm with `type(classifier).__name__` after loading, since the model type may differ)
- **Input**: Numerical feature vectors (output from the Count Vectorizer)
- **Output**: Binary classification (`spam` or `ham`)

### 2. Count Vectorizer

- **Filename**: `count_vectorizer.pkl`
- **Type**: Scikit-learn's `CountVectorizer`
- **Input**: Raw SMS text data
- **Output**: Sparse matrix of token counts

## Dataset

Both models were trained on the `mltrev23/spam-classify` dataset, which consists of SMS messages labeled as either spam or ham. It provides a varied set of messages for training a detector of unwanted or harmful content.

## Installation

To use these models, first clone this repository and install the required Python packages:

```bash
git clone https://huggingface.co/mltrev23/spam-classification
cd spam-classification
pip install -r requirements.txt
```

### Requirements

The models require the following Python libraries (`joblib` is used to load the `.pkl` files):

```bash
pip install scikit-learn
pip install numpy
pip install joblib
```

## Usage

### Loading the Models

You can load the models using the `joblib` library:

```python
import joblib

# Load the count vectorizer
vectorizer = joblib.load('count_vectorizer.pkl')

# Load the spam classifier
classifier = joblib.load('spam_classifier.pkl')
```

### Predicting Spam Messages

To classify new SMS messages, transform them with the vectorizer and pass the result to the classifier:

```python
# Sample SMS messages
messages = [
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)",
    "I'll call you later",
]

# Transform the messages into feature vectors
X = vectorizer.transform(messages)

# Predict using the classifier
predictions = classifier.predict(X)

# Output the predictions
# (assumes the labels are the strings 'spam' and 'ham'; check classifier.classes_)
for message, prediction in zip(messages, predictions):
    print(f"Message: {message}\nPrediction: {'Spam' if prediction == 'spam' else 'Ham'}\n")
```

### Evaluating the Classifier

You can also evaluate the performance of the classifier using a test set from the same dataset:

```python
from sklearn.metrics import accuracy_score, classification_report

# Assuming you have a test set: test_messages (list of str) and test_labels
# (list of labels in the same format the classifier was trained on)
X_test = vectorizer.transform(test_messages)
y_pred = classifier.predict(X_test)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print(classification_report(test_labels, y_pred))
```

## Model Interpretation

Understanding why the classifier makes a given decision matters as much as the prediction itself. You can inspect the model's learned parameters, such as the most influential words for each class (spam or ham), to gain insight into how the model works.

## Contributing

If you wish to contribute to this repository by improving the models or expanding the dataset, feel free to submit a pull request. Please ensure that your code is well-documented and adheres to the existing style.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## References

If you use these models in your research or project, please cite the dataset and the relevant model training method as follows:

- **Dataset**: `mltrev23/spam-classify`
- **Naive Bayes**: McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization (pp. 41-48).
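
## Example: Inspecting Influential Words

The Model Interpretation section mentions looking at the most influential words for each class. The following is a minimal sketch of one way to do that, assuming the classifier really is a Multinomial Naive Bayes model. The helper `top_words_per_class` is hypothetical (it is not part of this repository), and the toy numbers below are made up purely for illustration. It ranks each word by its log-probability in one class minus its highest log-probability in any other class.

```python
import math
from typing import Dict, List, Sequence


def top_words_per_class(
    feature_names: Sequence[str],
    log_probs: Sequence[Sequence[float]],
    classes: Sequence[str],
    n: int = 10,
) -> Dict[str, List[str]]:
    """Return, for each class, the n words with the largest log-ratio
    log P(word | class) - max over other classes of log P(word | other)."""
    if len(classes) < 2:
        raise ValueError("need at least two classes to compare")
    ranked: Dict[str, List[str]] = {}
    for i, label in enumerate(classes):
        scores = []
        for j, word in enumerate(feature_names):
            # Best competing log-probability for this word among the other classes
            other_best = max(
                log_probs[k][j] for k in range(len(classes)) if k != i
            )
            scores.append((log_probs[i][j] - other_best, word))
        # Highest log-ratio first
        scores.sort(key=lambda pair: pair[0], reverse=True)
        ranked[str(label)] = [word for _, word in scores[:n]]
    return ranked


# Toy, made-up per-class word probabilities (illustration only, not real model output)
demo_names = ["free", "win", "call", "later"]
demo_log_probs = [
    [math.log(0.05), math.log(0.05), math.log(0.40), math.log(0.50)],  # ham
    [math.log(0.45), math.log(0.35), math.log(0.15), math.log(0.05)],  # spam
]
demo_classes = ["ham", "spam"]

demo_top = top_words_per_class(demo_names, demo_log_probs, demo_classes, n=2)
print(demo_top)
```

With the loaded models, the same helper could be called as `top_words_per_class(vectorizer.get_feature_names_out(), classifier.feature_log_prob_, classifier.classes_)`. This relies on the `feature_log_prob_` attribute that scikit-learn's `MultinomialNB` exposes, so it will not work unchanged if the pickled classifier is a different model type.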