
# **Text Classification (Traditional ML)**  
### Spam Detection using TF-IDF + Naïve Bayes

This notebook covers:
- Sentiment / Text Classification Basics  
- Train/Test Split & Evaluation Metrics  
- TF-IDF Feature Extraction  
- Naïve Bayes Model for Spam Detection  
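
As a rough illustration of the TF-IDF idea listed above, the sketch below fits `TfidfVectorizer` with default settings on three made-up messages (not taken from the dataset) and compares the weight of a rare word with that of a word that appears in every message.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages invented for illustration; "now" appears in all three
toy_docs = ["free prize now", "call me now", "see you now"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Look up the weights of two terms in the first message
vocab = toy_vectorizer.vocabulary_
first_row = toy_matrix[0].toarray().ravel()
weight_free = first_row[vocab["free"]]
weight_now = first_row[vocab["now"]]

# "free" occurs in one message and "now" in all three,
# so "free" receives the larger weight within the first message
print(f"free: {weight_free:.3f}, now: {weight_now:.3f}")
```

Words that occur in many messages carry little information about the class, which is why TF-IDF down-weights them.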


In [1]:

# Install required libraries (if not present)
# !pip install scikit-learn pandas kagglehub


In [2]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import kagglehub

# 1. Download & Load Dataset

In [3]:
dataset_path = kagglehub.dataset_download("uciml/sms-spam-collection-dataset")

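# Locate the first CSV file in the download directory
data_file = None
for file in os.listdir(dataset_path):
    if file.endswith(".csv"):
        data_file = os.path.join(dataset_path, file)
        break
if data_file is None:
    raise FileNotFoundError(f"No CSV file found in {dataset_path}")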

print("Using dataset file:", data_file)

# Keep only the label column (v1) and the message column (v2), then rename them
df = pd.read_csv(data_file, encoding="latin-1")[['v1','v2']]
df.columns = ['label', 'text']

Using Colab cache for faster access to the 'sms-spam-collection-dataset' dataset.
Using dataset file: /kaggle/input/sms-spam-collection-dataset/spam.csv


# 2. Prepare Data

In [4]:
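# Features are the raw message text; targets are the 'ham' / 'spam' labels
X = df['text']
y = df['label']

# Convert each message to a TF-IDF vector, dropping common English stop words.
# Note: the vectorizer is fitted on the full corpus before the split, so its
# vocabulary and IDF statistics also reflect the test messages.
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(X)

# Hold out 20% of the messages for evaluation (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)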
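
A stricter variant splits the raw text first and fits TF-IDF on the training part only, so nothing about the test messages reaches the vectorizer. The sketch below is one possible way to do this with a scikit-learn pipeline. The toy messages and all variable names are invented for illustration, and the results printed later in this notebook were not produced with it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['text'] / df['label']; invented for illustration
toy_texts = [
    "win a free prize now", "free cash reward claim now",
    "urgent claim your free gift", "you won a cash prize",
    "are we meeting tomorrow", "send me the report please",
    "see you at lunch", "call me when you are home",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the class ratio in both parts
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
spam_pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
spam_pipeline.fit(texts_train, labels_train)

# At prediction time the fitted vocabulary is reused for unseen text
toy_predictions = spam_pipeline.predict(texts_test)
print(list(zip(labels_test, toy_predictions)))
```

A pipeline also means only one object has to be saved and loaded later, since the vectorizer and classifier travel together.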

# 3. Train Model

In [5]:
# Multinomial Naive Bayes with scikit-learn's default smoothing (alpha=1.0)
model = MultinomialNB()
model.fit(X_train, y_train)
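
To make the fitted model less of a black box, the sketch below recomputes its decision by hand on invented toy data: each class score is the log prior plus the TF-IDF-weighted sum of per-term log probabilities, and the highest score wins. It relies on the `class_log_prior_`, `feature_log_prob_`, and `classes_` attributes of a fitted `MultinomialNB`; everything else is a hypothetical stand-in.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data invented for illustration
toy_texts = [
    "win a free prize now", "free cash reward claim now",
    "urgent claim your free gift", "you won a cash prize",
    "are we meeting tomorrow", "send me the report please",
    "see you at lunch", "call me when you are home",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Vectorize two unseen messages with the already fitted vocabulary
new_features = toy_vectorizer.transform(["claim your free prize", "see you tomorrow"])

# Score per class = log prior + sum over terms of (weight * log P(term | class))
log_scores = new_features.toarray() @ toy_model.feature_log_prob_.T
log_scores = log_scores + toy_model.class_log_prior_

# Pick the class with the highest score for each message
manual_predictions = toy_model.classes_[np.argmax(log_scores, axis=1)]
print(manual_predictions, toy_model.predict(new_features))
```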

# 4. Evaluate

In [6]:
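# Predict on the held-out messages and report per-class metrics
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Rows are true labels, columns are predicted labels (order: ham, spam)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))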


Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.77      0.87       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.93      1115
weighted avg       0.97      0.97      0.97      1115

Confusion Matrix:
[[965   0]
 [ 35 115]]
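
The spam row of the report can be checked by hand from the confusion matrix. The sketch below copies the four counts from the output above (rows are true labels, columns are predicted labels) and recomputes precision, recall, F1, and accuracy; the variable names are only illustrative.

```python
# Counts copied from the confusion matrix above
ham_as_ham, ham_as_spam = 965, 0      # true ham: predicted ham / predicted spam
spam_as_ham, spam_as_spam = 35, 115   # true spam: predicted ham / predicted spam

# Precision: of everything flagged as spam, how much really was spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of all real spam, how much was caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The values agree with the report: the model never marks a legitimate message as spam, but it misses 35 of the 150 spam messages, which is why recall (0.77) is much lower than precision (1.00).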


# 5. Save Model & Vectorizer

In [7]:
joblib.dump(model, "spam_classifier_model.joblib")
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
print("\nSaved model and vectorizer successfully.")


Saved model and vectorizer successfully.
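
The saved files are only useful together: the model expects vectors produced by the same fitted vectorizer. The sketch below shows one way to load both back with `joblib.load`. It trains a tiny stand-in model on invented data and writes to a temporary directory so it can run on its own; in this notebook the two file names from the cell above would be used directly.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in model so the sketch runs on its own; data invented for illustration
toy_texts = ["win a free prize now", "claim your free cash reward",
             "are we meeting tomorrow", "send me the report please"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_classifier_model.joblib")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.joblib")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Later, for example in another process: load both artifacts back
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Transform with the loaded vectorizer, then predict with the loaded model
new_message = ["free prize waiting for you"]
reloaded_prediction = loaded_model.predict(loaded_vectorizer.transform(new_message))
original_prediction = toy_model.predict(toy_vectorizer.transform(new_message))
print(reloaded_prediction[0], original_prediction[0])
```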


# 6. Test With Some Messages

In [8]:
test_texts = [
    "Congratulations! You have been selected to win a $1000 gift card!",
    "Hey, are we still on for the meeting tomorrow?",
    "Click this link to claim your exclusive reward!!!",
    "Can you send me the documents?"
]

test_vectors = vectorizer.transform(test_texts)
predictions = model.predict(test_vectors)

print("\nModel Predictions:")
for text, pred in zip(test_texts, predictions):
    print(f"{pred.upper()} --> {text}")


Model Predictions:
SPAM --> Congratulations! You have been selected to win a $1000 gift card!
HAM --> Hey, are we still on for the meeting tomorrow?
SPAM --> Click this link to claim your exclusive reward!!!
HAM --> Can you send me the documents?
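
Hard labels hide how confident the model is. As a possible extension, the sketch below uses `predict_proba` to obtain a spam probability per message and applies a custom cut-off. The toy data and the threshold of 0.3 are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data invented for illustration
toy_texts = [
    "win a free prize now", "free cash reward claim now",
    "urgent claim your free gift", "you won a cash prize",
    "are we meeting tomorrow", "send me the report please",
    "see you at lunch", "call me when you are home",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

new_messages = ["claim your free prize", "see you tomorrow"]
probabilities = toy_model.predict_proba(toy_vectorizer.transform(new_messages))

# Columns of predict_proba follow the order of classes_
spam_index = list(toy_model.classes_).index("spam")
spam_probability = probabilities[:, spam_index]

# Hypothetical threshold; for two classes, 0.5 matches what predict() does
threshold = 0.3
flagged = np.where(spam_probability >= threshold, "spam", "ham")

for message, prob, label in zip(new_messages, spam_probability, flagged):
    print(f"{label.upper()} (p_spam={prob:.2f}) --> {message}")
```

Lowering the threshold flags more messages as spam, which can raise spam recall at the cost of some precision; the right trade-off depends on how costly a missed spam is compared with a blocked legitimate message.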
