{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cd2a615d",
   "metadata": {},
   "source": [
    "# Data Preprocessing\n",
    "\n",
    "EDA gave us a clear picture of what we're working with. Now the job is to make the raw text usable transforming it from messy, inconsistent strings into a clean, uniform representation that a vectoriser can work with reliably.\n",
    "\n",
    "Models don't understand that \"GREAT\", \"great\", and \"great!!\" all carry the same meaning. They see three different tokens. Every inconsistency in the raw text is an opportunity for the model to learn a spurious pattern instead of genuine sentiment. Preprocessing is how we close those loopholes before they become problems.\n",
    "\n",
    "The decisions made here as what to strip, what to keep, how to combine fields directly shape the feature space. We're not just cleaning data; we're making choices about what information the model is even allowed to see."
   ]
  },
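  {
   "cell_type": "markdown",
   "id": "7c1e4a90",
   "metadata": {},
   "source": [
    "As a small illustration of the token problem described above, here is a minimal sketch using only the standard library. The `normalise` helper is hypothetical and is not the project's `clean_text`; it only lowercases and strips non-alphanumeric characters, which is enough to show the three variants collapsing into one token.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def normalise(token: str) -> str:\n",
    "    # Hypothetical helper for illustration only (not the project's clean_text):\n",
    "    # lowercase, then drop anything that is not a letter, digit, or space.\n",
    "    return re.sub(r'[^a-z0-9 ]', '', token.lower())\n",
    "\n",
    "variants = ['GREAT', 'great', 'great!!']\n",
    "\n",
    "# Number of distinct strings a vectoriser would see in the raw text.\n",
    "print(len(set(variants)))\n",
    "\n",
    "# Distinct strings after normalising.\n",
    "print({normalise(v) for v in variants})\n",
    "```"
   ]
  },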
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "289e4659",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb18f311",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "back to root to call functions from helpers.py file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3eeedf88",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "e:\\AI_ML\\proj\\sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens\\notebooks\n",
      "E:\\AI_ML\\proj\\sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "print(Path.cwd())\n",
    "os.chdir(Path('..').resolve())\n",
    "from src.utils.helpers import clean_text, save\n",
    "print(Path.cwd())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9da929eb",
   "metadata": {},
   "source": [
    "## Loading the Balanced Training Data\n",
    "\n",
    "We load the balanced dataset produced at the end of EDA — already cleaned of basic procedures. The `.info()` check below is a quick sanity check that everything still alright."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b899920f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "      <th>review_content_char_count</th>\n",
       "      <th>review_content_word_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>GREAT CAMRA</td>\n",
       "      <td>I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ...</td>\n",
       "      <td>586</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>not so great</td>\n",
       "      <td>I'm using this book in an introductory organic...</td>\n",
       "      <td>570</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Inaccurate and disappointing</td>\n",
       "      <td>I only read the first few chapters and was bom...</td>\n",
       "      <td>214</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>Equus 3340</td>\n",
       "      <td>Feels cheaply made, the battery contacts were ...</td>\n",
       "      <td>193</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>awesome sheets!</td>\n",
       "      <td>I love these sheets! They are sleek &amp; smooth w...</td>\n",
       "      <td>198</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                  review_title  \\\n",
       "0             2                   GREAT CAMRA   \n",
       "1             1                  not so great   \n",
       "2             1  Inaccurate and disappointing   \n",
       "3             1                    Equus 3340   \n",
       "4             2               awesome sheets!   \n",
       "\n",
       "                                      review_content  \\\n",
       "0  I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ...   \n",
       "1  I'm using this book in an introductory organic...   \n",
       "2  I only read the first few chapters and was bom...   \n",
       "3  Feels cheaply made, the battery contacts were ...   \n",
       "4  I love these sheets! They are sleek & smooth w...   \n",
       "\n",
       "  review_content_char_count review_content_word_count  \n",
       "0                       586                       108  \n",
       "1                       570                        88  \n",
       "2                       214                        40  \n",
       "3                       193                        34  \n",
       "4                       198                        38  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "balanced_sample_train = pd.read_csv(r'data/processed/balanced_sample_train.csv', dtype=str, quoting=0)\n",
    "balanced_sample_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5623cf6a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.DataFrame'>\n",
      "RangeIndex: 79972 entries, 0 to 79971\n",
      "Data columns (total 5 columns):\n",
      " #   Column                     Non-Null Count  Dtype\n",
      "---  ------                     --------------  -----\n",
      " 0   review_target              79972 non-null  str  \n",
      " 1   review_title               79972 non-null  str  \n",
      " 2   review_content             79972 non-null  str  \n",
      " 3   review_content_char_count  79972 non-null  str  \n",
      " 4   review_content_word_count  79972 non-null  str  \n",
      "dtypes: str(5)\n",
      "memory usage: 3.1 MB\n"
     ]
    }
   ],
   "source": [
    "balanced_sample_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "524acdc6",
   "metadata": {},
   "source": [
    "## What We're Looking At\n",
    "\n",
    "Three columns matter: `review_title`, `review_content`, and `review_target`. The target is straightforward 1 for negative, 2 for positive. The text columns are where the work is done.\n",
    "\n",
    "Titles tend to be short, punchy, and deliberately expressive, customers often condense their entire opinion into a few words. Content is longer and more nuanced, but also noisier. Combining them gives the model access to both the headline sentiment and the full argument behind it, which is why we concatenate them rather than choosing one."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5b9401e",
   "metadata": {},
   "source": [
    "## Cleaning the Training Text\n",
    "\n",
    "The cleaning pipeline does several things in sequence, and the order matters. We concatenate title and content *before* cleaning so the join character doesn't accidentally survive as an artefact then normalise to lowercase, and remove punctuation..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2deb74f4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "      <th>review_content_char_count</th>\n",
       "      <th>review_content_word_count</th>\n",
       "      <th>review_content_cleaned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>GREAT CAMRA</td>\n",
       "      <td>I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ...</td>\n",
       "      <td>586</td>\n",
       "      <td>108</td>\n",
       "      <td>great camra dx6340 year love picture good 35m ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>not so great</td>\n",
       "      <td>I'm using this book in an introductory organic...</td>\n",
       "      <td>570</td>\n",
       "      <td>88</td>\n",
       "      <td>not great using book introductory organic spec...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Inaccurate and disappointing</td>\n",
       "      <td>I only read the first few chapters and was bom...</td>\n",
       "      <td>214</td>\n",
       "      <td>40</td>\n",
       "      <td>inaccurate disappointing read first chapter bo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>Equus 3340</td>\n",
       "      <td>Feels cheaply made, the battery contacts were ...</td>\n",
       "      <td>193</td>\n",
       "      <td>34</td>\n",
       "      <td>equus 3340 feel cheaply made battery contact r...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>awesome sheets!</td>\n",
       "      <td>I love these sheets! They are sleek &amp; smooth w...</td>\n",
       "      <td>198</td>\n",
       "      <td>38</td>\n",
       "      <td>awesome sheet love sheet sleek smooth really c...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                  review_title  \\\n",
       "0             2                   GREAT CAMRA   \n",
       "1             1                  not so great   \n",
       "2             1  Inaccurate and disappointing   \n",
       "3             1                    Equus 3340   \n",
       "4             2               awesome sheets!   \n",
       "\n",
       "                                      review_content  \\\n",
       "0  I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ...   \n",
       "1  I'm using this book in an introductory organic...   \n",
       "2  I only read the first few chapters and was bom...   \n",
       "3  Feels cheaply made, the battery contacts were ...   \n",
       "4  I love these sheets! They are sleek & smooth w...   \n",
       "\n",
       "  review_content_char_count review_content_word_count  \\\n",
       "0                       586                       108   \n",
       "1                       570                        88   \n",
       "2                       214                        40   \n",
       "3                       193                        34   \n",
       "4                       198                        38   \n",
       "\n",
       "                              review_content_cleaned  \n",
       "0  great camra dx6340 year love picture good 35m ...  \n",
       "1  not great using book introductory organic spec...  \n",
       "2  inaccurate disappointing read first chapter bo...  \n",
       "3  equus 3340 feel cheaply made battery contact r...  \n",
       "4  awesome sheet love sheet sleek smooth really c...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_train = balanced_sample_train.copy()\n",
    "processed_train['review_content_cleaned'] = clean_text(processed_train['review_title'].fillna('') + ' ' + processed_train['review_content'].fillna(''))\n",
    "processed_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0be1b03e",
   "metadata": {},
   "source": [
    "## What the Cleaned Text Looks Like\n",
    "\n",
    "The `review_content_cleaned` column should now contain plain, lowercase text with no HTML, no punctuation, and no stray whitespace. Spot-checking a few rows is worth doing here, particularly any that looked unusual in the raw data (very short reviews, reviews with lots of special characters, non-English text that slipped through).\n",
    "\n",
    "What we're looking for: the cleaned text should still be readable not near empty or too stripped"
   ]
  },
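  {
   "cell_type": "markdown",
   "id": "3f9b2d61",
   "metadata": {},
   "source": [
    "One way to make the spot-check above systematic is to count the tokens that survive cleaning and pull out the rows with very few. This is a sketch on a toy frame, not the project's data: the rows, the `cleaned_word_count` column, and the `MIN_TOKENS` threshold are all assumptions for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame standing in for processed_train; the real column comes from clean_text.\n",
    "toy = pd.DataFrame({\n",
    "    'review_content': ['I LOVE IT!!!', '!!! ???', 'Feels cheaply made, returned it.'],\n",
    "    'review_content_cleaned': ['love', '', 'feel cheaply made returned'],\n",
    "})\n",
    "\n",
    "# Count whitespace-separated tokens that survived cleaning.\n",
    "toy['cleaned_word_count'] = toy['review_content_cleaned'].str.split().str.len()\n",
    "\n",
    "# A threshold of 2 tokens is an arbitrary assumption; tune it after reading a few rows.\n",
    "MIN_TOKENS = 2\n",
    "suspicious = toy[toy['cleaned_word_count'] < MIN_TOKENS]\n",
    "print(suspicious[['review_content', 'review_content_cleaned', 'cleaned_word_count']])\n",
    "```\n",
    "\n",
    "Rows flagged this way are candidates for a manual read rather than automatic removal."
   ]
  },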
  {
   "cell_type": "markdown",
   "id": "09704c70",
   "metadata": {},
   "source": [
    "## Applying the Same Pipeline to Validation Data\n",
    "\n",
    "The validation set must go through exactly the same cleaning steps as training, same function, same parameters, same concatenation logic. Any deviation creates a mismatch between the distribution the model was trained on and the distribution it's being evaluated against, which would make our validation metrics unreliable."
   ]
  },
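  {
   "cell_type": "markdown",
   "id": "b42d7e15",
   "metadata": {},
   "source": [
    "One way to guarantee the same logic for every split is to put the concatenation and the cleaner behind a single function. The sketch below is hypothetical: `add_cleaned_column` and `toy_clean` are illustrative names, and `toy_clean` is a simple stand-in because the real `clean_text` lives in `src/utils/helpers.py` and is not shown here.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def toy_clean(texts: pd.Series) -> pd.Series:\n",
    "    # Stand-in cleaner (assumption): lowercase, replace non-alphanumerics\n",
    "    # with spaces, then collapse repeated whitespace.\n",
    "    lowered = texts.str.lower()\n",
    "    stripped = lowered.str.replace(r'[^a-z0-9 ]', ' ', regex=True)\n",
    "    return stripped.str.split().str.join(' ')\n",
    "\n",
    "def add_cleaned_column(df: pd.DataFrame, cleaner=toy_clean) -> pd.DataFrame:\n",
    "    # Hypothetical wrapper: the concatenation and the cleaner live in one place,\n",
    "    # so the training and validation steps cannot drift apart.\n",
    "    out = df.copy()\n",
    "    combined = out['review_title'].fillna('') + ' ' + out['review_content'].fillna('')\n",
    "    out['review_content_cleaned'] = cleaner(combined)\n",
    "    return out\n",
    "\n",
    "train = pd.DataFrame({'review_title': ['GREAT CAMRA'],\n",
    "                      'review_content': ['I LOVE IT!']})\n",
    "valid = pd.DataFrame({'review_title': ['Very disappointing', None],\n",
    "                      'review_content': ['Poorly written.', 'Not for me']})\n",
    "\n",
    "train_out = add_cleaned_column(train)\n",
    "valid_out = add_cleaned_column(valid)\n",
    "print(train_out['review_content_cleaned'].tolist())\n",
    "print(valid_out['review_content_cleaned'].tolist())\n",
    "```\n",
    "\n",
    "In the notebook itself, the same effect could be had by passing `clean_text` as the `cleaner` argument, assuming it accepts and returns a pandas Series as the calls in this notebook suggest."
   ]
  },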
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9f585815",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>Everything you need</td>\n",
       "      <td>This is a wonderful book. It may have been mea...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Important note about carrier</td>\n",
       "      <td>The carrier is very cute, and lightweight...ho...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Not a musical instrument -cannot be played</td>\n",
       "      <td>I bought (elsewhere) one of these harps for my...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Do I Iike this monitor? Well... I have 2!</td>\n",
       "      <td>I have 2 of these babies hooked up to a dual-o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Very disappointing</td>\n",
       "      <td>This book is very poorly written and lacks of ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                                review_title  \\\n",
       "0             2                         Everything you need   \n",
       "1             1                Important note about carrier   \n",
       "2             1  Not a musical instrument -cannot be played   \n",
       "3             2   Do I Iike this monitor? Well... I have 2!   \n",
       "4             1                          Very disappointing   \n",
       "\n",
       "                                      review_content  \n",
       "0  This is a wonderful book. It may have been mea...  \n",
       "1  The carrier is very cute, and lightweight...ho...  \n",
       "2  I bought (elsewhere) one of these harps for my...  \n",
       "3  I have 2 of these babies hooked up to a dual-o...  \n",
       "4  This book is very poorly written and lacks of ...  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_valid = pd.read_csv(r'data/samples/sample_valid.csv', dtype=str, quoting=0)\n",
    "processed_valid.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "07d9f22d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "      <th>review_content_cleaned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>Everything you need</td>\n",
       "      <td>This is a wonderful book. It may have been mea...</td>\n",
       "      <td>everything need wonderful book may meant clerg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Important note about carrier</td>\n",
       "      <td>The carrier is very cute, and lightweight...ho...</td>\n",
       "      <td>important note carrier carrier very cute light...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Not a musical instrument -cannot be played</td>\n",
       "      <td>I bought (elsewhere) one of these harps for my...</td>\n",
       "      <td>not musical instrument cannot played bought el...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Do I Iike this monitor? Well... I have 2!</td>\n",
       "      <td>I have 2 of these babies hooked up to a dual-o...</td>\n",
       "      <td>iike monitor well baby hooked dual output digi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Very disappointing</td>\n",
       "      <td>This book is very poorly written and lacks of ...</td>\n",
       "      <td>very disappointing book very poorly written la...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                                review_title  \\\n",
       "0             2                         Everything you need   \n",
       "1             1                Important note about carrier   \n",
       "2             1  Not a musical instrument -cannot be played   \n",
       "3             2   Do I Iike this monitor? Well... I have 2!   \n",
       "4             1                          Very disappointing   \n",
       "\n",
       "                                      review_content  \\\n",
       "0  This is a wonderful book. It may have been mea...   \n",
       "1  The carrier is very cute, and lightweight...ho...   \n",
       "2  I bought (elsewhere) one of these harps for my...   \n",
       "3  I have 2 of these babies hooked up to a dual-o...   \n",
       "4  This book is very poorly written and lacks of ...   \n",
       "\n",
       "                              review_content_cleaned  \n",
       "0  everything need wonderful book may meant clerg...  \n",
       "1  important note carrier carrier very cute light...  \n",
       "2  not musical instrument cannot played bought el...  \n",
       "3  iike monitor well baby hooked dual output digi...  \n",
       "4  very disappointing book very poorly written la...  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_valid['review_content_cleaned'] = clean_text(processed_valid['review_title'].fillna('') + ' ' + processed_valid['review_content'].fillna(''))\n",
    "processed_valid.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc1443d5",
   "metadata": {},
   "source": [
    "## Validation Data After Cleaning\n",
    "\n",
    "The validation set is now in the same form as the training set: a single `review_content_cleaned` column containing lowercased, punctuation-free text. No information from the training set has leaked into this process.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8168fa1d",
   "metadata": {},
   "source": [
    "## Applying the Pipeline to Test Data\n",
    "\n",
    "The test set is treated with particular care. We don't examine its label distribution, don't compute statistics on it to inform any decisions, and we certainly don't adjust the cleaning pipeline based on anything we see in it. It's processed mechanically, exactly as training and validation were.\n",
    "\n",
    "Note: unlike the training set (which concatenated title + content), the test cleaning here is applied separately to content and title. This gives us flexibility to experiment with different feature combinations at the modelling stage without needing to re-run preprocessing."
   ]
  },
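  {
   "cell_type": "markdown",
   "id": "e08a5c37",
   "metadata": {},
   "source": [
    "If a combined test feature is needed later, the separately cleaned title and content can be joined in the same title-then-content order used for training and validation. A minimal sketch on toy rows follows; `review_text_combined` is a hypothetical column name, and joining after cleaning only matches cleaning the concatenation if the cleaner works token by token, which is an assumption about `clean_text`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy rows shaped like the test frame once title and content are both cleaned.\n",
    "toy_test = pd.DataFrame({\n",
    "    'review_title_cleaned': ['great book', 'huge disappointment', ''],\n",
    "    'review_content_cleaned': ['must preface saying', 'big time long term', 'read book'],\n",
    "})\n",
    "\n",
    "# Title first, then content; strip() removes the leading space left by an empty title.\n",
    "title = toy_test['review_title_cleaned'].fillna('')\n",
    "content = toy_test['review_content_cleaned'].fillna('')\n",
    "toy_test['review_text_combined'] = (title + ' ' + content).str.strip()\n",
    "\n",
    "print(toy_test['review_text_combined'].tolist())\n",
    "```"
   ]
  },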
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5eb6491f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>This is a great book</td>\n",
       "      <td>I must preface this by saying that I am not re...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Huge Disappointment.</td>\n",
       "      <td>As a big time, long term Trevanian fan, I was ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Wayne is tight but cant hang with Turk.</td>\n",
       "      <td>This album is hot as it wants to be. However C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Excellent</td>\n",
       "      <td>I read this book when I was in elementary scho...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Not about Anusara</td>\n",
       "      <td>Although this book is touted on several Anusar...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                             review_title  \\\n",
       "0             2                     This is a great book   \n",
       "1             1                     Huge Disappointment.   \n",
       "2             2  Wayne is tight but cant hang with Turk.   \n",
       "3             2                                Excellent   \n",
       "4             1                        Not about Anusara   \n",
       "\n",
       "                                      review_content  \n",
       "0  I must preface this by saying that I am not re...  \n",
       "1  As a big time, long term Trevanian fan, I was ...  \n",
       "2  This album is hot as it wants to be. However C...  \n",
       "3  I read this book when I was in elementary scho...  \n",
       "4  Although this book is touted on several Anusar...  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_test = pd.read_csv(r'data/samples/sample_test.csv', dtype=str, quoting=0)\n",
    "processed_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "      <th>review_content_cleaned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>This is a great book</td>\n",
       "      <td>I must preface this by saying that I am not re...</td>\n",
       "      <td>must preface saying not religious but loved bo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Huge Disappointment.</td>\n",
       "      <td>As a big time, long term Trevanian fan, I was ...</td>\n",
       "      <td>big time long term trevanian fan extremely dis...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Wayne is tight but cant hang with Turk.</td>\n",
       "      <td>This album is hot as it wants to be. However C...</td>\n",
       "      <td>album hot want however cash money best album e...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Excellent</td>\n",
       "      <td>I read this book when I was in elementary scho...</td>\n",
       "      <td>read book elementary school probably fourth gr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Not about Anusara</td>\n",
       "      <td>Although this book is touted on several Anusar...</td>\n",
       "      <td>although book touted several anusara web site ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                             review_title  \\\n",
       "0             2                     This is a great book   \n",
       "1             1                     Huge Disappointment.   \n",
       "2             2  Wayne is tight but cant hang with Turk.   \n",
       "3             2                                Excellent   \n",
       "4             1                        Not about Anusara   \n",
       "\n",
       "                                      review_content  \\\n",
       "0  I must preface this by saying that I am not re...   \n",
       "1  As a big time, long term Trevanian fan, I was ...   \n",
       "2  This album is hot as it wants to be. However C...   \n",
       "3  I read this book when I was in elementary scho...   \n",
       "4  Although this book is touted on several Anusar...   \n",
       "\n",
       "                              review_content_cleaned  \n",
       "0  must preface saying not religious but loved bo...  \n",
       "1  big time long term trevanian fan extremely dis...  \n",
       "2  album hot want however cash money best album e...  \n",
       "3  read book elementary school probably fourth gr...  \n",
       "4  although book touted several anusara web site ...  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_test['review_content_cleaned'] = clean_text(processed_test['review_content'])\n",
    "processed_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3636ef3",
   "metadata": {},
   "source": [
    "## Test Content After Cleaning\n",
    "\n",
    "The test content column is now clean. The same observations apply as for training and validation — we'd expect the cleaned text to be readable, lowercase, and free of formatting noise. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b5e64aec",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review_target</th>\n",
       "      <th>review_title</th>\n",
       "      <th>review_content</th>\n",
       "      <th>review_content_cleaned</th>\n",
       "      <th>review_title_cleaned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>This is a great book</td>\n",
       "      <td>I must preface this by saying that I am not re...</td>\n",
       "      <td>must preface saying not religious but loved bo...</td>\n",
       "      <td>great book</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Huge Disappointment.</td>\n",
       "      <td>As a big time, long term Trevanian fan, I was ...</td>\n",
       "      <td>big time long term trevanian fan extremely dis...</td>\n",
       "      <td>huge disappointment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Wayne is tight but cant hang with Turk.</td>\n",
       "      <td>This album is hot as it wants to be. However C...</td>\n",
       "      <td>album hot want however cash money best album e...</td>\n",
       "      <td>wayne tight but cannot hang turk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Excellent</td>\n",
       "      <td>I read this book when I was in elementary scho...</td>\n",
       "      <td>read book elementary school probably fourth gr...</td>\n",
       "      <td>excellent</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Not about Anusara</td>\n",
       "      <td>Although this book is touted on several Anusar...</td>\n",
       "      <td>although book touted several anusara web site ...</td>\n",
       "      <td>not anusara</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  review_target                             review_title  \\\n",
       "0             2                     This is a great book   \n",
       "1             1                     Huge Disappointment.   \n",
       "2             2  Wayne is tight but cant hang with Turk.   \n",
       "3             2                                Excellent   \n",
       "4             1                        Not about Anusara   \n",
       "\n",
       "                                      review_content  \\\n",
       "0  I must preface this by saying that I am not re...   \n",
       "1  As a big time, long term Trevanian fan, I was ...   \n",
       "2  This album is hot as it wants to be. However C...   \n",
       "3  I read this book when I was in elementary scho...   \n",
       "4  Although this book is touted on several Anusar...   \n",
       "\n",
       "                              review_content_cleaned  \\\n",
       "0  must preface saying not religious but loved bo...   \n",
       "1  big time long term trevanian fan extremely dis...   \n",
       "2  album hot want however cash money best album e...   \n",
       "3  read book elementary school probably fourth gr...   \n",
       "4  although book touted several anusara web site ...   \n",
       "\n",
       "               review_title_cleaned  \n",
       "0                        great book  \n",
       "1               huge disappointment  \n",
       "2  wayne tight but cannot hang turk  \n",
       "3                         excellent  \n",
       "4                       not anusara  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_test['review_title_cleaned'] = clean_text(processed_test['review_title'])\n",
    "processed_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0bc2de8",
   "metadata": {},
   "source": [
    "## Test Titles After Cleaning\n",
    "\n",
    "Titles are cleaned separately and stored alongside content. This might seem like a small detail, but it reflects something we learned in EDA: titles and content carry different kinds of signal. Titles are often more sentiment-dense per word. Having them as a separate, clean column gives future modelling experiments the option to weight them differently or treat them as independent features."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d037f992",
   "metadata": {},
   "source": [
    "Save the cleaned data\n",
    "\n",
    "All three cleaned datasets are saved to `data/processed/`. From this point forward, every modelling notebook loads from here — no one touches the raw files again.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "2c4e029b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved dataframe processed_train.csv to data\\processed\\processed_train.csv\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'csv': WindowsPath('data/processed/processed_train.csv')}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "save(df_base='data/processed', df=processed_train, df_name='processed_train.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "6403bd9f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved dataframe processed_valid.csv to data\\processed\\processed_valid.csv\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'csv': WindowsPath('data/processed/processed_valid.csv')}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "save(df_base='data/processed', df=processed_valid, df_name='processed_valid.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "659e619f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved dataframe processed_test.csv to data\\processed\\processed_test.csv\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'csv': WindowsPath('data/processed/processed_test.csv')}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "save(df_base='data/processed', df=processed_test, df_name='processed_test.csv')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mlqueens",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}