{ "cells": [ { "cell_type": "markdown", "id": "cd2a615d", "metadata": {}, "source": [ "# Data Preprocessing\n", "\n", "EDA gave us a clear picture of what we're working with. Now the job is to make the raw text usable: transforming it from messy, inconsistent strings into a clean, uniform representation that a vectoriser can work with reliably.\n", "\n", "Models don't understand that \"GREAT\", \"great\", and \"great!!\" all carry the same meaning. They see three different tokens. Every inconsistency in the raw text is an opportunity for the model to learn a spurious pattern instead of genuine sentiment. Preprocessing is how we close those loopholes before they become problems.\n", "\n", "The decisions made here (what to strip, what to keep, how to combine fields) directly shape the feature space. We're not just cleaning data; we're making choices about what information the model is even allowed to see." ] }, { "cell_type": "code", "execution_count": 1, "id": "289e4659", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "eb18f311", "metadata": {}, "source": [ "## Environment Setup\n", "\n", "Move back to the project root so we can import functions from `src/utils/helpers.py`." ] }, { "cell_type": "code", "execution_count": 2, "id": "3eeedf88", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "e:\\AI_ML\\proj\\sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens\\notebooks\n", "E:\\AI_ML\\proj\\sentiment-analysis-of-amazon-reviews-using-machine-learning-ml-queens\n" ] } ], "source": [ "import os\n", "from pathlib import Path\n", "print(Path.cwd())\n", "# Switch to the project root so the `src` package can be imported\n", "os.chdir(Path('..').resolve())\n", "from src.utils.helpers import clean_text, save\n", "print(Path.cwd())" ] }, { "cell_type": "markdown", "id": "9da929eb", "metadata": {}, "source": [ "## Loading the Balanced Training Data\n", "\n", "We load the balanced dataset produced at the end of EDA, with basic cleaning already applied. 
The `.info()` check below is a quick sanity check that everything is still in order." ] }, { "cell_type": "code", "execution_count": 3, "id": "b899920f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 GREAT CAMRA \n", "1 1 not so great \n", "2 1 Inaccurate and disappointing \n", "3 1 Equus 3340 \n", "4 2 awesome sheets! \n", "\n", " review_content \\\n", "0 I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ... \n", "1 I'm using this book in an introductory organic... \n", "2 I only read the first few chapters and was bom... \n", "3 Feels cheaply made, the battery contacts were ... \n", "4 I love these sheets! They are sleek & smooth w... \n", "\n", " review_content_char_count review_content_word_count \n", "0 586 108 \n", "1 570 88 \n", "2 214 40 \n", "3 193 34 \n", "4 198 38 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "balanced_sample_train = pd.read_csv(r'data/processed/balanced_sample_train.csv', dtype=str, quoting=0)\n", "balanced_sample_train.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "5623cf6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 79972 entries, 0 to 79971\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 review_target 79972 non-null str \n", " 1 review_title 79972 non-null str \n", " 2 review_content 79972 non-null str \n", " 3 review_content_char_count 79972 non-null str \n", " 4 review_content_word_count 79972 non-null str \n", "dtypes: str(5)\n", "memory usage: 3.1 MB\n" ] } ], "source": [ "balanced_sample_train.info()" ] }, { "cell_type": "markdown", "id": "524acdc6", "metadata": {}, "source": [ "## What We're Looking At\n", "\n", "Three columns matter: `review_title`, `review_content`, and `review_target`. The target is straightforward: 1 for negative, 2 for positive. The text columns are where the work is done.\n", "\n", "Titles tend to be short, punchy, and deliberately expressive; customers often condense their entire opinion into a few words. 
Content is longer and more nuanced, but also noisier. Combining them gives the model access to both the headline sentiment and the full argument behind it, which is why we concatenate them rather than choosing one." ] }, { "cell_type": "markdown", "id": "d5b9401e", "metadata": {}, "source": [ "## Cleaning the Training Text\n", "\n", "The cleaning pipeline does several things in sequence, and the order matters. We concatenate title and content *before* cleaning so the join character doesn't survive as an artefact, then normalise to lowercase and remove punctuation, among other steps." ] }, { "cell_type": "code", "execution_count": 5, "id": "2deb74f4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 GREAT CAMRA \n", "1 1 not so great \n", "2 1 Inaccurate and disappointing \n", "3 1 Equus 3340 \n", "4 2 awesome sheets! \n", "\n", " review_content \\\n", "0 I HAVE HAD THE DX6340 FOR ABOUT A YEAR.I LOVE ... \n", "1 I'm using this book in an introductory organic... \n", "2 I only read the first few chapters and was bom... \n", "3 Feels cheaply made, the battery contacts were ... \n", "4 I love these sheets! They are sleek & smooth w... \n", "\n", " review_content_char_count review_content_word_count \\\n", "0 586 108 \n", "1 570 88 \n", "2 214 40 \n", "3 193 34 \n", "4 198 38 \n", "\n", " review_content_cleaned \n", "0 great camra dx6340 year love picture good 35m ... \n", "1 not great using book introductory organic spec... \n", "2 inaccurate disappointing read first chapter bo... \n", "3 equus 3340 feel cheaply made battery contact r... \n", "4 awesome sheet love sheet sleek smooth really c... " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_train = balanced_sample_train.copy()\n", "processed_train['review_content_cleaned'] = clean_text(processed_train['review_title'].fillna('') + ' ' + processed_train['review_content'].fillna(''))\n", "processed_train.head()" ] }, { "cell_type": "markdown", "id": "0be1b03e", "metadata": {}, "source": [ "## What the Cleaned Text Looks Like\n", "\n", "The `review_content_cleaned` column should now contain plain, lowercase text with no HTML, no punctuation, and no stray whitespace. 
Spot-checking a few rows is worth doing here, particularly any that looked unusual in the raw data (very short reviews, reviews with lots of special characters, non-English text that slipped through).\n", "\n", "What we're looking for: the cleaned text should still be readable, not near-empty or over-stripped." ] }, { "cell_type": "markdown", "id": "09704c70", "metadata": {}, "source": [ "## Applying the Same Pipeline to Validation Data\n", "\n", "The validation set must go through exactly the same cleaning steps as training: same function, same parameters, same concatenation logic. Any deviation creates a mismatch between the distribution the model was trained on and the distribution it's being evaluated against, which would make our validation metrics unreliable." ] }, { "cell_type": "code", "execution_count": 6, "id": "9f585815", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 Everything you need \n", "1 1 Important note about carrier \n", "2 1 Not a musical instrument -cannot be played \n", "3 2 Do I Iike this monitor? Well... I have 2! \n", "4 1 Very disappointing \n", "\n", " review_content \n", "0 This is a wonderful book. It may have been mea... \n", "1 The carrier is very cute, and lightweight...ho... \n", "2 I bought (elsewhere) one of these harps for my... \n", "3 I have 2 of these babies hooked up to a dual-o... \n", "4 This book is very poorly written and lacks of ... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_valid = pd.read_csv(r'data/samples/sample_valid.csv', dtype=str, quoting=0)\n", "processed_valid.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "07d9f22d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 Everything you need \n", "1 1 Important note about carrier \n", "2 1 Not a musical instrument -cannot be played \n", "3 2 Do I Iike this monitor? Well... I have 2! \n", "4 1 Very disappointing \n", "\n", " review_content \\\n", "0 This is a wonderful book. It may have been mea... \n", "1 The carrier is very cute, and lightweight...ho... \n", "2 I bought (elsewhere) one of these harps for my... \n", "3 I have 2 of these babies hooked up to a dual-o... \n", "4 This book is very poorly written and lacks of ... \n", "\n", " review_content_cleaned \n", "0 everything need wonderful book may meant clerg... \n", "1 important note carrier carrier very cute light... \n", "2 not musical instrument cannot played bought el... \n", "3 iike monitor well baby hooked dual output digi... \n", "4 very disappointing book very poorly written la... " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_valid['review_content_cleaned'] = clean_text(processed_valid['review_title'].fillna('') + ' ' + processed_valid['review_content'].fillna(''))\n", "processed_valid.head()\n" ] }, { "cell_type": "markdown", "id": "cc1443d5", "metadata": {}, "source": [ "## Validation Data After Cleaning\n", "\n", "The validation set is now in the same form as the training set: a single `review_content_cleaned` column containing lowercased, punctuation-free text. No information from the training set has leaked into this process.\n" ] }, { "cell_type": "markdown", "id": "8168fa1d", "metadata": {}, "source": [ "## Applying the Pipeline to Test Data\n", "\n", "The test set is treated with particular care. We don't examine its label distribution, don't compute statistics on it to inform any decisions, and we certainly don't adjust the cleaning pipeline based on anything we see in it. 
It's processed mechanically, using the same `clean_text` function as training and validation.\n", "\n", "Note: unlike the training and validation sets (which concatenated title + content), the test cleaning here is applied separately to content and title. This gives us flexibility to experiment with different feature combinations at the modelling stage without needing to re-run preprocessing. It also means `review_content_cleaned` covers content only for the test set, so the two cleaned columns need to be combined before they can stand in for the concatenated feature used in training." ] }, { "cell_type": "code", "execution_count": 8, "id": "5eb6491f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 This is a great book \n", "1 1 Huge Disappointment. \n", "2 2 Wayne is tight but cant hang with Turk. \n", "3 2 Excellent \n", "4 1 Not about Anusara \n", "\n", " review_content \n", "0 I must preface this by saying that I am not re... \n", "1 As a big time, long term Trevanian fan, I was ... \n", "2 This album is hot as it wants to be. However C... \n", "3 I read this book when I was in elementary scho... \n", "4 Although this book is touted on several Anusar... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_test = pd.read_csv(r'data/samples/sample_test.csv', dtype=str, quoting=0)\n", "processed_test.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 This is a great book \n", "1 1 Huge Disappointment. \n", "2 2 Wayne is tight but cant hang with Turk. \n", "3 2 Excellent \n", "4 1 Not about Anusara \n", "\n", " review_content \\\n", "0 I must preface this by saying that I am not re... \n", "1 As a big time, long term Trevanian fan, I was ... \n", "2 This album is hot as it wants to be. However C... \n", "3 I read this book when I was in elementary scho... \n", "4 Although this book is touted on several Anusar... \n", "\n", " review_content_cleaned \n", "0 must preface saying not religious but loved bo... \n", "1 big time long term trevanian fan extremely dis... \n", "2 album hot want however cash money best album e... \n", "3 read book elementary school probably fourth gr... \n", "4 although book touted several anusara web site ... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_test['review_content_cleaned'] = clean_text(processed_test['review_content'])\n", "processed_test.head()" ] }, { "cell_type": "markdown", "id": "f3636ef3", "metadata": {}, "source": [ "## Test Content After Cleaning\n", "\n", "The test content column is now clean. The same observations apply as for training and validation — we'd expect the cleaned text to be readable, lowercase, and free of formatting noise. " ] }, { "cell_type": "code", "execution_count": 10, "id": "b5e64aec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " review_target review_title \\\n", "0 2 This is a great book \n", "1 1 Huge Disappointment. \n", "2 2 Wayne is tight but cant hang with Turk. \n", "3 2 Excellent \n", "4 1 Not about Anusara \n", "\n", " review_content \\\n", "0 I must preface this by saying that I am not re... \n", "1 As a big time, long term Trevanian fan, I was ... \n", "2 This album is hot as it wants to be. However C... \n", "3 I read this book when I was in elementary scho... \n", "4 Although this book is touted on several Anusar... \n", "\n", " review_content_cleaned \\\n", "0 must preface saying not religious but loved bo... \n", "1 big time long term trevanian fan extremely dis... \n", "2 album hot want however cash money best album e... \n", "3 read book elementary school probably fourth gr... \n", "4 although book touted several anusara web site ... \n", "\n", " review_title_cleaned \n", "0 great book \n", "1 huge disappointment \n", "2 wayne tight but cannot hang turk \n", "3 excellent \n", "4 not anusara " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processed_test['review_title_cleaned'] = clean_text(processed_test['review_title'])\n", "processed_test.head()" ] }, { "cell_type": "markdown", "id": "b0bc2de8", "metadata": {}, "source": [ "## Test Titles After Cleaning\n", "\n", "Titles are cleaned separately and stored alongside content. This might seem like a small detail, but it reflects something we learned in EDA: titles and content carry different kinds of signal. Titles are often more sentiment-dense per word. Having them as a separate, clean column gives future modelling experiments the option to weight them differently or treat them as independent features." ] }, { "cell_type": "markdown", "id": "d037f992", "metadata": {}, "source": [ "## Saving the Cleaned Data\n", "\n", "All three cleaned datasets are saved to `data/processed/`. 
From this point forward, every modelling notebook loads from here — no one touches the raw files again.\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "2c4e029b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved dataframe processed_train.csv to data\\processed\\processed_train.csv\n" ] }, { "data": { "text/plain": [ "{'csv': WindowsPath('data/processed/processed_train.csv')}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "save(df_base='data/processed', df=processed_train, df_name='processed_train.csv')" ] }, { "cell_type": "code", "execution_count": 12, "id": "6403bd9f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved dataframe processed_valid.csv to data\\processed\\processed_valid.csv\n" ] }, { "data": { "text/plain": [ "{'csv': WindowsPath('data/processed/processed_valid.csv')}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "save(df_base='data/processed', df=processed_valid, df_name='processed_valid.csv')" ] }, { "cell_type": "code", "execution_count": 13, "id": "659e619f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved dataframe processed_test.csv to data\\processed\\processed_test.csv\n" ] }, { "data": { "text/plain": [ "{'csv': WindowsPath('data/processed/processed_test.csv')}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "save(df_base='data/processed', df=processed_test, df_name='processed_test.csv')" ] } ], "metadata": { "kernelspec": { "display_name": "mlqueens", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }