{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "4ba6aba8" }, "source": [ "# 🤖 **Data Collection, Creation, Storage, and Processing**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jpASMyIQMaAq" }, "source": [ "## **1.** 📦 Install required packages" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f48c8f8c", "outputId": "8774578b-9294-4814-a823-8f7566e102d1" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n", "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n", "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n", "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n", "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.3)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.61.1)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.4.9)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n", "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n", "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n", "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.1)\n", "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob" ] }, { "cell_type": "markdown", "metadata": { "id": "lquNYCbfL9IM" }, "source": [ "## **2.** ⛏ Web-scrape all book titles, prices, and ratings from books.toscrape.com" ] }, { "cell_type": "markdown", "metadata": { "id": "0IWuNpxxYDJF" }, "source": [ "### *a. 
Initial setup*\n", "Define the base URL of the site to scrape, the request headers, and the lists that will hold the scraped titles, prices, and ratings." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "91d52125" }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "import time\n", "\n", "base_url = \"https://books.toscrape.com/catalogue/page-{}.html\"\n", "headers = {\"User-Agent\": \"Mozilla/5.0\"}\n", "\n", "titles, prices, ratings = [], [], []" ] }, { "cell_type": "markdown", "metadata": { "id": "oCdTsin2Yfp3" }, "source": [ "### *b. Fill titles, prices, and ratings from the web pages*" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "id": "xqO5Y3dnYhxt" }, "outputs": [], "source": [ "# Loop through all 50 pages\n", "for page in range(1, 51):\n", "    url = base_url.format(page)\n", "    response = requests.get(url, headers=headers)\n", "    soup = BeautifulSoup(response.content, \"html.parser\")\n", "    books = soup.find_all(\"article\", class_=\"product_pod\")\n", "\n", "    for book in books:\n", "        titles.append(book.h3.a[\"title\"])\n", "        # Drop the leading currency symbol before converting to float\n", "        prices.append(float(book.find(\"p\", class_=\"price_color\").text[1:]))\n", "        # The second class of the first paragraph tag is the rating word, e.g. \"Three\"\n", "        ratings.append(book.p.get(\"class\")[1])\n", "\n", "    time.sleep(0.5)  # polite scraping delay" ] }, { "cell_type": "markdown", "metadata": { "id": "T0TOeRC4Yrnn" }, "source": [ "### *c. ✋🏻🛑⛔️ Create a DataFrame df_books from the completed \"title\", \"price\", and \"rating\" lists*" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "l5FkkNhUYTHh" }, "outputs": [], "source": [ "df_books = pd.DataFrame({\n", "    \"title\": titles,\n", "    \"price\": prices,\n", "    \"rating\": ratings\n", "})" ] }, { "cell_type": "markdown", "metadata": { "id": "duI5dv3CZYvF" }, "source": [ "### *d. 
Save the web-scraped DataFrame as either a CSV or an Excel file*" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "lC1U_YHtZifh" }, "outputs": [], "source": [ "# 💾 Save to CSV\n", "df_books.to_csv(\"books_data.csv\", index=False)\n", "\n", "# 💾 Or save to Excel\n", "# df_books.to_excel(\"books_data.xlsx\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "qMjRKMBQZlJi" }, "source": [ "### *e. ✋🏻🛑⛔️ View the first few lines*" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "O_wIvTxYZqCK" }, "outputs": [], "source": [ "print(df_books.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "p-1Pr2szaqLk" }, "source": [ "## **3.** 🧩 Create a meaningful connection between real & synthetic datasets" ] }, { "cell_type": "markdown", "metadata": { "id": "SIaJUGIpaH4V" }, "source": [ "### *a. Initial setup*" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "-gPXGcRPuV_9" }, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "from datetime import datetime\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "random.seed(2025)\n", "np.random.seed(2025)" ] }, { "cell_type": "markdown", "metadata": { "id": "pY4yCoIuaQqp" }, "source": [ "### *b. Define a generate_popularity_score function that derives a popularity score from the rating (with some randomness)*" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "id": "mnd5hdAbaNjz" }, "outputs": [], "source": [ "def generate_popularity_score(rating):\n", "    # Base score per rating word; unknown ratings default to 3\n", "    base = {\"One\": 2, \"Two\": 3, \"Three\": 3, \"Four\": 4, \"Five\": 4}.get(rating, 3)\n", "    # Random nudge of -1, 0, or +1, drawn with weights 1:3:2\n", "    trend_factor = random.choices([-1, 0, 1], weights=[1, 3, 2])[0]\n", "    # Keep the final score within 1..5\n", "    return int(np.clip(base + trend_factor, 1, 5))" ] }, { "cell_type": "markdown", "metadata": { "id": "n4-TaNTFgPak" }, "source": [ "### *c. 
✋🏻🛑⛔️ Run the function to create a \"popularity_score\" column from \"rating\"*" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "id": "V-G3OCUCgR07" }, "outputs": [], "source": [ "df_books[\"popularity_score\"] = df_books[\"rating\"].apply(generate_popularity_score)" ] }, { "cell_type": "markdown", "metadata": { "id": "HnngRNTgacYt" }, "source": [ "### *d. Define a get_sentiment function that maps the popularity score to a sentiment_label*" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "id": "kUtWmr8maZLZ" }, "outputs": [], "source": [ "def get_sentiment(popularity_score):\n", "    # Scores 1-2 are negative, 3 is neutral, 4-5 are positive\n", "    if popularity_score <= 2:\n", "        return \"negative\"\n", "    elif popularity_score == 3:\n", "        return \"neutral\"\n", "    else:\n", "        return \"positive\"" ] }, { "cell_type": "markdown", "metadata": { "id": "HF9F9HIzgT7Z" }, "source": [ "### *e. ✋🏻🛑⛔️ Run the function to create a \"sentiment_label\" column from \"popularity_score\"*" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "id": "tafQj8_7gYCG" }, "outputs": [], "source": [ "df_books[\"sentiment_label\"] = df_books[\"popularity_score\"].apply(get_sentiment)" ] }, { "cell_type": "markdown", "metadata": { "id": "T8AdKkmASq9a" }, "source": [ "## **4.** 📈 Generate 18 months of synthetic book sales data" ] }, { "cell_type": "markdown", "metadata": { "id": "OhXbdGD5fH0c" }, "source": [ "### *a. 
Create a generate_sales_profile function that generates sales patterns based on sentiment_label (with some randomness)*" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "qkVhYPXGbgEn" }, "outputs": [], "source": [ "def generate_sales_profile(sentiment):\n", "    # 18 month-end dates ending at today's date\n", "    months = pd.date_range(end=datetime.today(), periods=18, freq=\"M\")\n", "\n", "    if sentiment == \"positive\":\n", "        # High baseline with a rising trend\n", "        base = random.randint(200, 300)\n", "        trend = np.linspace(base, base + random.randint(20, 60), len(months))\n", "    elif sentiment == \"negative\":\n", "        # Low baseline with a falling trend\n", "        base = random.randint(20, 80)\n", "        trend = np.linspace(base, base - random.randint(10, 30), len(months))\n", "    else:  # neutral\n", "        # Mid-range baseline with a flat trend\n", "        base = random.randint(80, 160)\n", "        trend = np.full(len(months), base + random.randint(-10, 10))\n", "\n", "    # Add a sinusoidal seasonal component and Gaussian noise, then floor at zero\n", "    seasonality = 10 * np.sin(np.linspace(0, 3 * np.pi, len(months)))\n", "    noise = np.random.normal(0, 5, len(months))\n", "    monthly_sales = np.clip(trend + seasonality + noise, a_min=0, a_max=None).astype(int)\n", "\n", "    return list(zip(months.strftime(\"%Y-%m\"), monthly_sales))" ] }, { "cell_type": "markdown", "metadata": { "id": "L2ak1HlcgoTe" }, "source": [ "### *b. Run the function as part of building sales_data*" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "SlJ24AUafoDB" }, "outputs": [], "source": [ "sales_data = []\n", "for _, row in df_books.iterrows():\n", "    records = generate_sales_profile(row[\"sentiment_label\"])\n", "    for month, units in records:\n", "        sales_data.append({\n", "            \"title\": row[\"title\"],\n", "            \"month\": month,\n", "            \"units_sold\": units,\n", "            \"sentiment_label\": row[\"sentiment_label\"]\n", "        })" ] }, { "cell_type": "markdown", "metadata": { "id": "4IXZKcCSgxnq" }, "source": [ "### *c. 
✋🏻🛑⛔️ Create a df_sales DataFrame from sales_data*" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "id": "wcN6gtiZg-ws" }, "outputs": [], "source": [ "df_sales = pd.DataFrame(sales_data)" ] }, { "cell_type": "markdown", "metadata": { "id": "EhIjz9WohAmZ" }, "source": [ "### *d. Save df_sales as synthetic_sales_data.csv & view first few lines*" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MzbZvLcAhGaH", "outputId": "3ec742ad-27bb-4b28-e6c5-139b4908daf7" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " title month units_sold sentiment_label\n", "0 A Light in the Attic 2024-09 100 neutral\n", "1 A Light in the Attic 2024-10 109 neutral\n", "2 A Light in the Attic 2024-11 102 neutral\n", "3 A Light in the Attic 2024-12 107 neutral\n", "4 A Light in the Attic 2025-01 108 neutral\n" ] } ], "source": [ "df_sales.to_csv(\"synthetic_sales_data.csv\", index=False)\n", "\n", "print(df_sales.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "7g9gqBgQMtJn" }, "source": [ "## **5.** 🎯 Generate synthetic customer reviews" ] }, { "cell_type": "markdown", "metadata": { "id": "Gi4y9M9KuDWx" }, "source": [ "### *a. ✋🏻🛑⛔️ Ask ChatGPT to create a dictionary called synthetic_reviews_by_sentiment that holds 50 distinct generic book review texts for each of the sentiment labels \"positive\", \"neutral\", and \"negative\"*" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "b3cd2a50" }, "outputs": [], "source": [ "synthetic_reviews_by_sentiment = {\n", " \"positive\": [\n", " \"A compelling and heartwarming read that stayed with me long after I finished.\",\n", " \"Brilliantly written! The characters were unforgettable and the plot was engaging.\",\n", " \"One of the best books I've read this year: inspiring and emotionally rich.\",\n", " \"Absolutely captivating from start to finish. 
Highly recommend!\",\n", " \"A masterpiece of storytelling. I couldn't put it down.\",\n", " \"Incredible character development and a truly unique plot. Loved it.\",\n", " \"This book is a gem. Profound, moving, and beautifully written.\",\n", " \"So insightful and thought-provoking. A must-read for everyone.\",\n", " \"An exhilarating journey! The pacing was perfect and the ending satisfying.\",\n", " \"Simply perfect. Every sentence was a delight.\",\n", " \"A truly enriching reading experience. I'm already looking forward to rereading it.\",\n", " \"Fantastic world-building and unforgettable characters. Pure joy.\",\n", " \"This author truly understands the human heart. A powerful read.\",\n", " \"Gripping and suspenseful. I was on the edge of my seat the whole time.\",\n", " \"A profound exploration of themes that resonate deeply. Bravo!\",\n", " \"Filled with wisdom and charm. This book has a permanent place on my shelf.\",\n", " \"An absolute page-turner. I devoured it in one sitting.\",\n", " \"The prose is exquisite, and the story is simply beautiful.\",\n", " \"Highly original and utterly brilliant. A true standout.\",\n", " \"A wonderful escape. I felt completely immersed in this world.\",\n", " \"Charming, witty, and surprisingly deep. A delightful read.\",\n", " \"Could not recommend this enough! It's an instant classic.\",\n", " \"A powerful narrative that will stay with you long after you've finished.\",\n", " \"The perfect blend of adventure, mystery, and heart. Loved it all.\",\n", " \"Every chapter was a new discovery. A truly engaging book.\",\n", " \"Simply magnificent! The best book I've read all year.\",\n", " \"A triumph of imagination and skill. Absolutely stunning.\",\n", " \"The characters leaped off the page. I felt like I knew them.\",\n", " \"A beautifully crafted story with a powerful message.\",\n", " \"So much more than just a story; it's an experience.\",\n", " \"I'm thoroughly impressed. 
A brilliant debut/continuation.\",\n", " \"This book has it all: suspense, romance, and thought-provoking ideas.\",\n", " \"A truly unforgettable tale. I'll be recommending this to everyone.\",\n", " \"The writing is so vivid and evocative. A true artist at work.\",\n", " \"An inspiring story that filled me with hope and wonder.\",\n", " \"Pure magic from start to finish. I'm so glad I read this.\",\n", " \"A cleverly constructed plot with a satisfying resolution.\",\n", " \"Filled with humor and heart. A joy to read.\",\n", " \"A literary masterpiece that deserves all the accolades.\",\n", " \"Absolutely adore this book! It brought me so much joy.\",\n", " \"Deeply moving and incredibly well-written. A truly special book.\",\n", " \"A fantastic read that will transport you to another world.\",\n", " \"The plot twists kept me guessing until the very end.\",\n", " \"Insightful and beautifully articulated. A true work of art.\",\n", " \"I couldn't ask for a better book. It exceeded all expectations.\",\n", " \"A powerful and important story that needs to be told.\",\n", " \"Refreshing and original. A breath of fresh air.\",\n", " \"Every detail was perfect. A testament to the author's talent.\",\n", " \"This book is a journey you won't soon forget. Highly recommend.\",\n", " \"A truly delightful and enchanting story.\"\n", " ],\n", " \"neutral\": [\n", " \"An average book: not great, but not bad either.\",\n", " \"Some parts really stood out, others felt a bit flat.\",\n", " \"It was okay overall. 
A decent way to pass the time.\",\n", " \"Had potential that went unrealized.\",\n", " \"The themes were solid, but not well explored.\",\n", " \"It simply lacked that emotional punch.\",\n", " \"Serviceable but not something I'd go out of my way to recommend.\",\n", " \"Standard fare with some promise.\",\n", " \"A mixed bag of strong moments and weaker sections.\",\n", " \"Neither impressed nor disappointed me.\",\n", " \"Felt a bit generic, but readable.\",\n", " \"I appreciated some aspects, but others left me wanting more.\",\n", " \"A middle-of-the-road experience.\",\n", " \"It's fine. Nothing particularly memorable.\",\n", " \"The plot was a bit slow in places.\",\n", " \"Characters were okay, but not very deep.\",\n", " \"I finished it, but I won't rush to reread it.\",\n", " \"Pretty standard genre fare.\",\n", " \"Didn't hate it, didn't love it.\",\n", " \"Competent, but not captivating.\",\n", " \"It had its moments, but they were few and far between.\",\n", " \"A largely forgettable read.\",\n", " \"The writing was adequate, nothing more.\",\n", " \"Could have been better, could have been worse.\",\n", " \"Just a simple story, well-executed in parts.\",\n", " \"I found it somewhat predictable.\",\n", " \"Not bad for a quick read, but not engaging enough for me.\",\n", " \"The premise was interesting, but the execution was lacking.\",\n", " \"I'm indifferent to this one.\",\n", " \"A perfectly acceptable, if uninspired, book.\",\n", " \"Some good ideas, but they didn't quite gel.\",\n", " \"It's a book that exists, and I read it.\",\n", " \"I've read better, I've read worse.\",\n", " \"It didn't grab me, but I didn't dislike it either.\",\n", " \"A solid effort, just not my cup of tea.\",\n", " \"The story was coherent, but lacked spark.\",\n", " \"I'd probably forget about this book in a few weeks.\",\n", " \"It passed the time, which is something.\",\n", " \"Felt like it needed another round of editing.\",\n", " \"The pacing was uneven.\",\n", " \"I 
had higher hopes for this one.\",\n", " \"A decent attempt, but nothing groundbreaking.\",\n", " \"Not much to say about it, really.\",\n", " \"It was readable, I'll give it that.\",\n", " \"The ending was a bit anticlimactic.\",\n", " \"I wouldn't discourage someone from reading it, but wouldn't highly recommend.\",\n", " \"Some interesting concepts, but poorly developed.\",\n", " \"I didn't feel any strong emotions while reading.\",\n", " \"It's a book that fulfills its purpose, nothing more.\",\n", " \"Left me feeling neither satisfied nor disappointed.\"\n", " ],\n", " \"negative\": [\n", " \"I struggled to get through this one; it just didn't grab me.\",\n", " \"The plot was confusing and the characters felt underdeveloped.\",\n", " \"Disappointing. I had high hopes, but they weren't met.\",\n", " \"A complete waste of time. Avoid at all costs.\",\n", " \"Poorly written and incredibly boring.\",\n", " \"The storyline was a mess, and the dialogue was unnatural.\",\n", " \"I found it nearly unreadable. A huge disappointment.\",\n", " \"Nothing about this book worked for me. Utterly terrible.\",\n", " \"Confusing, frustrating, and ultimately unrewarding.\",\n", " \"Wish I could unread this. A truly awful book.\",\n", " \"The premise was interesting, but the execution was terrible.\",\n", " \"Flat characters and a nonsensical plot. Don't bother.\",\n", " \"I kept waiting for it to get better, but it never did.\",\n", " \"A genuinely painful reading experience.\",\n", " \"This book felt like a chore to finish.\",\n", " \"Absolutely dreadful. I regret every minute I spent on it.\",\n", " \"Uninspired, unoriginal, and unengaging.\",\n", " \"The writing was clunky, and the plot was full of holes.\",\n", " \"I can't believe this got published. Truly awful.\",\n", " \"A frustrating and disappointing read from beginning to end.\",\n", " \"Lacked any redeeming qualities. 
Do not recommend.\",\n", " \"This book is a prime example of how not to write a story.\",\n", " \"Made no sense at all. A complete failure.\",\n", " \"I'm angry I wasted my time on this.\",\n", " \"The characters were annoying, and the plot was ludicrous.\",\n", " \"This book was a major letdown in every possible way.\",\n", " \"Filled with clichés and predictable twists.\",\n", " \"I honestly have nothing positive to say about this book.\",\n", " \"An amateurish attempt at storytelling.\",\n", " \"The prose was cringeworthy, and the plot was absurd.\",\n", " \"I wanted to like it, but it was just so bad.\",\n", " \"A total disaster of a book.\",\n", " \"Completely failed to engage me. Boring.\",\n", " \"One of the worst books I've ever read.\",\n", " \"Nothing but wasted potential. So disappointing.\",\n", " \"The author clearly didn't care about the story or characters.\",\n", " \"A truly unpleasant reading experience.\",\n", " \"I forced myself to finish it, but it wasn't worth it.\",\n", " \"A mess of a story with no clear direction.\",\n", " \"This book is a perfect example of what not to do.\",\n", " \"Could not connect with any of the characters.\",\n", " \"I feel cheated after reading this. So bad.\",\n", " \"The pacing was terrible, and the plot dragged.\",\n", " \"An utterly forgettable and poorly executed novel.\",\n", " \"I'm genuinely puzzled by how this book got good reviews.\",\n", " \"A truly awful book that should be avoided.\",\n", " \"The writing was so bad it distracted from the story.\",\n", " \"This book tested my patience beyond its limits.\",\n", " \"I have no idea what the author was trying to achieve.\",\n", " \"A thoroughly unenjoyable experience.\"\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "fQhfVaDmuULT" }, "source": [ "### *b. 
Generate 10 reviews per book by randomly sampling from the 50 texts that match its sentiment label*" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "l2SRc3PjuTGM" }, "outputs": [], "source": [ "review_rows = []\n", "for _, row in df_books.iterrows():\n", "    title = row['title']\n", "    sentiment_label = row['sentiment_label']\n", "    review_pool = synthetic_reviews_by_sentiment[sentiment_label]\n", "    # Ensure we don't try to sample more than available reviews\n", "    num_samples = min(10, len(review_pool))\n", "    # random.sample draws without replacement, so a book never repeats a review\n", "    sampled_reviews = random.sample(review_pool, num_samples)\n", "    for review_text in sampled_reviews:\n", "        review_rows.append({\n", "            \"title\": title,\n", "            \"sentiment_label\": sentiment_label,\n", "            \"review_text\": review_text,\n", "            \"rating\": row['rating'],\n", "            \"popularity_score\": row['popularity_score']\n", "        })" ] }, { "cell_type": "markdown", "metadata": { "id": "bmJMXF-Bukdm" }, "source": [ "### *c. Create the final dataframe df_reviews & save it as synthetic_book_reviews.csv*" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "id": "ZUKUqZsuumsp" }, "outputs": [], "source": [ "df_reviews = pd.DataFrame(review_rows)\n", "df_reviews.to_csv(\"synthetic_book_reviews.csv\", index=False)" ] }, { "cell_type": "markdown", "source": [ "### *c. 
Prepare inputs for R*" ], "metadata": { "id": "_602pYUS3gY5" } }, { "cell_type": "code", "execution_count": 52, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3946e521", "outputId": "e46afeac-ccff-4658-d4fa-841f401e5447" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "✅ Wrote synthetic_title_level_features.csv\n", "✅ Wrote synthetic_monthly_revenue_series.csv\n" ] } ], "source": [ "import numpy as np\n", "\n", "def _safe_num(s):\n", "    # Strip everything except digits and dots, then coerce to numeric (NaN on failure)\n", "    return pd.to_numeric(\n", "        pd.Series(s).astype(str).str.replace(r\"[^0-9.]\", \"\", regex=True),\n", "        errors=\"coerce\"\n", "    )\n", "\n", "# Ratings are scraped as words (\"One\" to \"Five\"); _safe_num would turn them all into NaN,\n", "# so map them to integers instead\n", "rating_map = {\"One\": 1, \"Two\": 2, \"Three\": 3, \"Four\": 4, \"Five\": 5}\n", "\n", "# --- Clean book metadata (price/rating) ---\n", "df_books_r = df_books.copy()\n", "if \"price\" in df_books_r.columns:\n", "    df_books_r[\"price\"] = _safe_num(df_books_r[\"price\"])\n", "if \"rating\" in df_books_r.columns:\n", "    df_books_r[\"rating\"] = df_books_r[\"rating\"].map(rating_map)\n", "\n", "df_books_r[\"title\"] = df_books_r[\"title\"].astype(str).str.strip()\n", "\n", "# --- Clean sales ---\n", "df_sales_r = df_sales.copy()\n", "df_sales_r[\"title\"] = df_sales_r[\"title\"].astype(str).str.strip()\n", "df_sales_r[\"month\"] = pd.to_datetime(df_sales_r[\"month\"], errors=\"coerce\")\n", "df_sales_r[\"units_sold\"] = _safe_num(df_sales_r[\"units_sold\"])\n", "\n", "# --- Clean reviews ---\n", "df_reviews_r = df_reviews.copy()\n", "df_reviews_r[\"title\"] = df_reviews_r[\"title\"].astype(str).str.strip()\n", "df_reviews_r[\"sentiment_label\"] = df_reviews_r[\"sentiment_label\"].astype(str).str.lower().str.strip()\n", "if \"rating\" in df_reviews_r.columns:\n", "    df_reviews_r[\"rating\"] = df_reviews_r[\"rating\"].map(rating_map)\n", "if \"popularity_score\" in df_reviews_r.columns:\n", "    df_reviews_r[\"popularity_score\"] = _safe_num(df_reviews_r[\"popularity_score\"])\n", "\n", "# --- Sentiment shares per title (from reviews) ---\n", "sent_counts = (\n", "    df_reviews_r.groupby([\"title\", \"sentiment_label\"])\n", "    .size()\n", "    .unstack(fill_value=0)\n", ")\n", "for lab in [\"positive\", \"neutral\", \"negative\"]:\n", "    if lab not in sent_counts.columns:\n", "        sent_counts[lab] = 0\n", "\n", "sent_counts[\"total_reviews\"] = sent_counts[[\"positive\", \"neutral\", \"negative\"]].sum(axis=1)\n", "den = sent_counts[\"total_reviews\"].replace(0, np.nan)\n", "sent_counts[\"share_positive\"] = sent_counts[\"positive\"] / den\n", "sent_counts[\"share_neutral\"] = sent_counts[\"neutral\"] / den\n", "sent_counts[\"share_negative\"] = sent_counts[\"negative\"] / den\n", "sent_counts = sent_counts.reset_index()\n", "\n", "# --- Sales aggregation per title ---\n", "sales_by_title = (\n", "    df_sales_r.dropna(subset=[\"title\"])\n", "    .groupby(\"title\", as_index=False)\n", "    .agg(\n", "        months_observed=(\"month\", \"nunique\"),\n", "        avg_units_sold=(\"units_sold\", \"mean\"),\n", "        total_units_sold=(\"units_sold\", \"sum\"),\n", "    )\n", ")\n", "\n", "# --- Title-level features (join sales + books + sentiment) ---\n", "df_title = (\n", "    sales_by_title\n", "    .merge(df_books_r[[\"title\", \"price\", \"rating\"]], on=\"title\", how=\"left\")\n", "    .merge(sent_counts[[\"title\", \"share_positive\", \"share_neutral\", \"share_negative\", \"total_reviews\"]],\n", "           on=\"title\", how=\"left\")\n", ")\n", "\n", "df_title[\"avg_revenue\"] = df_title[\"avg_units_sold\"] * df_title[\"price\"]\n", "df_title[\"total_revenue\"] = df_title[\"total_units_sold\"] * df_title[\"price\"]\n", "\n", "df_title.to_csv(\"synthetic_title_level_features.csv\", index=False)\n", "print(\"✅ Wrote synthetic_title_level_features.csv\")\n", "\n", "# --- Monthly revenue series (proxy: units_sold * price) ---\n", "monthly_rev = (\n", "    df_sales_r.merge(df_books_r[[\"title\", \"price\"]], on=\"title\", how=\"left\")\n", ")\n", "monthly_rev[\"revenue\"] = monthly_rev[\"units_sold\"] * monthly_rev[\"price\"]\n", "\n", "df_monthly = (\n", "    monthly_rev.dropna(subset=[\"month\"])\n", "    .groupby(\"month\", as_index=False)[\"revenue\"]\n", "    .sum()\n", "    .rename(columns={\"revenue\": \"total_revenue\"})\n", "    .sort_values(\"month\")\n", ")\n", "# If revenue is all NA (e.g., missing price), fall back to units_sold as a teaching proxy.\n", "# Check the row-level revenue: a grouped sum turns all-NaN groups into 0 and would hide the problem.\n", "if monthly_rev[\"revenue\"].notna().sum() == 0:\n", "    df_monthly = (\n", "        df_sales_r.dropna(subset=[\"month\"])\n", "        .groupby(\"month\", as_index=False)[\"units_sold\"]\n", "        .sum()\n", "        .rename(columns={\"units_sold\": \"total_revenue\"})\n", "        .sort_values(\"month\")\n", "    )\n", "\n", "df_monthly[\"month\"] = pd.to_datetime(df_monthly[\"month\"], errors=\"coerce\").dt.strftime(\"%Y-%m-%d\")\n", "df_monthly.to_csv(\"synthetic_monthly_revenue_series.csv\", index=False)\n", "print(\"✅ Wrote synthetic_monthly_revenue_series.csv\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "RYvGyVfXuo54" }, "source": [ "### *d. ✋🏻🛑⛔️ View the first few lines*" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xfE8NMqOurKo", "outputId": "df2c4cf3-047a-4cc7-d624-47b2bfa2280d" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " title sentiment_label \\\n", "0 A Light in the Attic neutral \n", "1 A Light in the Attic neutral \n", "2 A Light in the Attic neutral \n", "3 A Light in the Attic neutral \n", "4 A Light in the Attic neutral \n", "\n", " review_text rating popularity_score \n", "0 Some interesting concepts, but poorly developed. Three 3 \n", "1 The writing was adequate, nothing more. Three 3 \n", "2 A solid effort, just not my cup of tea. Three 3 \n", "3 A mixed bag of strong moments and weaker secti... Three 3 \n", "4 The ending was a bit anticlimactic. Three 3 \n" ] } ], "source": [ "print(df_reviews.head())" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }