{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "4ba6aba8" }, "source": [ "# 🤖 **Data Collection, Creation, Storage, and Processing**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jpASMyIQMaAq" }, "source": [ "## **1.** 📦 Install required packages" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f48c8f8c", "outputId": "d47b18f8-74f7-4857-ebe4-805018b68572", "collapsed": true }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n", "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n", "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n", "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n", "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.3)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n", 
"Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.61.1)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.4.9)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n", "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n", "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n", "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.1)\n", "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob" ] }, { "cell_type": "markdown", "metadata": { "id": "lquNYCbfL9IM" }, "source": [ "## **2.** ⛏ Web-scrape all book titles, prices, and ratings from books.toscrape.com" ] }, { "cell_type": "markdown", "metadata": { "id": "0IWuNpxxYDJF" }, "source": [ "### *a. 
Initial setup*\n", "Define the base URL of the site to scrape, the request headers, and the lists that will hold each book's title, price, and rating." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "91d52125" }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "import time\n", "\n", "base_url = \"https://books.toscrape.com/catalogue/page-{}.html\"\n", "headers = {\"User-Agent\": \"Mozilla/5.0\"}\n", "\n", "titles, prices, ratings = [], [], []" ] }, { "cell_type": "markdown", "metadata": { "id": "oCdTsin2Yfp3" }, "source": [ "### *b. Fill titles, prices, and ratings from the web pages*" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "id": "xqO5Y3dnYhxt" }, "outputs": [], "source": [ "# Loop through all 50 catalogue pages (20 books per page)\n", "for page in range(1, 51):\n", "    url = base_url.format(page)\n", "    response = requests.get(url, headers=headers, timeout=10)\n", "    response.raise_for_status()  # stop early on a failed request\n", "    soup = BeautifulSoup(response.content, \"html.parser\")\n", "    books = soup.find_all(\"article\", class_=\"product_pod\")\n", "\n", "    for book in books:\n", "        titles.append(book.h3.a[\"title\"])\n", "        # Price text looks like \"£51.77\"; drop the currency symbol\n", "        prices.append(float(book.find(\"p\", class_=\"price_color\").text[1:]))\n", "        # The rating word is the second CSS class, e.g. [\"star-rating\", \"Three\"]\n", "        ratings.append(book.p.get(\"class\")[1])\n", "\n", "    time.sleep(0.5)  # polite scraping delay" ] }, { "cell_type": "markdown", "metadata": { "id": "T0TOeRC4Yrnn" }, "source": [ "### *c. 
✋🏻🛑⛔️ Create a dataframe df_books that contains the now complete \"title\", \"price\", and \"rating\" objects*" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "l5FkkNhUYTHh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "65bf5d74-5a0a-46af-dcff-51330ee61c29" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "DataFrame created with 1000 rows.\n" ] } ], "source": [ "# Create the dataframe from the scraped lists\n", "df_books = pd.DataFrame({\n", "    \"title\": titles,\n", "    \"price\": prices,\n", "    \"rating\": ratings\n", "})\n", "\n", "# Display the shape to confirm all 1000 books (20 per page * 50 pages) were captured\n", "print(f\"DataFrame created with {df_books.shape[0]} rows.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "duI5dv3CZYvF" }, "source": [ "### *d. Save web-scraped dataframe either as a CSV or Excel file*" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "lC1U_YHtZifh" }, "outputs": [], "source": [ "# 💾 Save to CSV\n", "df_books.to_csv(\"books_data.csv\", index=False)\n", "\n", "# 💾 Or save to Excel\n", "# df_books.to_excel(\"books_data.xlsx\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "qMjRKMBQZlJi" }, "source": [ "### *e. ✋🏻🛑⛔️ View the first few lines*" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "id": "O_wIvTxYZqCK", "outputId": "655b14dd-24a1-4f43-e66c-679699568163" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title price rating\n", "0 A Light in the Attic 51.77 Three\n", "1 Tipping the Velvet 53.74 One\n", "2 Soumission 50.10 One\n", "3 Sharp Objects 47.82 Four\n", "4 Sapiens: A Brief History of Humankind 54.23 Five" ], "text/html": [ "\n", "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>title</th><th>price</th><th>rating</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>A Light in the Attic</td><td>51.77</td><td>Three</td></tr>\n", "
<tr><th>1</th><td>Tipping the Velvet</td><td>53.74</td><td>One</td></tr>\n", "
<tr><th>2</th><td>Soumission</td><td>50.10</td><td>One</td></tr>\n", "
<tr><th>3</th><td>Sharp Objects</td><td>47.82</td><td>Four</td></tr>\n", "
<tr><th>4</th><td>Sapiens: A Brief History of Humankind</td><td>54.23</td><td>Five</td></tr>\n", "
</tbody>\n", "</table>\n", "</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_books\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tipping the Velvet\",\n \"Sapiens: A Brief History of Humankind\",\n \"Soumission\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.647672562837028,\n \"min\": 47.82,\n \"max\": 54.23,\n \"num_unique_values\": 5,\n \"samples\": [\n 53.74,\n 54.23,\n 50.1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"One\",\n \"Five\",\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ], "source": [ "display(df_books.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "p-1Pr2szaqLk" }, "source": [ "## **3.** 🧩 Create a meaningful connection between real & synthetic datasets" ] }, { "cell_type": "markdown", "metadata": { "id": "SIaJUGIpaH4V" }, "source": [ "### *a. Initial setup*" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "id": "-gPXGcRPuV_9" }, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "from datetime import datetime\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "random.seed(2025)\n", "np.random.seed(2025)" ] }, { "cell_type": "markdown", "metadata": { "id": "pY4yCoIuaQqp" }, "source": [ "### *b. 
Generate popularity scores based on rating (with some randomness) with a generate_popularity_score function*" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "mnd5hdAbaNjz" }, "outputs": [], "source": [ "def generate_popularity_score(rating):\n", "    # Map the scraped rating word to a base score (unknown ratings default to 3)\n", "    base = {\"One\": 2, \"Two\": 3, \"Three\": 3, \"Four\": 4, \"Five\": 4}.get(rating, 3)\n", "    # Random trend of -1, 0, or +1, with \"no change\" the most likely outcome\n", "    trend_factor = random.choices([-1, 0, 1], weights=[1, 3, 2])[0]\n", "    # Keep the final score within the 1-5 range\n", "    return int(np.clip(base + trend_factor, 1, 5))" ] }, { "cell_type": "markdown", "metadata": { "id": "n4-TaNTFgPak" }, "source": [ "### *c. ✋🏻🛑⛔️ Run the function to create a \"popularity_score\" column from \"rating\"*" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "id": "V-G3OCUCgR07", "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "outputId": "ea986b4e-c466-421d-d691-13c726ba76bf" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title price rating popularity_score\n", "0 A Light in the Attic 51.77 Three 3\n", "1 Tipping the Velvet 53.74 One 2\n", "2 Soumission 50.10 One 2\n", "3 Sharp Objects 47.82 Four 4\n", "4 Sapiens: A Brief History of Humankind 54.23 Five 3" ], "text/html": [ "\n", "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>title</th><th>price</th><th>rating</th><th>popularity_score</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>A Light in the Attic</td><td>51.77</td><td>Three</td><td>3</td></tr>\n", "
<tr><th>1</th><td>Tipping the Velvet</td><td>53.74</td><td>One</td><td>2</td></tr>\n", "
<tr><th>2</th><td>Soumission</td><td>50.10</td><td>One</td><td>2</td></tr>\n", "
<tr><th>3</th><td>Sharp Objects</td><td>47.82</td><td>Four</td><td>4</td></tr>\n", "
<tr><th>4</th><td>Sapiens: A Brief History of Humankind</td><td>54.23</td><td>Five</td><td>3</td></tr>\n", "
</tbody>\n", "</table>\n", "</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_books\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tipping the Velvet\",\n \"Sapiens: A Brief History of Humankind\",\n \"Soumission\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.647672562837028,\n \"min\": 47.82,\n \"max\": 54.23,\n \"num_unique_values\": 5,\n \"samples\": [\n 53.74,\n 54.23,\n 50.1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"One\",\n \"Five\",\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 2,\n \"max\": 4,\n \"num_unique_values\": 3,\n \"samples\": [\n 3,\n 2,\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ], "source": [ "df_books[\"popularity_score\"] = df_books[\"rating\"].apply(generate_popularity_score)\n", "display(df_books.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "HnngRNTgacYt" }, "source": [ "### *d. Decide on the sentiment_label based on the popularity score with a get_sentiment function*" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "kUtWmr8maZLZ" }, "outputs": [], "source": [ "def get_sentiment(popularity_score):\n", "    if popularity_score <= 2:\n", "        return \"negative\"\n", "    elif popularity_score == 3:\n", "        return \"neutral\"\n", "    else:\n", "        return \"positive\"" ] }, { "cell_type": "markdown", "metadata": { "id": "HF9F9HIzgT7Z" }, "source": [ "### *e. 
✋🏻🛑⛔️ Run the function to create a \"sentiment_label\" column from \"popularity_score\"*" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "tafQj8_7gYCG", "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "outputId": "c9f69ec2-4fc9-4229-8d7e-a74b212123ca" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title price rating popularity_score \\\n", "0 A Light in the Attic 51.77 Three 3 \n", "1 Tipping the Velvet 53.74 One 2 \n", "2 Soumission 50.10 One 2 \n", "3 Sharp Objects 47.82 Four 4 \n", "4 Sapiens: A Brief History of Humankind 54.23 Five 3 \n", "\n", " sentiment_label \n", "0 neutral \n", "1 negative \n", "2 negative \n", "3 positive \n", "4 neutral " ], "text/html": [ "\n", "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>title</th><th>price</th><th>rating</th><th>popularity_score</th><th>sentiment_label</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>A Light in the Attic</td><td>51.77</td><td>Three</td><td>3</td><td>neutral</td></tr>\n", "
<tr><th>1</th><td>Tipping the Velvet</td><td>53.74</td><td>One</td><td>2</td><td>negative</td></tr>\n", "
<tr><th>2</th><td>Soumission</td><td>50.10</td><td>One</td><td>2</td><td>negative</td></tr>\n", "
<tr><th>3</th><td>Sharp Objects</td><td>47.82</td><td>Four</td><td>4</td><td>positive</td></tr>\n", "
<tr><th>4</th><td>Sapiens: A Brief History of Humankind</td><td>54.23</td><td>Five</td><td>3</td><td>neutral</td></tr>\n", "
</tbody>\n", "</table>\n", "</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_books\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tipping the Velvet\",\n \"Sapiens: A Brief History of Humankind\",\n \"Soumission\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.647672562837028,\n \"min\": 47.82,\n \"max\": 54.23,\n \"num_unique_values\": 5,\n \"samples\": [\n 53.74,\n 54.23,\n 50.1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"One\",\n \"Five\",\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 2,\n \"max\": 4,\n \"num_unique_values\": 3,\n \"samples\": [\n 3,\n 2,\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_label\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"neutral\",\n \"negative\",\n \"positive\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ], "source": [ "df_books[\"sentiment_label\"] = df_books[\"popularity_score\"].apply(get_sentiment)\n", "display(df_books.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "T8AdKkmASq9a" }, "source": [ "## **4.** 📈 Generate synthetic book sales data of 18 months" ] }, { "cell_type": "markdown", "metadata": { "id": "OhXbdGD5fH0c" }, "source": [ "### *a. 
Create a generate_sales_profile function that generates monthly sales patterns based on sentiment_label (with some randomness)*" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "qkVhYPXGbgEn" }, "outputs": [], "source": [ "def generate_sales_profile(sentiment):\n", "    # 18 month-end dates ending today (\"ME\" replaces the deprecated \"M\" alias in pandas >= 2.2)\n", "    months = pd.date_range(end=datetime.today(), periods=18, freq=\"ME\")\n", "\n", "    if sentiment == \"positive\":\n", "        base = random.randint(200, 300)\n", "        trend = np.linspace(base, base + random.randint(20, 60), len(months))\n", "    elif sentiment == \"negative\":\n", "        base = random.randint(20, 80)\n", "        trend = np.linspace(base, base - random.randint(10, 30), len(months))\n", "    else:  # neutral\n", "        base = random.randint(80, 160)\n", "        trend = np.full(len(months), base + random.randint(-10, 10))\n", "\n", "    # Add a sinusoidal seasonal component and Gaussian noise, then floor at zero\n", "    seasonality = 10 * np.sin(np.linspace(0, 3 * np.pi, len(months)))\n", "    noise = np.random.normal(0, 5, len(months))\n", "    monthly_sales = np.clip(trend + seasonality + noise, a_min=0, a_max=None).astype(int)\n", "\n", "    return list(zip(months.strftime(\"%Y-%m\"), monthly_sales))" ] }, { "cell_type": "markdown", "metadata": { "id": "L2ak1HlcgoTe" }, "source": [ "### *b. Run the function as part of building sales_data*" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "id": "SlJ24AUafoDB" }, "outputs": [], "source": [ "sales_data = []\n", "for _, row in df_books.iterrows():\n", "    records = generate_sales_profile(row[\"sentiment_label\"])\n", "    for month, units in records:\n", "        sales_data.append({\n", "            \"title\": row[\"title\"],\n", "            \"month\": month,\n", "            \"units_sold\": units,\n", "            \"sentiment_label\": row[\"sentiment_label\"]\n", "        })" ] }, { "cell_type": "markdown", "metadata": { "id": "4IXZKcCSgxnq" }, "source": [ "### *c. 
✋🏻🛑⛔️ Create a df_sales DataFrame from sales_data*" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "id": "wcN6gtiZg-ws", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "ab1669ce-3405-457e-b87a-1c4e8afe8c04" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title month units_sold sentiment_label\n", "0 A Light in the Attic 2024-09 100 neutral\n", "1 A Light in the Attic 2024-10 109 neutral\n", "2 A Light in the Attic 2024-11 102 neutral\n", "3 A Light in the Attic 2024-12 107 neutral\n", "4 A Light in the Attic 2025-01 108 neutral" ], "text/html": [ "\n", "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr><th></th><th>title</th><th>month</th><th>units_sold</th><th>sentiment_label</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>0</th><td>A Light in the Attic</td><td>2024-09</td><td>100</td><td>neutral</td></tr>\n", "
<tr><th>1</th><td>A Light in the Attic</td><td>2024-10</td><td>109</td><td>neutral</td></tr>\n", "
<tr><th>2</th><td>A Light in the Attic</td><td>2024-11</td><td>102</td><td>neutral</td></tr>\n", "
<tr><th>3</th><td>A Light in the Attic</td><td>2024-12</td><td>107</td><td>neutral</td></tr>\n", "
<tr><th>4</th><td>A Light in the Attic</td><td>2025-01</td><td>108</td><td>neutral</td></tr>\n", "
</tbody>\n", "</table>\n", "</div>
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_sales\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"A Light in the Attic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"month\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"2024-10\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"units_sold\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3,\n \"min\": 100,\n \"max\": 109,\n \"num_unique_values\": 5,\n \"samples\": [\n 109\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"neutral\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ], "source": [ "df_sales = pd.DataFrame(sales_data)\n", "display(df_sales.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "EhIjz9WohAmZ" }, "source": [ "### *d. 
Save df_sales as synthetic_sales_data.csv & view first few lines*" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MzbZvLcAhGaH", "outputId": "2938cf8b-322c-4c62-cc4e-a7c362929088" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " title month units_sold sentiment_label\n", "0 A Light in the Attic 2024-09 100 neutral\n", "1 A Light in the Attic 2024-10 109 neutral\n", "2 A Light in the Attic 2024-11 102 neutral\n", "3 A Light in the Attic 2024-12 107 neutral\n", "4 A Light in the Attic 2025-01 108 neutral\n" ] } ], "source": [ "df_sales.to_csv(\"synthetic_sales_data.csv\", index=False)\n", "\n", "print(df_sales.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "7g9gqBgQMtJn" }, "source": [ "## **5.** 🎯 Generate synthetic customer reviews" ] }, { "cell_type": "markdown", "metadata": { "id": "Gi4y9M9KuDWx" }, "source": [ "### *a. ✋🏻🛑⛔️ Ask ChatGPT to create a list of 50 distinct generic book review texts for the sentiment labels \"positive\", \"neutral\", and \"negative\" called synthetic_reviews_by_sentiment*" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "id": "b3cd2a50" }, "outputs": [], "source": [ "synthetic_reviews_by_sentiment = {\n", " \"positive\": [\n", " \"A compelling and heartwarming read that stayed with me long after I finished.\",\n", " \"Brilliantly written! The characters were unforgettable and the plot was engaging.\",\n", " \"One of the best books I've read this year — inspiring and emotionally rich.\",\n", " \"Absolutely captivating, I couldn't put it down. A must-read for everyone.\",\n", " \"A masterpiece of storytelling, with deeply developed characters and a thought-provoking plot.\",\n", " \"Highly recommended! 
This book exceeded all my expectations and left a lasting impression.\",\n", " \"Such a beautiful and profound story, I found myself thinking about it for days.\",\n", " \"The writing style is exquisite, making every page a joy to read. Truly a gem.\",\n", " \"An incredibly moving and powerful narrative that will touch your heart.\",\n", " \"From start to finish, this book was a delightful journey. Pure literary brilliance.\",\n", " \"I was completely engrossed. The author's voice is unique and compelling.\",\n", " \"A triumphant and inspiring tale that reminds us of the best of humanity.\",\n", " \"Simply fantastic! A book that will make you laugh, cry, and ponder.\",\n", " \"Every word felt perfectly placed. A truly immersive and unforgettable experience.\",\n", " \"This book is a breath of fresh air. So original and beautifully executed.\",\n", " \"I loved the intricate plot and the surprising twists. Kept me on the edge of my seat.\",\n", " \"A profoundly insightful and brilliantly crafted work. An instant classic.\",\n", " \"The characters felt so real, I felt like I knew them personally. Wonderful.\",\n", " \"A captivating debut that promises a bright future for this author. Incredible.\",\n", " \"This book is a treasure! I'll be recommending it to everyone I know.\",\n", " \"Engaging, eloquent, and absolutely enthralling from cover to cover.\",\n", " \"A truly exceptional book that deserves all the accolades it receives.\",\n", " \"The narrative flow is flawless, drawing you into its world effortlessly.\",\n", " \"I couldn't have asked for a better read. It was everything I hoped for and more.\",\n", " \"This author has a gift for storytelling. A truly magical reading experience.\",\n", " \"A heartwarming story that is both charming and profound. Highly enjoyable.\",\n", " \"The prose is stunning, painting vivid pictures with every sentence. Artful.\",\n", " \"An absolute page-turner! 
I devoured it in one sitting and loved every moment.\",\n", " \"This book is an emotional rollercoaster in the best possible way. Superb.\",\n", " \"Full of wisdom and wit, this book is a delight for both mind and spirit.\",\n", " \"A thought-provoking read that challenged my perceptions. So well done.\",\n", " \"The world-building is spectacular, creating a rich and believable setting.\",\n", " \"I felt a deep connection to the story and its characters. Very impactful.\",\n", " \"A shining example of contemporary literature. An absolute must-read.\",\n", " \"This book has it all: intrigue, emotion, and masterful writing. Perfect.\",\n", " \"It left me feeling uplifted and hopeful. A truly inspiring piece of work.\",\n", " \"The plot twists were ingenious, keeping me guessing until the very end.\",\n", " \"A remarkable achievement in storytelling. I'm already looking forward to more.\",\n", " \"This is the kind of book you want to reread immediately after finishing.\",\n", " \"The depth of emotion conveyed in these pages is simply astounding.\",\n", " \"A perfect blend of adventure and introspection. Absolutely brilliant.\",\n", " \"This book captures the imagination and doesn't let go. Unforgettable.\",\n", " \"I was enchanted from the first paragraph. A truly magical reading.\",\n", " \"The characters' journeys resonated deeply with me. A beautiful narrative.\",\n", " \"An expertly crafted story that will stay with you for a long time. Powerful.\",\n", " \"Filled with insightful observations and memorable scenes. A fantastic read.\",\n", " \"This book is pure escapism, but with substance. Loved every minute.\",\n", " \"The author's voice is so distinct and engaging. A real literary treat.\",\n", " \"A compelling narrative that explores complex themes with grace and skill.\",\n", " \"Finished this with a huge smile on my face. 
What a wonderful book!\"\n", " ],\n", " \"neutral\": [\n", " \"An average book — not great, but not bad either.\",\n", " \"Some parts really stood out, others felt a bit flat.\",\n", " \"It was okay overall. A decent way to pass the time.\",\n", " \"I had mixed feelings about this one. It didn't fully engage me.\",\n", " \"The story was predictable, but the writing was decent enough.\",\n", " \"Neither impressed nor disappointed. It was just... fine.\",\n", " \"I finished it, but I don't think I'll remember it much.\",\n", " \"It had its moments, but lacked a consistent flow to truly shine.\",\n", " \"A serviceable read, though it didn't leave a strong impression.\",\n", " \"The concept was interesting, but the execution was only adequate.\",\n", " \"I wouldn't go out of my way to recommend it, but it wasn't terrible.\",\n", " \"It was a quick read, but not particularly memorable or impactful.\",\n", " \"The plot was a bit slow in places, picking up towards the end.\",\n", " \"I can see why some people might like it, but it wasn't for me.\",\n", " \"It felt like a standard genre piece, without much to distinguish it.\",\n", " \"The writing was clear, but the story lacked a certain spark.\",\n", " \"I found it somewhat bland, despite a few engaging scenes.\",\n", " \"It didn't grab my attention, but I didn't actively dislike it either.\",\n", " \"A middle-of-the-road experience. 
Nothing groundbreaking here.\",\n", " \"The characters were alright, but I didn't deeply connect with any of them.\",\n", " \"It served its purpose as a distraction, but nothing more profound.\",\n", " \"I had higher hopes, but it turned out to be merely satisfactory.\",\n", " \"The premise was strong, but the story didn't quite live up to it.\",\n", " \"It was an easy read, but ultimately forgettable in the grand scheme.\",\n", " \"A few interesting ideas, but they weren't fully developed.\",\n", " \"This book was just alright; it didn't captivate or offend.\",\n", " \"The narrative felt a bit disjointed at times, affecting the flow.\",\n", " \"It passed the time, but didn't inspire any strong feelings.\",\n", " \"I found it neither compelling nor particularly dull. Perfectly average.\",\n", " \"The world-building was okay, but nothing that truly stood out.\",\n", " \"I read it, and that's about all I can say. No strong opinions.\",\n", " \"The pacing was uneven, with some sections dragging more than others.\",\n", " \"It's a book that exists. It tells a story. That's it.\",\n", " \"I'm not sure what to make of it. It had some good points and some weaker ones.\",\n", " \"A rather generic offering in its genre. Not bad, but not exceptional.\",\n", " \"The writing style was fine, but the story didn't resonate with me.\",\n", " \"It was a decent effort, but fell short of being truly engaging.\",\n", " \"I wouldn't reread it, but I don't regret reading it either.\",\n", " \"The ending was neither satisfying nor disappointing. Just an ending.\",\n", " \"It lacked a certain emotional depth to make it truly impactful.\",\n", " \"A straightforward read. No surprises, good or bad.\",\n", " \"Some elements were interesting, but the whole didn't quite come together.\",\n", " \"It's the kind of book you pick up when you have nothing else.\",\n", " \"The story had potential, but it wasn't fully realized in the end.\",\n", " \"I didn't love it, but I didn't hate it. 
It just was.\",\n", " \"The characters felt a bit two-dimensional, preventing full immersion.\",\n", " \"A decent enough plot, but it never really soared.\",\n", " \"I appreciate the effort, but it didn't quite hit the mark for me.\",\n", " \"It was a perfectly acceptable book. Nothing to write home about.\",\n", " \"My feelings about this book are decidedly ambivalent.\"\n", " ],\n", " \"negative\": [\n", " \"I struggled to get through this one — it just didn’t grab me.\",\n", " \"The plot was confusing and the characters felt underdeveloped.\",\n", " \"Disappointing. I had high hopes, but they weren't met.\",\n", " \"A complete waste of time. I couldn't connect with anything in it.\",\n", " \"Poorly written and incredibly dull. I gave up halfway through.\",\n", " \"This book was a chore to read. I found it utterly frustrating.\",\n", " \"I regret starting this. The story was nonsensical and badly executed.\",\n", " \"The characters were irritating, and the plot went nowhere fast.\",\n", " \"An agonizingly slow read with no redeeming qualities. Avoid at all costs.\",\n", " \"This is one of the worst books I've ever attempted to read. Terrible.\",\n", " \"I found it incredibly boring and predictable. A struggle to finish.\",\n", " \"The writing was clunky, and the story felt forced and unoriginal.\",\n", " \"Nothing about this book worked for me. A truly awful experience.\",\n", " \"I'm genuinely surprised this got published. It's a mess.\",\n", " \"A frustrating read from beginning to end. Just pure disappointment.\",\n", " \"The plot holes were enormous, and the dialogue was unbelievably bad.\",\n", " \"I couldn't suspend my disbelief for a second. Utterly ridiculous.\",\n", " \"This book is a prime example of how not to write a story.\",\n", " \"My brain cells collectively died while reading this. So bad.\",\n", " \"The author clearly had no idea what they were doing. A disaster.\",\n", " \"I wish I could unread this. 
It was a painful and unrewarding experience.\",\n", " \"Absolutely dreadful. I've never been so bored by a book.\",\n", " \"The narrative was a confusing jumble, and the ending made no sense.\",\n", " \"I picked it up hoping for a good time, but it delivered only misery.\",\n", " \"Save your money and your time. This book is not worth it.\",\n", " \"A truly uninspired and poorly conceived piece of fiction.\",\n", " \"The pacing was glacial, and the plot never managed to engage me.\",\n", " \"I felt like I was being punished while reading this book. Horrible.\",\n", " \"There's nothing positive I can say about this. A total flop.\",\n", " \"The premise was intriguing, but the execution was spectacularly bad.\",\n", " \"This book was a literary car crash. I couldn't look away, but only out of horror.\",\n", " \"I finished it, but I feel no sense of accomplishment, only relief.\",\n", " \"The author's voice was grating, and the story was utterly unconvincing.\",\n", " \"It reads like a first draft that desperately needed more editing.\",\n", " \"I kept waiting for it to get better, but it only got worse.\",\n", " \"This book is a prime candidate for a 'worst books ever' list.\",\n", " \"The characters were so poorly developed, they felt like cardboard cutouts.\",\n", " \"A bewildering and frustrating read that left me completely cold.\",\n", " \"I can't even find words for how disappointing this was. Just no.\",\n", " \"The story was so contrived, it felt insulting to my intelligence.\",\n", " \"This book managed to offend me on multiple levels. Not recommended.\",\n", " \"It's clear the author didn't care about their readers. What a mess.\",\n", " \"I'm angry I spent time on this. It was truly a terrible book.\",\n", " \"The writing was so amateurish, it pulled me out of the story constantly.\",\n", " \"I've read better fanfiction. 
This is just inexcusable.\",\n", " \"The plot was a nonsensical maze with no coherent direction.\",\n", " \"This book will make you question your life choices. Seriously bad.\",\n", " \"I wouldn't wish this book on my worst enemy. It's that bad.\",\n", " \"A monument to literary ineptitude. Absolutely nothing works.\",\n", " \"Reading this felt like wading through mud. Exhausting and pointless.\"\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "fQhfVaDmuULT" }, "source": [ "### *b. Generate 10 reviews per book using random sampling from the corresponding 50*" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "l2SRc3PjuTGM" }, "outputs": [], "source": [ "review_rows = []\n", "for _, row in df_books.iterrows():\n", " title = row['title']\n", " sentiment_label = row['sentiment_label']\n", " review_pool = synthetic_reviews_by_sentiment[sentiment_label]\n", " sampled_reviews = random.sample(review_pool, 10)\n", " for review_text in sampled_reviews:\n", " review_rows.append({\n", " \"title\": title,\n", " \"sentiment_label\": sentiment_label,\n", " \"review_text\": review_text,\n", " \"rating\": row['rating'],\n", " \"popularity_score\": row['popularity_score']\n", " })" ] }, { "cell_type": "markdown", "metadata": { "id": "bmJMXF-Bukdm" }, "source": [ "### *c. Create the final dataframe df_reviews & save it as synthetic_book_reviews.csv*" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "ZUKUqZsuumsp" }, "outputs": [], "source": [ "df_reviews = pd.DataFrame(review_rows)\n", "df_reviews.to_csv(\"synthetic_book_reviews.csv\", index=False)" ] }, { "cell_type": "markdown", "source": [ "### *c. 
inputs for R*" ], "metadata": { "id": "_602pYUS3gY5" } }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3946e521", "outputId": "82baff7a-b67e-4cd9-9bf3-6c9abaf9a333" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "✅ Wrote synthetic_title_level_features.csv\n", "✅ Wrote synthetic_monthly_revenue_series.csv\n" ] } ], "source": [ "import numpy as np\n", "\n", "def _safe_num(s):\n", " return pd.to_numeric(\n", " pd.Series(s).astype(str).str.replace(r\"[^0-9.]\", \"\", regex=True),\n", " errors=\"coerce\"\n", " )\n", "\n", "# Ratings are stored as words (\"One\" .. \"Five\"); _safe_num alone would turn them all into NaN\n", "_RATING_WORDS = {\"one\": 1, \"two\": 2, \"three\": 3, \"four\": 4, \"five\": 5}\n", "\n", "def _rating_num(s):\n", "    s = pd.Series(s)\n", "    mapped = s.astype(str).str.strip().str.lower().map(_RATING_WORDS)\n", "    # fall back to digit parsing for ratings that are already numeric\n", "    return mapped.fillna(_safe_num(s))\n", "\n", "# --- Clean book metadata (price/rating) ---\n", "df_books_r = df_books.copy()\n", "if \"price\" in df_books_r.columns:\n", " df_books_r[\"price\"] = _safe_num(df_books_r[\"price\"])\n", "if \"rating\" in df_books_r.columns:\n", " df_books_r[\"rating\"] = _rating_num(df_books_r[\"rating\"])\n", "\n", "df_books_r[\"title\"] = df_books_r[\"title\"].astype(str).str.strip()\n", "\n", "# --- Clean sales ---\n", "df_sales_r = df_sales.copy()\n", "df_sales_r[\"title\"] = df_sales_r[\"title\"].astype(str).str.strip()\n", "df_sales_r[\"month\"] = pd.to_datetime(df_sales_r[\"month\"], errors=\"coerce\")\n", "df_sales_r[\"units_sold\"] = _safe_num(df_sales_r[\"units_sold\"])\n", "\n", "# --- Clean reviews ---\n", "df_reviews_r = df_reviews.copy()\n", "df_reviews_r[\"title\"] = df_reviews_r[\"title\"].astype(str).str.strip()\n", "df_reviews_r[\"sentiment_label\"] = df_reviews_r[\"sentiment_label\"].astype(str).str.lower().str.strip()\n", "if \"rating\" in df_reviews_r.columns:\n", " df_reviews_r[\"rating\"] = _rating_num(df_reviews_r[\"rating\"])\n", "if \"popularity_score\" in df_reviews_r.columns:\n", " df_reviews_r[\"popularity_score\"] = _safe_num(df_reviews_r[\"popularity_score\"])\n", "\n", "# --- Sentiment shares per title (from reviews) ---\n", "sent_counts = (\n", " df_reviews_r.groupby([\"title\", \"sentiment_label\"])\n", " .size()\n", "
.unstack(fill_value=0)\n", ")\n", "for lab in [\"positive\", \"neutral\", \"negative\"]:\n", " if lab not in sent_counts.columns:\n", " sent_counts[lab] = 0\n", "\n", "sent_counts[\"total_reviews\"] = sent_counts[[\"positive\", \"neutral\", \"negative\"]].sum(axis=1)\n", "den = sent_counts[\"total_reviews\"].replace(0, np.nan)\n", "sent_counts[\"share_positive\"] = sent_counts[\"positive\"] / den\n", "sent_counts[\"share_neutral\"] = sent_counts[\"neutral\"] / den\n", "sent_counts[\"share_negative\"] = sent_counts[\"negative\"] / den\n", "sent_counts = sent_counts.reset_index()\n", "\n", "# --- Sales aggregation per title ---\n", "sales_by_title = (\n", " df_sales_r.dropna(subset=[\"title\"])\n", " .groupby(\"title\", as_index=False)\n", " .agg(\n", " months_observed=(\"month\", \"nunique\"),\n", " avg_units_sold=(\"units_sold\", \"mean\"),\n", " total_units_sold=(\"units_sold\", \"sum\"),\n", " )\n", ")\n", "\n", "# --- Title-level features (join sales + books + sentiment) ---\n", "df_title = (\n", " sales_by_title\n", " .merge(df_books_r[[\"title\", \"price\", \"rating\"]], on=\"title\", how=\"left\")\n", " .merge(sent_counts[[\"title\", \"share_positive\", \"share_neutral\", \"share_negative\", \"total_reviews\"]],\n", " on=\"title\", how=\"left\")\n", ")\n", "\n", "df_title[\"avg_revenue\"] = df_title[\"avg_units_sold\"] * df_title[\"price\"]\n", "df_title[\"total_revenue\"] = df_title[\"total_units_sold\"] * df_title[\"price\"]\n", "\n", "df_title.to_csv(\"synthetic_title_level_features.csv\", index=False)\n", "print(\"✅ Wrote synthetic_title_level_features.csv\")\n", "\n", "# --- Monthly revenue series (proxy: units_sold * price) ---\n", "monthly_rev = (\n", " df_sales_r.merge(df_books_r[[\"title\", \"price\"]], on=\"title\", how=\"left\")\n", ")\n", "monthly_rev[\"revenue\"] = monthly_rev[\"units_sold\"] * monthly_rev[\"price\"]\n", "\n", "df_monthly = (\n", " monthly_rev.dropna(subset=[\"month\"])\n", " .groupby(\"month\", as_index=False)[\"revenue\"]\n", " 
.sum()\n", " .rename(columns={\"revenue\": \"total_revenue\"})\n", " .sort_values(\"month\")\n", ")\n", "# if revenue is all NA (e.g., missing price), fall back to units_sold as a teaching proxy;\n", "# check the row-level revenue, because the grouped sum turns all-NA months into 0\n", "if monthly_rev[\"revenue\"].notna().sum() == 0:\n", " df_monthly = (\n", " df_sales_r.dropna(subset=[\"month\"])\n", " .groupby(\"month\", as_index=False)[\"units_sold\"]\n", " .sum()\n", " .rename(columns={\"units_sold\": \"total_revenue\"})\n", " .sort_values(\"month\")\n", " )\n", "\n", "df_monthly[\"month\"] = pd.to_datetime(df_monthly[\"month\"], errors=\"coerce\").dt.strftime(\"%Y-%m-%d\")\n", "df_monthly.to_csv(\"synthetic_monthly_revenue_series.csv\", index=False)\n", "print(\"✅ Wrote synthetic_monthly_revenue_series.csv\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "RYvGyVfXuo54" }, "source": [ "### *d. ✋🏻🛑⛔️ View the first few lines*" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "xfE8NMqOurKo", "outputId": "f74caa84-9ad3-4775-9b70-764af50bc410" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title sentiment_label \\\n", "0 A Light in the Attic neutral \n", "1 A Light in the Attic neutral \n", "2 A Light in the Attic neutral \n", "3 A Light in the Attic neutral \n", "4 A Light in the Attic neutral \n", "\n", " review_text rating popularity_score \n", "0 A decent enough plot, but it never really soared. Three 3 \n", "1 The premise was strong, but the story didn't q... Three 3 \n", "2 A rather generic offering in its genre. Not ba... Three 3 \n", "3 A serviceable read, though it didn't leave a s... Three 3 \n", "4 I didn't love it, but I didn't hate it. It jus... Three 3 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlesentiment_labelreview_textratingpopularity_score
0A Light in the AtticneutralA decent enough plot, but it never really soared.Three3
1A Light in the AtticneutralThe premise was strong, but the story didn't q...Three3
2A Light in the AtticneutralA rather generic offering in its genre. Not ba...Three3
3A Light in the AtticneutralA serviceable read, though it didn't leave a s...Three3
4A Light in the AtticneutralI didn't love it, but I didn't hate it. It jus...Three3
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_reviews\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"A Light in the Attic\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_label\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"neutral\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review_text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"The premise was strong, but the story didn't quite live up to it.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 3,\n \"max\": 3,\n \"num_unique_values\": 1,\n \"samples\": [\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ], "source": [ "display(df_reviews.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "465624d2" }, "source": [ "# Task\n", "Apply the `generate_popularity_score` function to the 'rating' column of `df_books` to create a new 'popularity_score' column and then display the first few rows of the updated `df_books` DataFrame." 
] }, { "cell_type": "markdown", "metadata": { "id": "97b3b98b" }, "source": [ "## Create Popularity Score\n", "\n", "### Subtask:\n", "Apply the `generate_popularity_score` function to the 'rating' column of `df_books` to create a new 'popularity_score' column.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ca8b20fa" }, "source": [ "**Reasoning**:\n", "To create the 'popularity_score' column, I will apply the `generate_popularity_score` function to the 'rating' column of the `df_books` DataFrame.\n", "\n" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "73ea4f45", "outputId": "cd4da5db-1a16-464b-bb14-ba1756ed9242" }, "source": [ "df_books[\"popularity_score\"] = df_books[\"rating\"].apply(generate_popularity_score)\n", "print(df_books.head())" ], "execution_count": 49, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " title price rating popularity_score \\\n", "0 A Light in the Attic 51.77 Three 3 \n", "1 Tipping the Velvet 53.74 One 2 \n", "2 Soumission 50.10 One 2 \n", "3 Sharp Objects 47.82 Four 4 \n", "4 Sapiens: A Brief History of Humankind 54.23 Five 5 \n", "\n", " sentiment_label \n", "0 neutral \n", "1 negative \n", "2 negative \n", "3 positive \n", "4 neutral \n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "62839b87" }, "source": [ "## Final Task\n", "\n", "### Subtask:\n", "Confirm that the 'popularity_score' column has been successfully added to the `df_books` DataFrame.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "718d3b65" }, "source": [ "## Summary:\n", "\n", "### Data Analysis Key Findings\n", "* A new column named 'popularity\\_score' was successfully added to the `df_books` DataFrame by applying the `generate_popularity_score` function to the existing 'rating' column.\n", "* The `popularity_score` column contains integer values, which are numerical representations derived from the textual ratings (e.g., 'Three' converted to 3, 'One' to 2, 'Four' 
to 4, 'Five' to 5).\n", "* The first few rows of the `df_books` DataFrame confirm the successful creation and population of the 'popularity\\_score' column.\n", "\n", "### Insights or Next Steps\n", "* The newly created 'popularity\\_score' column provides a quantifiable metric for book popularity, enabling numerical analysis, sorting, and filtering operations that were not possible with textual ratings.\n", "* It would be beneficial to review the complete mapping logic of the `generate_popularity_score` function to ensure all possible textual ratings are handled correctly and consistently.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "98784b84" }, "source": [ "# Task\n", "Apply the `get_sentiment` function to the 'popularity_score' column of `df_books` to create a new 'sentiment_label' column, and display the first few rows of the updated `df_books` DataFrame." ] }, { "cell_type": "markdown", "metadata": { "id": "cb4c5e76" }, "source": [ "## Create Sentiment Label\n", "\n", "### Subtask:\n", "Apply the `get_sentiment` function to the 'popularity_score' column of `df_books` to create a new 'sentiment_label' column.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "67c77f2b" }, "source": [ "**Reasoning**:\n", "Apply the `get_sentiment` function to the 'popularity_score' column of `df_books` to create a new 'sentiment_label' column, then display the first few rows.\n", "\n" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "c547bb7d", "outputId": "70719a1d-682e-4eca-946e-2d98c952d53c" }, "source": [ "df_books[\"sentiment_label\"] = df_books[\"popularity_score\"].apply(get_sentiment)\n", "display(df_books.head())" ], "execution_count": 50, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " title price rating popularity_score \\\n", "0 A Light in the Attic 51.77 Three 3 \n", "1 Tipping the Velvet 53.74 One 2 \n", "2 Soumission 50.10 One 2 \n", "3 Sharp Objects 47.82 
Four 4 \n", "4 Sapiens: A Brief History of Humankind 54.23 Five 5 \n", "\n", " sentiment_label \n", "0 neutral \n", "1 negative \n", "2 negative \n", "3 positive \n", "4 positive " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titlepriceratingpopularity_scoresentiment_label
0A Light in the Attic51.77Three3neutral
1Tipping the Velvet53.74One2negative
2Soumission50.10One2negative
3Sharp Objects47.82Four4positive
4Sapiens: A Brief History of Humankind54.23Five5positive
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"display(df_books\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Tipping the Velvet\",\n \"Sapiens: A Brief History of Humankind\",\n \"Soumission\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.647672562837028,\n \"min\": 47.82,\n \"max\": 54.23,\n \"num_unique_values\": 5,\n \"samples\": [\n 53.74,\n 54.23,\n 50.1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"One\",\n \"Five\",\n \"Three\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 2,\n \"max\": 5,\n \"num_unique_values\": 4,\n \"samples\": [\n 2,\n 5,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sentiment_label\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"neutral\",\n \"negative\",\n \"positive\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {} } ] }, { "cell_type": "markdown", "metadata": { "id": "11f2522b" }, "source": [ "## Final Task\n", "\n", "### Subtask:\n", "Confirm that the 'sentiment_label' column has been added to the `df_books` DataFrame.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "11b1cc77" }, "source": [ "## Summary:\n", "\n", "### Data Analysis Key Findings\n", "* A new column, `sentiment_label`, was successfully added to the `df_books` DataFrame.\n", "* This new `sentiment_label` column was derived by applying the `get_sentiment` function to the existing 
`popularity_score` column.\n", "* The sentiment labels observed include 'neutral' for a `popularity_score` of 3, 'negative' for a `popularity_score` of 2, and 'positive' for a `popularity_score` of 4, demonstrating the correct application of the function.\n", "\n", "### Insights or Next Steps\n", "* The `df_books` DataFrame is now augmented with sentiment data, providing a categorical representation of book popularity that can be used for descriptive analysis or as a feature in predictive models.\n", "* The newly created `sentiment_label` column can be utilized to analyze the distribution of book sentiments, identify trends, or compare book categories based on their sentiment.\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "jpASMyIQMaAq", "lquNYCbfL9IM", "0IWuNpxxYDJF", "oCdTsin2Yfp3", "T0TOeRC4Yrnn", "duI5dv3CZYvF", "qMjRKMBQZlJi", "p-1Pr2szaqLk", "SIaJUGIpaH4V", "pY4yCoIuaQqp", "n4-TaNTFgPak", "HnngRNTgacYt", "HF9F9HIzgT7Z", "T8AdKkmASq9a", "OhXbdGD5fH0c", "L2ak1HlcgoTe", "4IXZKcCSgxnq", "EhIjz9WohAmZ", "Gi4y9M9KuDWx", "fQhfVaDmuULT", "bmJMXF-Bukdm", "RYvGyVfXuo54" ], "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }