Spaces:

ESCP
/

GD-5-SE21

Sleeping

App Files Files Community

garancebl commited on Apr 19

Commit

6af89b7

verified ·

1 Parent(s): 0fb1abb

Upload 2 files

Browse files

Files changed (2) hide show

1_Data_Creation_GroupD5.ipynb +1402 -0
2a_Python_Analysis_GroupD5.ipynb +0 -0

1_Data_Creation_GroupD5.ipynb ADDED Viewed

	@@ -0,0 +1,1402 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4ba6aba8"
+      },
+      "source": [
+        "# 🤖 **Data Collection, Creation, Storage, and Processing**\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### N.B. on dataset naming\n",
+        "\n",
+        "This notebook is adapted from the original workshop template, which was based on book data.  \n",
+        "For consistency with the second notebook and to avoid breaking the pipeline, some variable and file names (such as \"df_books\" or \"synthetic_book_reviews.csv\") were kept unchanged.\n",
+        "\n",
+        "However, all the data used in this project relates to hotels.  \n",
+        "For example, \"title\" refers to the hotel name, \"price\" corresponds to a proxy of the hotel ADR, and \"units_sold\" represents booking demand.\n",
+        "\n",
+        "This approach allows us to reuse the structure of the original notebooks while applying it to a different business problem."
+      ],
+      "metadata": {
+        "id": "N_ZPZM4Ugbr2"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We use two datasets for this project.\n",
+        "The first one is a hotel booking dataset with information like price (ADR), cancellations and booking details.\n",
+        "The second one is a hotel review dataset with text reviews and ratings.\n",
+        "\n",
+        "The first dataset is quantitative data and the second one is qualitative data."
+      ],
+      "metadata": {
+        "id": "WzucGkQ4grQm"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jpASMyIQMaAq"
+      },
+      "source": [
+        "## **1.** 📦 Install required packages"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 21,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "f48c8f8c",
+        "outputId": "d267db0b-b091-418f-a5c3-5ada4358f32e"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n",
+            "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n",
+            "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n",
+            "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n",
+            "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n",
+            "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n",
+            "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n",
+            "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n",
+            "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n",
+            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n",
+            "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)\n",
+            "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n",
+            "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n",
+            "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.62.1)\n",
+            "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.5.0)\n",
+            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n",
+            "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n",
+            "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n",
+            "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n",
+            "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.2)\n",
+            "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n",
+            "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n",
+            "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n",
+            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n"
+          ]
+        }
+      ],
+      "source": [
+        "!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "lquNYCbfL9IM"
+      },
+      "source": [
+        "## **2.** 🏨 Load hotel booking and hotel review datasets"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0IWuNpxxYDJF"
+      },
+      "source": [
+        "### *a. Initial setup*\n",
+        "Define the base url of the website you will scrape as well as how and what you will scrape"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "id": "91d52125",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "84a7ad0f-fbd2-45f0-d4ce-495f24390ee4"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "          hotel  is_canceled  lead_time  arrival_date_year arrival_date_month  \\\n",
+            "0  Resort Hotel            0        342               2015               July   \n",
+            "1  Resort Hotel            0        737               2015               July   \n",
+            "2  Resort Hotel            0          7               2015               July   \n",
+            "3  Resort Hotel            0         13               2015               July   \n",
+            "4  Resort Hotel            0         14               2015               July   \n",
+            "\n",
+            "   arrival_date_week_number  arrival_date_day_of_month  \\\n",
+            "0                        27                          1   \n",
+            "1                        27                          1   \n",
+            "2                        27                          1   \n",
+            "3                        27                          1   \n",
+            "4                        27                          1   \n",
+            "\n",
+            "   stays_in_weekend_nights  stays_in_week_nights  adults  ...  deposit_type  \\\n",
+            "0                        0                     0       2  ...    No Deposit   \n",
+            "1                        0                     0       2  ...    No Deposit   \n",
+            "2                        0                     1       1  ...    No Deposit   \n",
+            "3                        0                     1       1  ...    No Deposit   \n",
+            "4                        0                     2       2  ...    No Deposit   \n",
+            "\n",
+            "   agent company days_in_waiting_list customer_type   adr  \\\n",
+            "0    NaN     NaN                    0     Transient   0.0   \n",
+            "1    NaN     NaN                    0     Transient   0.0   \n",
+            "2    NaN     NaN                    0     Transient  75.0   \n",
+            "3  304.0     NaN                    0     Transient  75.0   \n",
+            "4  240.0     NaN                    0     Transient  98.0   \n",
+            "\n",
+            "   required_car_parking_spaces  total_of_special_requests  reservation_status  \\\n",
+            "0                            0                          0           Check-Out   \n",
+            "1                            0                          0           Check-Out   \n",
+            "2                            0                          0           Check-Out   \n",
+            "3                            0                          0           Check-Out   \n",
+            "4                            0                          1           Check-Out   \n",
+            "\n",
+            "  reservation_status_date  \n",
+            "0              2015-07-01  \n",
+            "1              2015-07-01  \n",
+            "2              2015-07-02  \n",
+            "3              2015-07-02  \n",
+            "4              2015-07-03  \n",
+            "\n",
+            "[5 rows x 32 columns]\n",
+            "                 name      city              reviews.date  reviews.rating  \\\n",
+            "0  Hotel Russo Palace  Mableton 2013-09-22 00:00:00+00:00             4.0   \n",
+            "1  Hotel Russo Palace  Mableton 2015-04-03 00:00:00+00:00             5.0   \n",
+            "2  Hotel Russo Palace  Mableton 2014-05-13 00:00:00+00:00             5.0   \n",
+            "3  Hotel Russo Palace  Mableton 2013-10-27 00:00:00+00:00             5.0   \n",
+            "4  Hotel Russo Palace  Mableton 2015-03-05 00:00:00+00:00             5.0   \n",
+            "\n",
+            "                                        reviews.text    hotel_type  \n",
+            "0  Pleasant 10 min walk along the sea front to th...  Resort Hotel  \n",
+            "1  Really lovely hotel. Stayed on the very top fl...  Resort Hotel  \n",
+            "2  Ett mycket bra hotell. Det som drog ner betyge...  Resort Hotel  \n",
+            "3  We stayed here for four nights in October. The...  Resort Hotel  \n",
+            "4  We stayed here for four nights in October. The...  Resort Hotel  \n"
+          ]
+        }
+      ],
+      "source": [
+        "import pandas as pd\n",
+        "import numpy as np\n",
+        "import random\n",
+        "import time\n",
+        "\n",
+        "# Load hotel datasets\n",
+        "bookings = pd.read_csv(\"hotel_bookings.csv\")\n",
+        "reviews_raw = pd.read_csv(\"7282_1.csv\")\n",
+        "\n",
+        "# Keep useful columns from reviews\n",
+        "reviews_raw = reviews_raw[[\n",
+        "    \"name\",\n",
+        "    \"city\",\n",
+        "    \"reviews.date\",\n",
+        "    \"reviews.rating\",\n",
+        "    \"reviews.text\"\n",
+        "]].copy()\n",
+        "\n",
+        "reviews_raw = reviews_raw.dropna(subset=[\"name\", \"reviews.text\", \"reviews.rating\"])\n",
+        "reviews_raw[\"reviews.date\"] = pd.to_datetime(reviews_raw[\"reviews.date\"], errors=\"coerce\")\n",
+        "reviews_raw = reviews_raw.dropna(subset=[\"reviews.date\"])\n",
+        "\n",
+        "# Create hotel type from review dataset\n",
+        "def classify_hotel_type(name, city):\n",
+        "    text = (str(name) + \" \" + str(city)).lower()\n",
+        "    resort_words = [\"resort\", \"spa\", \"beach\", \"island\", \"sea\", \"pool\", \"palace\"]\n",
+        "    for word in resort_words:\n",
+        "        if word in text:\n",
+        "            return \"Resort Hotel\"\n",
+        "    return \"City Hotel\"\n",
+        "\n",
+        "reviews_raw[\"hotel_type\"] = reviews_raw.apply(\n",
+        "    lambda row: classify_hotel_type(row[\"name\"], row[\"city\"]),\n",
+        "    axis=1\n",
+        ")\n",
+        "\n",
+        "# Mean ADR by hotel type from booking dataset\n",
+        "adr_by_type = bookings.groupby(\"hotel\")[\"adr\"].mean().to_dict()\n",
+        "\n",
+        "# Create hotel-level dataset using review data\n",
+        "df_books = (\n",
+        "    reviews_raw.groupby(\"name\", as_index=False)\n",
+        "    .agg(\n",
+        "        rating=(\"reviews.rating\", \"mean\"),\n",
+        "        n_reviews=(\"reviews.text\", \"size\"),\n",
+        "        hotel_type=(\"hotel_type\", lambda x: x.mode().iloc[0] if not x.mode().empty else \"City Hotel\"),\n",
+        "        city=(\"city\", lambda x: x.mode().iloc[0] if not x.mode().empty else \"Unknown\")\n",
+        "    )\n",
+        ")\n",
+        "\n",
+        "# Keep \"title\" column name for compatibility with notebook 2\n",
+        "df_books = df_books.rename(columns={\"name\": \"title\"})\n",
+        "\n",
+        "# Create a hotel price proxy based on real ADR\n",
+        "df_books[\"price\"] = df_books[\"hotel_type\"].map(adr_by_type)\n",
+        "df_books[\"price\"] = df_books[\"price\"] * np.random.uniform(0.85, 1.15, len(df_books))\n",
+        "\n",
+        "print(bookings.head())\n",
+        "print(reviews_raw.head())"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "oCdTsin2Yfp3"
+      },
+      "source": [
+        "### *b. Build a hotel-level dataframe with title, price, and rating*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 23,
+      "metadata": {
+        "id": "xqO5Y3dnYhxt",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "3c6eb1f7-9812-4ce0-bd25-d9827a378de5"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "                            title    rating  n_reviews  hotel_type  \\\n",
+            "0                        1785 Inn  2.625000         16  City Hotel   \n",
+            "1                      1900 House  4.571429         14  City Hotel   \n",
+            "2              40 Berkeley Hostel  3.329193        161  City Hotel   \n",
+            "3  A Bed & Breakfast In Cambridge  3.574074         54  City Hotel   \n",
+            "4                 Acorn Motor Inn  3.750000         20  City Hotel   \n",
+            "\n",
+            "           city       price  \n",
+            "0  North Conway  118.821735  \n",
+            "1  Narragansett  117.644891  \n",
+            "2        Boston  105.233747  \n",
+            "3     Cambridge   93.780093  \n",
+            "4    Oak Harbor  100.787105  \n",
+            "(623, 6)\n"
+          ]
+        }
+      ],
+      "source": [
+        "# The hotel-level dataframe is already created in cell 5\n",
+        "print(df_books.head())\n",
+        "print(df_books.shape)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "T0TOeRC4Yrnn"
+      },
+      "source": [
+        "### *c. ✋🏻🛑⛔️ Use the hotel-level dataframe as df_books with \"title\", \"price\", and \"rating\"*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 24,
+      "metadata": {
+        "id": "l5FkkNhUYTHh",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        },
+        "outputId": "26d2a8d5-97f1-4918-bb96-8034659868c5"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                            title       price    rating  n_reviews  \\\n",
+              "0                        1785 Inn  118.821735  2.625000         16   \n",
+              "1                      1900 House  117.644891  4.571429         14   \n",
+              "2              40 Berkeley Hostel  105.233747  3.329193        161   \n",
+              "3  A Bed & Breakfast In Cambridge   93.780093  3.574074         54   \n",
+              "4                 Acorn Motor Inn  100.787105  3.750000         20   \n",
+              "\n",
+              "   hotel_type          city  \n",
+              "0  City Hotel  North Conway  \n",
+              "1  City Hotel  Narragansett  \n",
+              "2  City Hotel        Boston  \n",
+              "3  City Hotel     Cambridge  \n",
+              "4  City Hotel    Oak Harbor  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5a668a67-89ae-489e-a278-cbe66cf5b6e5\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>title</th>\n",
+              "      <th>price</th>\n",
+              "      <th>rating</th>\n",
+              "      <th>n_reviews</th>\n",
+              "      <th>hotel_type</th>\n",
+              "      <th>city</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1785 Inn</td>\n",
+              "      <td>118.821735</td>\n",
+              "      <td>2.625000</td>\n",
+              "      <td>16</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>North Conway</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>1900 House</td>\n",
+              "      <td>117.644891</td>\n",
+              "      <td>4.571429</td>\n",
+              "      <td>14</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Narragansett</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>40 Berkeley Hostel</td>\n",
+              "      <td>105.233747</td>\n",
+              "      <td>3.329193</td>\n",
+              "      <td>161</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Boston</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>A Bed &amp; Breakfast In Cambridge</td>\n",
+              "      <td>93.780093</td>\n",
+              "      <td>3.574074</td>\n",
+              "      <td>54</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Cambridge</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>Acorn Motor Inn</td>\n",
+              "      <td>100.787105</td>\n",
+              "      <td>3.750000</td>\n",
+              "      <td>20</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Oak Harbor</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "    <div class=\"colab-df-buttons\">\n",
+              "\n",
+              "  <div class=\"colab-df-container\">\n",
+              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5a668a67-89ae-489e-a278-cbe66cf5b6e5')\"\n",
+              "            title=\"Convert this dataframe to an interactive table.\"\n",
+              "            style=\"display:none;\">\n",
+              "\n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+              "  </svg>\n",
+              "    </button>\n",
+              "\n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-buttons div {\n",
+              "      margin-bottom: 4px;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "    <script>\n",
+              "      const buttonEl =\n",
+              "        document.querySelector('#df-5a668a67-89ae-489e-a278-cbe66cf5b6e5 button.colab-df-convert');\n",
+              "      buttonEl.style.display =\n",
+              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "      async function convertToInteractive(key) {\n",
+              "        const element = document.querySelector('#df-5a668a67-89ae-489e-a278-cbe66cf5b6e5');\n",
+              "        const dataTable =\n",
+              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                    [key], {});\n",
+              "        if (!dataTable) return;\n",
+              "\n",
+              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "          + ' to learn more about interactive tables.';\n",
+              "        element.innerHTML = '';\n",
+              "        dataTable['output_type'] = 'display_data';\n",
+              "        await google.colab.output.renderOutput(dataTable, element);\n",
+              "        const docLink = document.createElement('div');\n",
+              "        docLink.innerHTML = docLinkHtml;\n",
+              "        element.appendChild(docLink);\n",
+              "      }\n",
+              "    </script>\n",
+              "  </div>\n",
+              "\n",
+              "\n",
+              "    </div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "variable_name": "df_books",
+              "summary": "{\n  \"name\": \"df_books\",\n  \"rows\": 623,\n  \"fields\": [\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 623,\n        \"samples\": [\n          \"Hampton Inn Roanoke/salem\",\n          \"Super 8 Metropolis\",\n          \"Drury Inn and Suites Columbus Convention Center\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"price\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 9.514494608519156,\n        \"min\": 81.95384182332211,\n        \"max\": 121.06048279685156,\n        \"num_unique_values\": 623,\n        \"samples\": [\n          97.03571613889734,\n          96.43920055033284,\n          112.37363684162978\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"rating\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 1.0470488242452078,\n        \"min\": 0.0,\n        \"max\": 8.368932038834952,\n        \"num_unique_values\": 435,\n        \"samples\": [\n          2.25,\n          3.1818181818181817,\n          4.235294117647059\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"n_reviews\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 80,\n        \"min\": 1,\n        \"max\": 1185,\n        \"num_unique_values\": 157,\n        \"samples\": [\n          714,\n          44,\n          156\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"hotel_type\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Resort Hotel\",\n          \"City Hotel\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"city\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 494,\n        \"samples\": [\n          \"Alexandria\",\n          \"Detroit Lakes\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 24
+        }
+      ],
+      "source": [
+        "# df_books is already ready and contains title, price, and rating\n",
+        "df_books = df_books[[\"title\", \"price\", \"rating\", \"n_reviews\", \"hotel_type\", \"city\"]].copy()\n",
+        "df_books.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "duI5dv3CZYvF"
+      },
+      "source": [
+        "### *d. Save web-scraped dataframe either as a CSV or Excel file*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 25,
+      "metadata": {
+        "id": "lC1U_YHtZifh"
+      },
+      "outputs": [],
+      "source": [
+        "# Save hotel-level dataframe\n",
+        "df_books.to_csv(\"hotel_level_features.csv\", index=False)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qMjRKMBQZlJi"
+      },
+      "source": [
+        "### *e. ✋🏻🛑⛔️ View first fiew lines*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 26,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        },
+        "id": "O_wIvTxYZqCK",
+        "outputId": "2f6edaf4-e853-4d9c-c2a1-c17502991c08"
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                            title       price    rating  n_reviews  \\\n",
+              "0                        1785 Inn  118.821735  2.625000         16   \n",
+              "1                      1900 House  117.644891  4.571429         14   \n",
+              "2              40 Berkeley Hostel  105.233747  3.329193        161   \n",
+              "3  A Bed & Breakfast In Cambridge   93.780093  3.574074         54   \n",
+              "4                 Acorn Motor Inn  100.787105  3.750000         20   \n",
+              "\n",
+              "   hotel_type          city  \n",
+              "0  City Hotel  North Conway  \n",
+              "1  City Hotel  Narragansett  \n",
+              "2  City Hotel        Boston  \n",
+              "3  City Hotel     Cambridge  \n",
+              "4  City Hotel    Oak Harbor  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-68f009ad-80be-4ef7-b9a7-334999a5cb00\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>title</th>\n",
+              "      <th>price</th>\n",
+              "      <th>rating</th>\n",
+              "      <th>n_reviews</th>\n",
+              "      <th>hotel_type</th>\n",
+              "      <th>city</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>1785 Inn</td>\n",
+              "      <td>118.821735</td>\n",
+              "      <td>2.625000</td>\n",
+              "      <td>16</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>North Conway</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>1900 House</td>\n",
+              "      <td>117.644891</td>\n",
+              "      <td>4.571429</td>\n",
+              "      <td>14</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Narragansett</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>40 Berkeley Hostel</td>\n",
+              "      <td>105.233747</td>\n",
+              "      <td>3.329193</td>\n",
+              "      <td>161</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Boston</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>A Bed &amp; Breakfast In Cambridge</td>\n",
+              "      <td>93.780093</td>\n",
+              "      <td>3.574074</td>\n",
+              "      <td>54</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Cambridge</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>Acorn Motor Inn</td>\n",
+              "      <td>100.787105</td>\n",
+              "      <td>3.750000</td>\n",
+              "      <td>20</td>\n",
+              "      <td>City Hotel</td>\n",
+              "      <td>Oak Harbor</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "    <div class=\"colab-df-buttons\">\n",
+              "\n",
+              "  <div class=\"colab-df-container\">\n",
+              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-68f009ad-80be-4ef7-b9a7-334999a5cb00')\"\n",
+              "            title=\"Convert this dataframe to an interactive table.\"\n",
+              "            style=\"display:none;\">\n",
+              "\n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+              "  </svg>\n",
+              "    </button>\n",
+              "\n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-buttons div {\n",
+              "      margin-bottom: 4px;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "    <script>\n",
+              "      const buttonEl =\n",
+              "        document.querySelector('#df-68f009ad-80be-4ef7-b9a7-334999a5cb00 button.colab-df-convert');\n",
+              "      buttonEl.style.display =\n",
+              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "      async function convertToInteractive(key) {\n",
+              "        const element = document.querySelector('#df-68f009ad-80be-4ef7-b9a7-334999a5cb00');\n",
+              "        const dataTable =\n",
+              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                    [key], {});\n",
+              "        if (!dataTable) return;\n",
+              "\n",
+              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "          + ' to learn more about interactive tables.';\n",
+              "        element.innerHTML = '';\n",
+              "        dataTable['output_type'] = 'display_data';\n",
+              "        await google.colab.output.renderOutput(dataTable, element);\n",
+              "        const docLink = document.createElement('div');\n",
+              "        docLink.innerHTML = docLinkHtml;\n",
+              "        element.appendChild(docLink);\n",
+              "      }\n",
+              "    </script>\n",
+              "  </div>\n",
+              "\n",
+              "\n",
+              "    </div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "variable_name": "df_books",
+              "summary": "{\n  \"name\": \"df_books\",\n  \"rows\": 623,\n  \"fields\": [\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 623,\n        \"samples\": [\n          \"Hampton Inn Roanoke/salem\",\n          \"Super 8 Metropolis\",\n          \"Drury Inn and Suites Columbus Convention Center\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"price\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 9.514494608519156,\n        \"min\": 81.95384182332211,\n        \"max\": 121.06048279685156,\n        \"num_unique_values\": 623,\n        \"samples\": [\n          97.03571613889734,\n          96.43920055033284,\n          112.37363684162978\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"rating\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 1.0470488242452078,\n        \"min\": 0.0,\n        \"max\": 8.368932038834952,\n        \"num_unique_values\": 435,\n        \"samples\": [\n          2.25,\n          3.1818181818181817,\n          4.235294117647059\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"n_reviews\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 80,\n        \"min\": 1,\n        \"max\": 1185,\n        \"num_unique_values\": 157,\n        \"samples\": [\n          714,\n          44,\n          156\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"hotel_type\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Resort Hotel\",\n          \"City Hotel\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"city\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 494,\n        \"samples\": [\n          \"Alexandria\",\n          \"Detroit Lakes\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 26
+        }
+      ],
+      "source": [
+        "df_books.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "p-1Pr2szaqLk"
+      },
+      "source": [
+        "## **3.** 🧩 Create a meaningful connection between real & synthetic datasets"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "SIaJUGIpaH4V"
+      },
+      "source": [
+        "### *a. Initial setup*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "id": "-gPXGcRPuV_9"
+      },
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import random\n",
+        "from datetime import datetime\n",
+        "import warnings\n",
+        "\n",
+        "warnings.filterwarnings(\"ignore\")\n",
+        "random.seed(2025)\n",
+        "np.random.seed(2025)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pY4yCoIuaQqp"
+      },
+      "source": [
+        "### *b. Generate popularity scores based on hotel rating and review volume*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 28,
+      "metadata": {
+        "id": "mnd5hdAbaNjz"
+      },
+      "outputs": [],
+      "source": [
+        "def generate_popularity_score(avg_rating, n_reviews):\n",
+        "    base = round(avg_rating)\n",
+        "\n",
+        "    if n_reviews >= 20:\n",
+        "        volume_bonus = 1\n",
+        "    else:\n",
+        "        volume_bonus = 0\n",
+        "\n",
+        "    noise = random.choice([-1, 0, 0, 1])\n",
+        "\n",
+        "    return int(np.clip(base + volume_bonus + noise, 1, 5))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "n4-TaNTFgPak"
+      },
+      "source": [
+        "### *c. ✋🏻🛑⛔️ Create a \"popularity_score\" column from rating and number of reviews*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 29,
+      "metadata": {
+        "id": "V-G3OCUCgR07"
+      },
+      "outputs": [],
+      "source": [
+        "df_books[\"popularity_score\"] = df_books.apply(\n",
+        "    lambda row: generate_popularity_score(row[\"rating\"], row[\"n_reviews\"]),\n",
+        "    axis=1\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "HnngRNTgacYt"
+      },
+      "source": [
+        "### *d. Decide on the sentiment_label based on the popularity score with a get_sentiment function*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 30,
+      "metadata": {
+        "id": "kUtWmr8maZLZ"
+      },
+      "outputs": [],
+      "source": [
+        "def get_sentiment(popularity_score):\n",
+        "    if popularity_score <= 2:\n",
+        "        return \"negative\"\n",
+        "    elif popularity_score == 3:\n",
+        "        return \"neutral\"\n",
+        "    else:\n",
+        "        return \"positive\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "HF9F9HIzgT7Z"
+      },
+      "source": [
+        "### *e. ✋🏻🛑⛔️ Run the function to create a \"sentiment_label\" column from \"popularity_score\"*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 31,
+      "metadata": {
+        "id": "tafQj8_7gYCG"
+      },
+      "outputs": [],
+      "source": [
+        "df_books[\"sentiment_label\"] = df_books[\"popularity_score\"].apply(get_sentiment)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "T8AdKkmASq9a"
+      },
+      "source": [
+        "## **4.** 📈 Generate synthetic hotel demand data for 18 months"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "OhXbdGD5fH0c"
+      },
+      "source": [
+        "### *a. Create a generate_sales_profile function based on hotel type and popularity*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 32,
+      "metadata": {
+        "id": "qkVhYPXGbgEn"
+      },
+      "outputs": [],
+      "source": [
+        "from datetime import datetime\n",
+        "\n",
+        "# Build a real monthly baseline from booking data\n",
+        "bookings[\"month_date\"] = pd.to_datetime(\n",
+        "    bookings[\"arrival_date_month\"] + \" \" + bookings[\"arrival_date_year\"].astype(str),\n",
+        "    format=\"%B %Y\",\n",
+        "    errors=\"coerce\"\n",
+        ")\n",
+        "\n",
+        "bookings = bookings.dropna(subset=[\"month_date\"])\n",
+        "bookings[\"month_num\"] = bookings[\"month_date\"].dt.month\n",
+        "\n",
+        "monthly_baseline = (\n",
+        "    bookings.groupby([\"hotel\", \"month_num\"])\n",
+        "    .size()\n",
+        "    .reset_index(name=\"base_demand\")\n",
+        ")\n",
+        "\n",
+        "def generate_sales_profile(hotel_type, popularity_score):\n",
+        "    months = pd.date_range(end=datetime.today(), periods=18, freq=\"M\")\n",
+        "    hotel_baseline = monthly_baseline[monthly_baseline[\"hotel\"] == hotel_type]\n",
+        "\n",
+        "    if hotel_baseline.empty:\n",
+        "        base_mean = 100\n",
+        "    else:\n",
+        "        base_mean = hotel_baseline[\"base_demand\"].mean()\n",
+        "\n",
+        "    multiplier_map = {\n",
+        "        1: 0.6,\n",
+        "        2: 0.8,\n",
+        "        3: 1.0,\n",
+        "        4: 1.2,\n",
+        "        5: 1.4\n",
+        "    }\n",
+        "\n",
+        "    popularity_multiplier = multiplier_map.get(popularity_score, 1.0)\n",
+        "\n",
+        "    records = []\n",
+        "    for month in months:\n",
+        "        month_num = month.month\n",
+        "\n",
+        "        month_row = hotel_baseline[hotel_baseline[\"month_num\"] == month_num]\n",
+        "        if not month_row.empty:\n",
+        "            seasonal_base = month_row[\"base_demand\"].values[0]\n",
+        "        else:\n",
+        "            seasonal_base = base_mean\n",
+        "\n",
+        "        units = max(\n",
+        "            5,\n",
+        "            int(np.random.normal((seasonal_base / 40) * popularity_multiplier, 5))\n",
+        "        )\n",
+        "\n",
+        "        records.append((month.strftime(\"%Y-%m\"), units))\n",
+        "\n",
+        "    return records"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "L2ak1HlcgoTe"
+      },
+      "source": [
+        "### *b. Build sales_data using hotel_type and popularity_score*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 33,
+      "metadata": {
+        "id": "SlJ24AUafoDB"
+      },
+      "outputs": [],
+      "source": [
+        "sales_data = []\n",
+        "\n",
+        "for _, row in df_books.iterrows():\n",
+        "    records = generate_sales_profile(row[\"hotel_type\"], row[\"popularity_score\"])\n",
+        "    for month, units in records:\n",
+        "        sales_data.append({\n",
+        "            \"title\": row[\"title\"],\n",
+        "            \"month\": month,\n",
+        "            \"units_sold\": units,\n",
+        "            \"sentiment_label\": row[\"sentiment_label\"]\n",
+        "        })"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4IXZKcCSgxnq"
+      },
+      "source": [
+        "### *c. ✋🏻🛑⛔️ Create a df_sales DataFrame from sales_data*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 34,
+      "metadata": {
+        "id": "wcN6gtiZg-ws"
+      },
+      "outputs": [],
+      "source": [
+        "df_sales = pd.DataFrame(sales_data)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "EhIjz9WohAmZ"
+      },
+      "source": [
+        "### *d. Save df_sales as synthetic_sales_data.csv & view first few lines*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 35,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "MzbZvLcAhGaH",
+        "outputId": "7975a59e-178e-4d25-98f5-02d90cbd97b0"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "      title    month  units_sold sentiment_label\n",
+            "0  1785 Inn  2024-10         151        negative\n",
+            "1  1785 Inn  2024-11          90        negative\n",
+            "2  1785 Inn  2024-12          75        negative\n",
+            "3  1785 Inn  2025-01          71        negative\n",
+            "4  1785 Inn  2025-02          98        negative\n"
+          ]
+        }
+      ],
+      "source": [
+        "df_sales.to_csv(\"synthetic_sales_data.csv\", index=False)\n",
+        "\n",
+        "print(df_sales.head())"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7g9gqBgQMtJn"
+      },
+      "source": [
+        "## **5.** 🎯 Generate synthetic customer review dataset using hotel reviews"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Gi4y9M9KuDWx"
+      },
+      "source": [
+        "### *a. ✋🏻🛑⛔️ Create fallback review texts for each sentiment label*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 36,
+      "metadata": {
+        "id": "b3cd2a50"
+      },
+      "outputs": [],
+      "source": [
+        "synthetic_reviews_by_sentiment = {\n",
+        "    \"positive\": [\n",
+        "        \"The hotel was excellent and the overall experience was very satisfying.\",\n",
+        "        \"Very good service, clean rooms, and a pleasant stay.\",\n",
+        "        \"A great experience with friendly staff and strong service quality.\",\n",
+        "        \"The hotel exceeded expectations and the stay was very comfortable.\",\n",
+        "        \"Excellent service and a very enjoyable overall experience.\",\n",
+        "        \"The rooms were clean and the hotel atmosphere was very pleasant.\",\n",
+        "        \"A very satisfying stay with professional staff and good facilities.\",\n",
+        "        \"The hotel experience was smooth, comfortable, and enjoyable.\",\n",
+        "        \"Strong service quality and a very positive stay overall.\",\n",
+        "        \"The hotel offered a high-quality experience from start to finish.\"\n",
+        "    ],\n",
+        "    \"neutral\": [\n",
+        "        \"The stay was acceptable but not especially memorable.\",\n",
+        "        \"The hotel was average and the experience was correct overall.\",\n",
+        "        \"Some aspects were good, while others could be improved.\",\n",
+        "        \"The stay was fine but quite standard.\",\n",
+        "        \"The hotel met expectations without standing out.\",\n",
+        "        \"The overall experience was balanced, with both positive and negative points.\",\n",
+        "        \"The stay was decent and the service was acceptable.\",\n",
+        "        \"Nothing was particularly bad, but nothing was exceptional either.\",\n",
+        "        \"The experience was normal and relatively satisfactory.\",\n",
+        "        \"The hotel was reasonable but could improve in some areas.\"\n",
+        "    ],\n",
+        "    \"negative\": [\n",
+        "        \"The experience was disappointing and the service could be improved.\",\n",
+        "        \"The hotel did not fully meet expectations.\",\n",
+        "        \"Several aspects of the stay were below standard.\",\n",
+        "        \"The service quality was disappointing during the stay.\",\n",
+        "        \"The overall hotel experience was less satisfying than expected.\",\n",
+        "        \"Some important elements of the stay need improvement.\",\n",
+        "        \"The hotel experience was not fully satisfactory.\",\n",
+        "        \"The service and comfort level were below expectations.\",\n",
+        "        \"The stay had several weak points and was disappointing overall.\",\n",
+        "        \"The customer experience should be improved in future.\"\n",
+        "    ]\n",
+        "}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fQhfVaDmuULT"
+      },
+      "source": [
+        "### *b. Generate 10 reviews per hotel using real hotel reviews when available*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 37,
+      "metadata": {
+        "id": "l2SRc3PjuTGM"
+      },
+      "outputs": [],
+      "source": [
+        "review_rows = []\n",
+        "\n",
+        "for _, row in df_books.iterrows():\n",
+        "    hotel_name = row[\"title\"]\n",
+        "    sentiment_label = row[\"sentiment_label\"]\n",
+        "\n",
+        "    hotel_reviews = reviews_raw[reviews_raw[\"name\"] == hotel_name][\"reviews.text\"].dropna().tolist()\n",
+        "\n",
+        "    if len(hotel_reviews) >= 10:\n",
+        "        sampled_reviews = random.sample(hotel_reviews, 10)\n",
+        "    elif len(hotel_reviews) > 0:\n",
+        "        sampled_reviews = hotel_reviews.copy()\n",
+        "        while len(sampled_reviews) < 10:\n",
+        "            sampled_reviews.append(random.choice(hotel_reviews))\n",
+        "    else:\n",
+        "        sampled_reviews = [random.choice(synthetic_reviews_by_sentiment[sentiment_label]) for _ in range(10)]\n",
+        "\n",
+        "    for review_text in sampled_reviews[:10]:\n",
+        "        review_rows.append({\n",
+        "            \"title\": hotel_name,\n",
+        "            \"sentiment_label\": sentiment_label,\n",
+        "            \"review_text\": review_text,\n",
+        "            \"rating\": row[\"rating\"],\n",
+        "            \"popularity_score\": row[\"popularity_score\"]\n",
+        "        })"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "bmJMXF-Bukdm"
+      },
+      "source": [
+        "### *c. Create the final dataframe df_reviews & save it as synthetic_book_reviews.csv*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 38,
+      "metadata": {
+        "id": "ZUKUqZsuumsp"
+      },
+      "outputs": [],
+      "source": [
+        "df_reviews = pd.DataFrame(review_rows)\n",
+        "df_reviews.to_csv(\"synthetic_book_reviews.csv\", index=False)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 39,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "3946e521",
+        "outputId": "89a60601-d358-4f6a-b789-5e81b04ca222"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "✅ Wrote synthetic_title_level_features.csv\n",
+            "✅ Wrote synthetic_monthly_revenue_series.csv\n"
+          ]
+        }
+      ],
+      "source": [
+        "\n",
+        "# ============================================================\n",
+        "# ✅ Create \"R-ready\" derived inputs (root-level files)\n",
+        "# ============================================================\n",
+        "# These two files make the R notebook robust and fast:\n",
+        "# 1) synthetic_title_level_features.csv  -> regression-ready, one row per title\n",
+        "# 2) synthetic_monthly_revenue_series.csv -> forecasting-ready, one row per month\n",
+        "\n",
+        "import numpy as np\n",
+        "\n",
+        "def _safe_num(s):\n",
+        "    return pd.to_numeric(\n",
+        "        pd.Series(s).astype(str).str.replace(r\"[^0-9.]\", \"\", regex=True),\n",
+        "        errors=\"coerce\"\n",
+        "    )\n",
+        "\n",
+        "# --- Clean hotel metadata (price/rating) ---\n",
+        "df_books_r = df_books.copy()\n",
+        "if \"price\" in df_books_r.columns:\n",
+        "    df_books_r[\"price\"] = _safe_num(df_books_r[\"price\"])\n",
+        "if \"rating\" in df_books_r.columns:\n",
+        "    df_books_r[\"rating\"] = _safe_num(df_books_r[\"rating\"])\n",
+        "\n",
+        "df_books_r[\"title\"] = df_books_r[\"title\"].astype(str).str.strip()\n",
+        "\n",
+        "# --- Clean sales ---\n",
+        "df_sales_r = df_sales.copy()\n",
+        "df_sales_r[\"title\"] = df_sales_r[\"title\"].astype(str).str.strip()\n",
+        "df_sales_r[\"month\"] = pd.to_datetime(df_sales_r[\"month\"], errors=\"coerce\")\n",
+        "df_sales_r[\"units_sold\"] = _safe_num(df_sales_r[\"units_sold\"])\n",
+        "\n",
+        "# --- Clean reviews ---\n",
+        "df_reviews_r = df_reviews.copy()\n",
+        "df_reviews_r[\"title\"] = df_reviews_r[\"title\"].astype(str).str.strip()\n",
+        "df_reviews_r[\"sentiment_label\"] = df_reviews_r[\"sentiment_label\"].astype(str).str.lower().str.strip()\n",
+        "if \"rating\" in df_reviews_r.columns:\n",
+        "    df_reviews_r[\"rating\"] = _safe_num(df_reviews_r[\"rating\"])\n",
+        "if \"popularity_score\" in df_reviews_r.columns:\n",
+        "    df_reviews_r[\"popularity_score\"] = _safe_num(df_reviews_r[\"popularity_score\"])\n",
+        "\n",
+        "# --- Sentiment shares per title (from reviews) ---\n",
+        "sent_counts = (\n",
+        "    df_reviews_r.groupby([\"title\", \"sentiment_label\"])\n",
+        "    .size()\n",
+        "    .unstack(fill_value=0)\n",
+        ")\n",
+        "for lab in [\"positive\", \"neutral\", \"negative\"]:\n",
+        "    if lab not in sent_counts.columns:\n",
+        "        sent_counts[lab] = 0\n",
+        "\n",
+        "sent_counts[\"total_reviews\"] = sent_counts[[\"positive\", \"neutral\", \"negative\"]].sum(axis=1)\n",
+        "den = sent_counts[\"total_reviews\"].replace(0, np.nan)\n",
+        "sent_counts[\"share_positive\"] = sent_counts[\"positive\"] / den\n",
+        "sent_counts[\"share_neutral\"]  = sent_counts[\"neutral\"]  / den\n",
+        "sent_counts[\"share_negative\"] = sent_counts[\"negative\"] / den\n",
+        "sent_counts = sent_counts.reset_index()\n",
+        "\n",
+        "# --- Sales aggregation per title ---\n",
+        "sales_by_title = (\n",
+        "    df_sales_r.dropna(subset=[\"title\"])\n",
+        "    .groupby(\"title\", as_index=False)\n",
+        "    .agg(\n",
+        "        months_observed=(\"month\", \"nunique\"),\n",
+        "        avg_units_sold=(\"units_sold\", \"mean\"),\n",
+        "        total_units_sold=(\"units_sold\", \"sum\"),\n",
+        "    )\n",
+        ")\n",
+        "\n",
+        "# --- Hotel-level features (join sales + hotel metadata + sentiment) ---\n",
+        "df_title = (\n",
+        "    sales_by_title\n",
+        "    .merge(df_books_r[[\"title\", \"price\", \"rating\"]], on=\"title\", how=\"left\")\n",
+        "    .merge(sent_counts[[\"title\", \"share_positive\", \"share_neutral\", \"share_negative\", \"total_reviews\"]],\n",
+        "           on=\"title\", how=\"left\")\n",
+        ")\n",
+        "\n",
+        "df_title[\"avg_revenue\"] = df_title[\"avg_units_sold\"] * df_title[\"price\"]\n",
+        "df_title[\"total_revenue\"] = df_title[\"total_units_sold\"] * df_title[\"price\"]\n",
+        "\n",
+        "df_title.to_csv(\"synthetic_title_level_features.csv\", index=False)\n",
+        "print(\"✅ Wrote synthetic_title_level_features.csv\")\n",
+        "\n",
+        "# --- Monthly revenue series (proxy: units_sold * price) ---\n",
+        "monthly_rev = (\n",
+        "    df_sales_r.merge(df_books_r[[\"title\", \"price\"]], on=\"title\", how=\"left\")\n",
+        ")\n",
+        "monthly_rev[\"revenue\"] = monthly_rev[\"units_sold\"] * monthly_rev[\"price\"]\n",
+        "\n",
+        "df_monthly = (\n",
+        "    monthly_rev.dropna(subset=[\"month\"])\n",
+        "    .groupby(\"month\", as_index=False)[\"revenue\"]\n",
+        "    .sum()\n",
+        "    .rename(columns={\"revenue\": \"total_revenue\"})\n",
+        "    .sort_values(\"month\")\n",
+        ")\n",
+        "# if revenue is all NA (e.g., missing price), fallback to units_sold as a teaching proxy\n",
+        "if df_monthly[\"total_revenue\"].notna().sum() == 0:\n",
+        "    df_monthly = (\n",
+        "        df_sales_r.dropna(subset=[\"month\"])\n",
+        "        .groupby(\"month\", as_index=False)[\"units_sold\"]\n",
+        "        .sum()\n",
+        "        .rename(columns={\"units_sold\": \"total_revenue\"})\n",
+        "        .sort_values(\"month\")\n",
+        "    )\n",
+        "\n",
+        "df_monthly[\"month\"] = pd.to_datetime(df_monthly[\"month\"], errors=\"coerce\").dt.strftime(\"%Y-%m-%d\")\n",
+        "df_monthly.to_csv(\"synthetic_monthly_revenue_series.csv\", index=False)\n",
+        "print(\"✅ Wrote synthetic_monthly_revenue_series.csv\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "RYvGyVfXuo54"
+      },
+      "source": [
+        "### *d. ✋🏻🛑⛔️ View the first few lines*"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 40,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "xfE8NMqOurKo",
+        "outputId": "952335f7-1288-4af7-f32b-3f2e52e7060b"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "      title sentiment_label  \\\n",
+            "0  1785 Inn        negative   \n",
+            "1  1785 Inn        negative   \n",
+            "2  1785 Inn        negative   \n",
+            "3  1785 Inn        negative   \n",
+            "4  1785 Inn        negative   \n",
+            "\n",
+            "                                         review_text  rating  popularity_score  \n",
+            "0  I am shocked by how many good reviews this res...   2.625                 2  \n",
+            "1  Very Reasonably priced, Nice Pub, Great breakf...   2.625                 2  \n",
+            "2  to share your opinion of this businesswith YP ...   2.625                 2  \n",
+            "3  My wife and I ate dinner at the 1785 inn durin...   2.625                 2  \n",
+            "4  Billy the bartender is awesome - ask him about...   2.625                 2  \n"
+          ]
+        }
+      ],
+      "source": [
+        "print(df_reviews.head())"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "collapsed_sections": [
+        "jpASMyIQMaAq",
+        "lquNYCbfL9IM",
+        "0IWuNpxxYDJF",
+        "oCdTsin2Yfp3",
+        "T0TOeRC4Yrnn",
+        "duI5dv3CZYvF",
+        "qMjRKMBQZlJi",
+        "p-1Pr2szaqLk",
+        "SIaJUGIpaH4V",
+        "pY4yCoIuaQqp",
+        "n4-TaNTFgPak",
+        "HnngRNTgacYt",
+        "HF9F9HIzgT7Z",
+        "T8AdKkmASq9a",
+        "OhXbdGD5fH0c",
+        "L2ak1HlcgoTe",
+        "4IXZKcCSgxnq",
+        "EhIjz9WohAmZ",
+        "Gi4y9M9KuDWx",
+        "fQhfVaDmuULT",
+        "bmJMXF-Bukdm",
+        "RYvGyVfXuo54"
+      ],
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}

2a_Python_Analysis_GroupD5.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff