garancebl committed on
Commit
6af89b7
·
verified ·
1 Parent(s): 0fb1abb

Upload 2 files

1_Data_Creation_GroupD5.ipynb ADDED
@@ -0,0 +1,1402 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "4ba6aba8"
7
+ },
8
+ "source": [
9
+ "# 🤖 **Data Collection, Creation, Storage, and Processing**\n"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "markdown",
14
+ "source": [
15
+ "### N.B. on dataset naming\n",
16
+ "\n",
17
+ "This notebook is adapted from the original workshop template, which was based on book data. \n",
18
+ "For consistency with the second notebook and to avoid breaking the pipeline, some variable and file names (such as \"df_books\" or \"synthetic_book_reviews.csv\") were kept unchanged.\n",
19
+ "\n",
20
+ "However, all the data used in this project relates to hotels. \n",
21
+ "For example, \"title\" refers to the hotel name, \"price\" is a proxy for the hotel's average daily rate (ADR), and \"units_sold\" represents booking demand.\n",
22
+ "\n",
23
+ "This approach allows us to reuse the structure of the original notebooks while applying it to a different business problem."
24
+ ],
25
+ "metadata": {
26
+ "id": "N_ZPZM4Ugbr2"
27
+ }
28
+ },
29
+ {
30
+ "cell_type": "markdown",
31
+ "source": [
32
+ "We use two datasets for this project.\n",
33
+ "The first one is a hotel booking dataset with information like price (ADR), cancellations and booking details.\n",
34
+ "The second one is a hotel review dataset with text reviews and ratings.\n",
35
+ "\n",
36
+ "The first dataset is mainly quantitative. The second is mainly qualitative (free-text reviews), although it also includes numeric ratings."
37
+ ],
38
+ "metadata": {
39
+ "id": "WzucGkQ4grQm"
40
+ }
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {
45
+ "id": "jpASMyIQMaAq"
46
+ },
47
+ "source": [
48
+ "## **1.** 📦 Install required packages"
49
+ ]
50
+ },
51
+ {
52
+ "cell_type": "code",
53
+ "execution_count": 21,
54
+ "metadata": {
55
+ "colab": {
56
+ "base_uri": "https://localhost:8080/"
57
+ },
58
+ "id": "f48c8f8c",
59
+ "outputId": "d267db0b-b091-418f-a5c3-5ada4358f32e"
60
+ },
61
+ "outputs": [
62
+ {
63
+ "output_type": "stream",
64
+ "name": "stdout",
65
+ "text": [
66
+ "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n",
67
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n",
68
+ "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n",
69
+ "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n",
70
+ "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n",
71
+ "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n",
72
+ "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n",
73
+ "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n",
74
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n",
75
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n",
76
+ "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)\n",
77
+ "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n",
78
+ "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n",
79
+ "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.62.1)\n",
80
+ "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.5.0)\n",
81
+ "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n",
82
+ "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n",
83
+ "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n",
84
+ "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n",
85
+ "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.2)\n",
86
+ "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n",
87
+ "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n",
88
+ "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n",
89
+ "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n"
90
+ ]
91
+ }
92
+ ],
93
+ "source": [
94
+ "!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob"
95
+ ]
96
+ },
97
+ {
98
+ "cell_type": "markdown",
99
+ "metadata": {
100
+ "id": "lquNYCbfL9IM"
101
+ },
102
+ "source": [
103
+ "## **2.** 🏨 Load hotel booking and hotel review datasets"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "markdown",
108
+ "metadata": {
109
+ "id": "0IWuNpxxYDJF"
110
+ },
111
+ "source": [
112
+ "### *a. Initial setup*\n",
113
+ "Load the two CSV files, keep the review columns that are needed, and derive a hotel type and a price proxy for each hotel"
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "code",
118
+ "execution_count": 22,
119
+ "metadata": {
120
+ "id": "91d52125",
121
+ "colab": {
122
+ "base_uri": "https://localhost:8080/"
123
+ },
124
+ "outputId": "84a7ad0f-fbd2-45f0-d4ce-495f24390ee4"
125
+ },
126
+ "outputs": [
127
+ {
128
+ "output_type": "stream",
129
+ "name": "stdout",
130
+ "text": [
131
+ " hotel is_canceled lead_time arrival_date_year arrival_date_month \\\n",
132
+ "0 Resort Hotel 0 342 2015 July \n",
133
+ "1 Resort Hotel 0 737 2015 July \n",
134
+ "2 Resort Hotel 0 7 2015 July \n",
135
+ "3 Resort Hotel 0 13 2015 July \n",
136
+ "4 Resort Hotel 0 14 2015 July \n",
137
+ "\n",
138
+ " arrival_date_week_number arrival_date_day_of_month \\\n",
139
+ "0 27 1 \n",
140
+ "1 27 1 \n",
141
+ "2 27 1 \n",
142
+ "3 27 1 \n",
143
+ "4 27 1 \n",
144
+ "\n",
145
+ " stays_in_weekend_nights stays_in_week_nights adults ... deposit_type \\\n",
146
+ "0 0 0 2 ... No Deposit \n",
147
+ "1 0 0 2 ... No Deposit \n",
148
+ "2 0 1 1 ... No Deposit \n",
149
+ "3 0 1 1 ... No Deposit \n",
150
+ "4 0 2 2 ... No Deposit \n",
151
+ "\n",
152
+ " agent company days_in_waiting_list customer_type adr \\\n",
153
+ "0 NaN NaN 0 Transient 0.0 \n",
154
+ "1 NaN NaN 0 Transient 0.0 \n",
155
+ "2 NaN NaN 0 Transient 75.0 \n",
156
+ "3 304.0 NaN 0 Transient 75.0 \n",
157
+ "4 240.0 NaN 0 Transient 98.0 \n",
158
+ "\n",
159
+ " required_car_parking_spaces total_of_special_requests reservation_status \\\n",
160
+ "0 0 0 Check-Out \n",
161
+ "1 0 0 Check-Out \n",
162
+ "2 0 0 Check-Out \n",
163
+ "3 0 0 Check-Out \n",
164
+ "4 0 1 Check-Out \n",
165
+ "\n",
166
+ " reservation_status_date \n",
167
+ "0 2015-07-01 \n",
168
+ "1 2015-07-01 \n",
169
+ "2 2015-07-02 \n",
170
+ "3 2015-07-02 \n",
171
+ "4 2015-07-03 \n",
172
+ "\n",
173
+ "[5 rows x 32 columns]\n",
174
+ " name city reviews.date reviews.rating \\\n",
175
+ "0 Hotel Russo Palace Mableton 2013-09-22 00:00:00+00:00 4.0 \n",
176
+ "1 Hotel Russo Palace Mableton 2015-04-03 00:00:00+00:00 5.0 \n",
177
+ "2 Hotel Russo Palace Mableton 2014-05-13 00:00:00+00:00 5.0 \n",
178
+ "3 Hotel Russo Palace Mableton 2013-10-27 00:00:00+00:00 5.0 \n",
179
+ "4 Hotel Russo Palace Mableton 2015-03-05 00:00:00+00:00 5.0 \n",
180
+ "\n",
181
+ " reviews.text hotel_type \n",
182
+ "0 Pleasant 10 min walk along the sea front to th... Resort Hotel \n",
183
+ "1 Really lovely hotel. Stayed on the very top fl... Resort Hotel \n",
184
+ "2 Ett mycket bra hotell. Det som drog ner betyge... Resort Hotel \n",
185
+ "3 We stayed here for four nights in October. The... Resort Hotel \n",
186
+ "4 We stayed here for four nights in October. The... Resort Hotel \n"
187
+ ]
188
+ }
189
+ ],
190
+ "source": [
191
+ "import pandas as pd\n",
192
+ "import numpy as np\n",
193
+ "import random\n",
194
+ "import time\n",
195
+ "\n",
196
+ "# Load hotel datasets\n",
197
+ "bookings = pd.read_csv(\"hotel_bookings.csv\")\n",
198
+ "reviews_raw = pd.read_csv(\"7282_1.csv\")\n",
199
+ "\n",
200
+ "# Keep useful columns from reviews\n",
201
+ "reviews_raw = reviews_raw[[\n",
202
+ " \"name\",\n",
203
+ " \"city\",\n",
204
+ " \"reviews.date\",\n",
205
+ " \"reviews.rating\",\n",
206
+ " \"reviews.text\"\n",
207
+ "]].copy()\n",
208
+ "\n",
209
+ "reviews_raw = reviews_raw.dropna(subset=[\"name\", \"reviews.text\", \"reviews.rating\"])\n",
210
+ "reviews_raw[\"reviews.date\"] = pd.to_datetime(reviews_raw[\"reviews.date\"], errors=\"coerce\")\n",
211
+ "reviews_raw = reviews_raw.dropna(subset=[\"reviews.date\"])\n",
212
+ "\n",
213
+ "# Create hotel type from review dataset\n",
214
+ "def classify_hotel_type(name, city):\n",
215
+ "    text = (str(name) + \" \" + str(city)).lower()\n",
215
+ "    resort_words = [\"resort\", \"spa\", \"beach\", \"island\", \"sea\", \"pool\", \"palace\"]\n",
216
+ "    for word in resort_words:\n",
217
+ "        if word in text:\n",
218
+ "            return \"Resort Hotel\"\n",
219
+ "    return \"City Hotel\"\n",
221
+ "\n",
222
+ "reviews_raw[\"hotel_type\"] = reviews_raw.apply(\n",
223
+ " lambda row: classify_hotel_type(row[\"name\"], row[\"city\"]),\n",
224
+ " axis=1\n",
225
+ ")\n",
226
+ "\n",
227
+ "# Mean ADR by hotel type from booking dataset\n",
228
+ "adr_by_type = bookings.groupby(\"hotel\")[\"adr\"].mean().to_dict()\n",
229
+ "\n",
230
+ "# Create hotel-level dataset using review data\n",
231
+ "df_books = (\n",
232
+ " reviews_raw.groupby(\"name\", as_index=False)\n",
233
+ " .agg(\n",
234
+ " rating=(\"reviews.rating\", \"mean\"),\n",
235
+ " n_reviews=(\"reviews.text\", \"size\"),\n",
236
+ " hotel_type=(\"hotel_type\", lambda x: x.mode().iloc[0] if not x.mode().empty else \"City Hotel\"),\n",
237
+ " city=(\"city\", lambda x: x.mode().iloc[0] if not x.mode().empty else \"Unknown\")\n",
238
+ " )\n",
239
+ ")\n",
240
+ "\n",
241
+ "# Keep \"title\" column name for compatibility with notebook 2\n",
242
+ "df_books = df_books.rename(columns={\"name\": \"title\"})\n",
243
+ "\n",
244
+ "# Create a hotel price proxy based on real ADR\n",
245
+ "df_books[\"price\"] = df_books[\"hotel_type\"].map(adr_by_type)\n",
246
+ "df_books[\"price\"] = df_books[\"price\"] * np.random.uniform(0.85, 1.15, len(df_books))\n",
247
+ "\n",
248
+ "print(bookings.head())\n",
249
+ "print(reviews_raw.head())"
250
+ ]
251
+ },
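The keyword rule in the cell above matches substrings, so a city such as "Seattle" counts as a resort because it contains "sea". A minimal sketch of a whole-word variant, assuming the same keyword list. The name `classify_hotel_type_words` is hypothetical and not part of the notebook:

```python
import re

RESORT_WORDS = ["resort", "spa", "beach", "island", "sea", "pool", "palace"]
# \b anchors restrict each keyword to whole-word matches.
RESORT_PATTERN = re.compile(r"\b(?:" + "|".join(RESORT_WORDS) + r")\b")

def classify_hotel_type_words(name, city):
    """Hypothetical whole-word variant of classify_hotel_type."""
    text = f"{name} {city}".lower()
    return "Resort Hotel" if RESORT_PATTERN.search(text) else "City Hotel"

print(classify_hotel_type_words("Hotel Russo Palace", "Mableton"))  # Resort Hotel
print(classify_hotel_type_words("Hampton Inn", "Seattle"))          # City Hotel
```

Whether whole-word matching is the better rule depends on the data. It would also stop matching plurals such as "resorts", so the keyword list may need extending.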
252
+ {
253
+ "cell_type": "markdown",
254
+ "metadata": {
255
+ "id": "oCdTsin2Yfp3"
256
+ },
257
+ "source": [
258
+ "### *b. Build a hotel-level dataframe with title, price, and rating*"
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "code",
263
+ "execution_count": 23,
264
+ "metadata": {
265
+ "id": "xqO5Y3dnYhxt",
266
+ "colab": {
267
+ "base_uri": "https://localhost:8080/"
268
+ },
269
+ "outputId": "3c6eb1f7-9812-4ce0-bd25-d9827a378de5"
270
+ },
271
+ "outputs": [
272
+ {
273
+ "output_type": "stream",
274
+ "name": "stdout",
275
+ "text": [
276
+ " title rating n_reviews hotel_type \\\n",
277
+ "0 1785 Inn 2.625000 16 City Hotel \n",
278
+ "1 1900 House 4.571429 14 City Hotel \n",
279
+ "2 40 Berkeley Hostel 3.329193 161 City Hotel \n",
280
+ "3 A Bed & Breakfast In Cambridge 3.574074 54 City Hotel \n",
281
+ "4 Acorn Motor Inn 3.750000 20 City Hotel \n",
282
+ "\n",
283
+ " city price \n",
284
+ "0 North Conway 118.821735 \n",
285
+ "1 Narragansett 117.644891 \n",
286
+ "2 Boston 105.233747 \n",
287
+ "3 Cambridge 93.780093 \n",
288
+ "4 Oak Harbor 100.787105 \n",
289
+ "(623, 6)\n"
290
+ ]
291
+ }
292
+ ],
293
+ "source": [
294
+ "# The hotel-level dataframe is already created in cell 5\n",
295
+ "print(df_books.head())\n",
296
+ "print(df_books.shape)"
297
+ ]
298
+ },
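The price column shown above is a proxy: each hotel receives the mean ADR of its hotel type, scaled by a random factor between 0.85 and 1.15. In the notebook that factor appears to be drawn before any random seed is set (the seeds are only set later, in section 3), so prices can change between runs. A minimal sketch of a repeatable version using only the standard library. The ADR values and the helper name `price_proxy` are placeholders, not taken from the data:

```python
import random

# Placeholder mean ADR per hotel type (assumed values, not computed from hotel_bookings.csv).
adr_by_type = {"City Hotel": 100.0, "Resort Hotel": 90.0}

rng = random.Random(2025)  # dedicated, seeded generator so the draws are repeatable

def price_proxy(hotel_type):
    """Scale the type-level mean ADR by a random factor in [0.85, 1.15]."""
    return adr_by_type[hotel_type] * rng.uniform(0.85, 1.15)

prices = [price_proxy(t) for t in ["City Hotel", "Resort Hotel", "City Hotel"]]
print([round(p, 2) for p in prices])
```

With NumPy, a comparable approach would be to seed the generator before the `np.random.uniform` call in the loading cell.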
299
+ {
300
+ "cell_type": "markdown",
301
+ "metadata": {
302
+ "id": "T0TOeRC4Yrnn"
303
+ },
304
+ "source": [
305
+ "### *c. ✋🏻🛑⛔️ Use the hotel-level dataframe as df_books with \"title\", \"price\", and \"rating\"*"
306
+ ]
307
+ },
308
+ {
309
+ "cell_type": "code",
310
+ "execution_count": 24,
311
+ "metadata": {
312
+ "id": "l5FkkNhUYTHh",
313
+ "colab": {
314
+ "base_uri": "https://localhost:8080/",
315
+ "height": 206
316
+ },
317
+ "outputId": "26d2a8d5-97f1-4918-bb96-8034659868c5"
318
+ },
319
+ "outputs": [
320
+ {
321
+ "output_type": "execute_result",
322
+ "data": {
323
+ "text/plain": [
324
+ " title price rating n_reviews \\\n",
325
+ "0 1785 Inn 118.821735 2.625000 16 \n",
326
+ "1 1900 House 117.644891 4.571429 14 \n",
327
+ "2 40 Berkeley Hostel 105.233747 3.329193 161 \n",
328
+ "3 A Bed & Breakfast In Cambridge 93.780093 3.574074 54 \n",
329
+ "4 Acorn Motor Inn 100.787105 3.750000 20 \n",
330
+ "\n",
331
+ " hotel_type city \n",
332
+ "0 City Hotel North Conway \n",
333
+ "1 City Hotel Narragansett \n",
334
+ "2 City Hotel Boston \n",
335
+ "3 City Hotel Cambridge \n",
336
+ "4 City Hotel Oak Harbor "
337
+ ],
499
+ "application/vnd.google.colaboratory.intrinsic+json": {
500
+ "type": "dataframe",
501
+ "variable_name": "df_books",
502
+ "summary": "{\n \"name\": \"df_books\",\n \"rows\": 623,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 623,\n \"samples\": [\n \"Hampton Inn Roanoke/salem\",\n \"Super 8 Metropolis\",\n \"Drury Inn and Suites Columbus Convention Center\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 9.514494608519156,\n \"min\": 81.95384182332211,\n \"max\": 121.06048279685156,\n \"num_unique_values\": 623,\n \"samples\": [\n 97.03571613889734,\n 96.43920055033284,\n 112.37363684162978\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0470488242452078,\n \"min\": 0.0,\n \"max\": 8.368932038834952,\n \"num_unique_values\": 435,\n \"samples\": [\n 2.25,\n 3.1818181818181817,\n 4.235294117647059\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"n_reviews\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 1,\n \"max\": 1185,\n \"num_unique_values\": 157,\n \"samples\": [\n 714,\n 44,\n 156\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"hotel_type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Resort Hotel\",\n \"City Hotel\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"city\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 494,\n \"samples\": [\n \"Alexandria\",\n \"Detroit Lakes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
503
+ }
504
+ },
505
+ "metadata": {},
506
+ "execution_count": 24
507
+ }
508
+ ],
509
+ "source": [
510
+ "# df_books is already ready and contains title, price, and rating\n",
511
+ "df_books = df_books[[\"title\", \"price\", \"rating\", \"n_reviews\", \"hotel_type\", \"city\"]].copy()\n",
512
+ "df_books.head()"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "markdown",
517
+ "metadata": {
518
+ "id": "duI5dv3CZYvF"
519
+ },
520
+ "source": [
521
+ "### *d. Save web-scraped dataframe either as a CSV or Excel file*"
522
+ ]
523
+ },
524
+ {
525
+ "cell_type": "code",
526
+ "execution_count": 25,
527
+ "metadata": {
528
+ "id": "lC1U_YHtZifh"
529
+ },
530
+ "outputs": [],
531
+ "source": [
532
+ "# Save hotel-level dataframe\n",
533
+ "df_books.to_csv(\"hotel_level_features.csv\", index=False)"
534
+ ]
535
+ },
536
+ {
537
+ "cell_type": "markdown",
538
+ "metadata": {
539
+ "id": "qMjRKMBQZlJi"
540
+ },
541
+ "source": [
542
+ "### *e. ✋🏻🛑⛔️ View the first few lines*"
543
+ ]
544
+ },
545
+ {
546
+ "cell_type": "code",
547
+ "execution_count": 26,
548
+ "metadata": {
549
+ "colab": {
550
+ "base_uri": "https://localhost:8080/",
551
+ "height": 206
552
+ },
553
+ "id": "O_wIvTxYZqCK",
554
+ "outputId": "2f6edaf4-e853-4d9c-c2a1-c17502991c08"
555
+ },
556
+ "outputs": [
557
+ {
558
+ "output_type": "execute_result",
559
+ "data": {
560
+ "text/plain": [
561
+ " title price rating n_reviews \\\n",
562
+ "0 1785 Inn 118.821735 2.625000 16 \n",
563
+ "1 1900 House 117.644891 4.571429 14 \n",
564
+ "2 40 Berkeley Hostel 105.233747 3.329193 161 \n",
565
+ "3 A Bed & Breakfast In Cambridge 93.780093 3.574074 54 \n",
566
+ "4 Acorn Motor Inn 100.787105 3.750000 20 \n",
567
+ "\n",
568
+ " hotel_type city \n",
569
+ "0 City Hotel North Conway \n",
570
+ "1 City Hotel Narragansett \n",
571
+ "2 City Hotel Boston \n",
572
+ "3 City Hotel Cambridge \n",
573
+ "4 City Hotel Oak Harbor "
574
+ ],
736
+ "application/vnd.google.colaboratory.intrinsic+json": {
737
+ "type": "dataframe",
738
+ "variable_name": "df_books",
739
+ "summary": "{\n \"name\": \"df_books\",\n \"rows\": 623,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 623,\n \"samples\": [\n \"Hampton Inn Roanoke/salem\",\n \"Super 8 Metropolis\",\n \"Drury Inn and Suites Columbus Convention Center\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 9.514494608519156,\n \"min\": 81.95384182332211,\n \"max\": 121.06048279685156,\n \"num_unique_values\": 623,\n \"samples\": [\n 97.03571613889734,\n 96.43920055033284,\n 112.37363684162978\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0470488242452078,\n \"min\": 0.0,\n \"max\": 8.368932038834952,\n \"num_unique_values\": 435,\n \"samples\": [\n 2.25,\n 3.1818181818181817,\n 4.235294117647059\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"n_reviews\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 80,\n \"min\": 1,\n \"max\": 1185,\n \"num_unique_values\": 157,\n \"samples\": [\n 714,\n 44,\n 156\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"hotel_type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Resort Hotel\",\n \"City Hotel\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"city\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 494,\n \"samples\": [\n \"Alexandria\",\n \"Detroit Lakes\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
740
+ }
741
+ },
742
+ "metadata": {},
743
+ "execution_count": 26
744
+ }
745
+ ],
746
+ "source": [
747
+ "df_books.head()"
748
+ ]
749
+ },
750
+ {
751
+ "cell_type": "markdown",
752
+ "metadata": {
753
+ "id": "p-1Pr2szaqLk"
754
+ },
755
+ "source": [
756
+ "## **3.** 🧩 Create a meaningful connection between real & synthetic datasets"
757
+ ]
758
+ },
759
+ {
760
+ "cell_type": "markdown",
761
+ "metadata": {
762
+ "id": "SIaJUGIpaH4V"
763
+ },
764
+ "source": [
765
+ "### *a. Initial setup*"
766
+ ]
767
+ },
768
+ {
769
+ "cell_type": "code",
770
+ "execution_count": 27,
771
+ "metadata": {
772
+ "id": "-gPXGcRPuV_9"
773
+ },
774
+ "outputs": [],
775
+ "source": [
776
+ "import numpy as np\n",
777
+ "import random\n",
778
+ "from datetime import datetime\n",
779
+ "import warnings\n",
780
+ "\n",
781
+ "warnings.filterwarnings(\"ignore\")\n",
782
+ "random.seed(2025)\n",
783
+ "np.random.seed(2025)"
784
+ ]
785
+ },
786
+ {
787
+ "cell_type": "markdown",
788
+ "metadata": {
789
+ "id": "pY4yCoIuaQqp"
790
+ },
791
+ "source": [
792
+ "### *b. Generate popularity scores based on hotel rating and review volume*"
793
+ ]
794
+ },
795
+ {
796
+ "cell_type": "code",
797
+ "execution_count": 28,
798
+ "metadata": {
799
+ "id": "mnd5hdAbaNjz"
800
+ },
801
+ "outputs": [],
802
+ "source": [
803
+ "def generate_popularity_score(avg_rating, n_reviews):\n",
804
+ " base = round(avg_rating)\n",
805
+ "\n",
806
+ " if n_reviews >= 20:\n",
807
+ " volume_bonus = 1\n",
808
+ " else:\n",
809
+ " volume_bonus = 0\n",
810
+ "\n",
811
+ " noise = random.choice([-1, 0, 0, 1])\n",
812
+ "\n",
813
+ " return int(np.clip(base + volume_bonus + noise, 1, 5))"
814
+ ]
815
+ },
816
+ {
817
+ "cell_type": "markdown",
818
+ "metadata": {
819
+ "id": "n4-TaNTFgPak"
820
+ },
821
+ "source": [
822
+ "### *c. ✋🏻🛑⛔️ Create a \"popularity_score\" column from rating and number of reviews*"
823
+ ]
824
+ },
825
+ {
826
+ "cell_type": "code",
827
+ "execution_count": 29,
828
+ "metadata": {
829
+ "id": "V-G3OCUCgR07"
830
+ },
831
+ "outputs": [],
832
+ "source": [
833
+ "df_books[\"popularity_score\"] = df_books.apply(\n",
834
+ " lambda row: generate_popularity_score(row[\"rating\"], row[\"n_reviews\"]),\n",
835
+ " axis=1\n",
836
+ ")"
837
+ ]
838
+ },
839
+ {
840
+ "cell_type": "markdown",
841
+ "metadata": {
842
+ "id": "HnngRNTgacYt"
843
+ },
844
+ "source": [
845
+ "### *d. Decide on the sentiment_label based on the popularity score with a get_sentiment function*"
846
+ ]
847
+ },
848
+ {
849
+ "cell_type": "code",
850
+ "execution_count": 30,
851
+ "metadata": {
852
+ "id": "kUtWmr8maZLZ"
853
+ },
854
+ "outputs": [],
855
+ "source": [
856
+ "def get_sentiment(popularity_score):\n",
857
+ " if popularity_score <= 2:\n",
858
+ " return \"negative\"\n",
859
+ " elif popularity_score == 3:\n",
860
+ " return \"neutral\"\n",
861
+ " else:\n",
862
+ " return \"positive\""
863
+ ]
864
+ },
865
+ {
866
+ "cell_type": "markdown",
867
+ "metadata": {
868
+ "id": "HF9F9HIzgT7Z"
869
+ },
870
+ "source": [
871
+ "### *e. ✋🏻🛑⛔️ Run the function to create a \"sentiment_label\" column from \"popularity_score\"*"
872
+ ]
873
+ },
874
+ {
875
+ "cell_type": "code",
876
+ "execution_count": 31,
877
+ "metadata": {
878
+ "id": "tafQj8_7gYCG"
879
+ },
880
+ "outputs": [],
881
+ "source": [
882
+ "df_books[\"sentiment_label\"] = df_books[\"popularity_score\"].apply(get_sentiment)"
883
+ ]
884
+ },
885
+ {
886
+ "cell_type": "markdown",
887
+ "metadata": {
888
+ "id": "T8AdKkmASq9a"
889
+ },
890
+ "source": [
891
+ "## **4.** 📈 Generate synthetic hotel demand data for 18 months"
892
+ ]
893
+ },
894
+ {
895
+ "cell_type": "markdown",
896
+ "metadata": {
897
+ "id": "OhXbdGD5fH0c"
898
+ },
899
+ "source": [
900
+ "### *a. Create a generate_sales_profile function based on hotel type and popularity*"
901
+ ]
902
+ },
903
+ {
904
+ "cell_type": "code",
905
+ "execution_count": 32,
906
+ "metadata": {
907
+ "id": "qkVhYPXGbgEn"
908
+ },
909
+ "outputs": [],
910
+ "source": [
911
+ "from datetime import datetime\n",
912
+ "\n",
913
+ "# Build a real monthly baseline from booking data\n",
914
+ "bookings[\"month_date\"] = pd.to_datetime(\n",
915
+ " bookings[\"arrival_date_month\"] + \" \" + bookings[\"arrival_date_year\"].astype(str),\n",
916
+ " format=\"%B %Y\",\n",
917
+ " errors=\"coerce\"\n",
918
+ ")\n",
919
+ "\n",
920
+ "bookings = bookings.dropna(subset=[\"month_date\"])\n",
921
+ "bookings[\"month_num\"] = bookings[\"month_date\"].dt.month\n",
922
+ "\n",
923
+ "monthly_baseline = (\n",
924
+ " bookings.groupby([\"hotel\", \"month_num\"])\n",
925
+ " .size()\n",
926
+ " .reset_index(name=\"base_demand\")\n",
927
+ ")\n",
928
+ "\n",
929
+ "def generate_sales_profile(hotel_type, popularity_score):\n",
930
+ " months = pd.date_range(end=datetime.today(), periods=18, freq=\"M\")\n",
931
+ " hotel_baseline = monthly_baseline[monthly_baseline[\"hotel\"] == hotel_type]\n",
932
+ "\n",
933
+ " if hotel_baseline.empty:\n",
934
+ " base_mean = 100\n",
935
+ " else:\n",
936
+ " base_mean = hotel_baseline[\"base_demand\"].mean()\n",
937
+ "\n",
938
+ " multiplier_map = {\n",
939
+ " 1: 0.6,\n",
940
+ " 2: 0.8,\n",
941
+ " 3: 1.0,\n",
942
+ " 4: 1.2,\n",
943
+ " 5: 1.4\n",
944
+ " }\n",
945
+ "\n",
946
+ " popularity_multiplier = multiplier_map.get(popularity_score, 1.0)\n",
947
+ "\n",
948
+ " records = []\n",
949
+ " for month in months:\n",
950
+ " month_num = month.month\n",
951
+ "\n",
952
+ " month_row = hotel_baseline[hotel_baseline[\"month_num\"] == month_num]\n",
953
+ " if not month_row.empty:\n",
954
+ " seasonal_base = month_row[\"base_demand\"].values[0]\n",
955
+ " else:\n",
956
+ " seasonal_base = base_mean\n",
957
+ "\n",
958
+ " units = max(\n",
959
+ " 5,\n",
960
+ " int(np.random.normal((seasonal_base / 40) * popularity_multiplier, 5))\n",
961
+ " )\n",
962
+ "\n",
963
+ " records.append((month.strftime(\"%Y-%m\"), units))\n",
964
+ "\n",
965
+ " return records"
966
+ ]
967
+ },
968
+ {
969
+ "cell_type": "markdown",
970
+ "metadata": {
971
+ "id": "L2ak1HlcgoTe"
972
+ },
973
+ "source": [
974
+ "### *b. Build sales_data using hotel_type and popularity_score*"
975
+ ]
976
+ },
977
+ {
978
+ "cell_type": "code",
979
+ "execution_count": 33,
980
+ "metadata": {
981
+ "id": "SlJ24AUafoDB"
982
+ },
983
+ "outputs": [],
984
+ "source": [
985
+ "sales_data = []\n",
986
+ "\n",
987
+ "for _, row in df_books.iterrows():\n",
988
+ " records = generate_sales_profile(row[\"hotel_type\"], row[\"popularity_score\"])\n",
989
+ " for month, units in records:\n",
990
+ " sales_data.append({\n",
991
+ " \"title\": row[\"title\"],\n",
992
+ " \"month\": month,\n",
993
+ " \"units_sold\": units,\n",
994
+ " \"sentiment_label\": row[\"sentiment_label\"]\n",
995
+ " })"
996
+ ]
997
+ },
998
+ {
999
+ "cell_type": "markdown",
1000
+ "metadata": {
1001
+ "id": "4IXZKcCSgxnq"
1002
+ },
1003
+ "source": [
1004
+ "### *c. ✋🏻🛑⛔️ Create a df_sales DataFrame from sales_data*"
1005
+ ]
1006
+ },
1007
+ {
1008
+ "cell_type": "code",
1009
+ "execution_count": 34,
1010
+ "metadata": {
1011
+ "id": "wcN6gtiZg-ws"
1012
+ },
1013
+ "outputs": [],
1014
+ "source": [
1015
+ "df_sales = pd.DataFrame(sales_data)"
1016
+ ]
1017
+ },
1018
+ {
1019
+ "cell_type": "markdown",
1020
+ "metadata": {
1021
+ "id": "EhIjz9WohAmZ"
1022
+ },
1023
+ "source": [
1024
+ "### *d. Save df_sales as synthetic_sales_data.csv & view first few lines*"
1025
+ ]
1026
+ },
1027
+ {
1028
+ "cell_type": "code",
1029
+ "execution_count": 35,
1030
+ "metadata": {
1031
+ "colab": {
1032
+ "base_uri": "https://localhost:8080/"
1033
+ },
1034
+ "id": "MzbZvLcAhGaH",
1035
+ "outputId": "7975a59e-178e-4d25-98f5-02d90cbd97b0"
1036
+ },
1037
+ "outputs": [
1038
+ {
1039
+ "output_type": "stream",
1040
+ "name": "stdout",
1041
+ "text": [
1042
+ " title month units_sold sentiment_label\n",
1043
+ "0 1785 Inn 2024-10 151 negative\n",
1044
+ "1 1785 Inn 2024-11 90 negative\n",
1045
+ "2 1785 Inn 2024-12 75 negative\n",
1046
+ "3 1785 Inn 2025-01 71 negative\n",
1047
+ "4 1785 Inn 2025-02 98 negative\n"
1048
+ ]
1049
+ }
1050
+ ],
1051
+ "source": [
1052
+ "df_sales.to_csv(\"synthetic_sales_data.csv\", index=False)\n",
1053
+ "\n",
1054
+ "print(df_sales.head())"
1055
+ ]
1056
+ },
1057
+ {
1058
+ "cell_type": "markdown",
1059
+ "metadata": {
1060
+ "id": "7g9gqBgQMtJn"
1061
+ },
1062
+ "source": [
1063
+ "## **5.** 🎯 Generate synthetic customer review dataset using hotel reviews"
1064
+ ]
1065
+ },
1066
+ {
1067
+ "cell_type": "markdown",
1068
+ "metadata": {
1069
+ "id": "Gi4y9M9KuDWx"
1070
+ },
1071
+ "source": [
1072
+ "### *a. ✋🏻🛑⛔️ Create fallback review texts for each sentiment label*"
1073
+ ]
1074
+ },
1075
+ {
1076
+ "cell_type": "code",
1077
+ "execution_count": 36,
1078
+ "metadata": {
1079
+ "id": "b3cd2a50"
1080
+ },
1081
+ "outputs": [],
1082
+ "source": [
1083
+ "synthetic_reviews_by_sentiment = {\n",
1084
+ " \"positive\": [\n",
1085
+ " \"The hotel was excellent and the overall experience was very satisfying.\",\n",
1086
+ " \"Very good service, clean rooms, and a pleasant stay.\",\n",
1087
+ " \"A great experience with friendly staff and strong service quality.\",\n",
1088
+ " \"The hotel exceeded expectations and the stay was very comfortable.\",\n",
1089
+ " \"Excellent service and a very enjoyable overall experience.\",\n",
1090
+ " \"The rooms were clean and the hotel atmosphere was very pleasant.\",\n",
1091
+ " \"A very satisfying stay with professional staff and good facilities.\",\n",
1092
+ " \"The hotel experience was smooth, comfortable, and enjoyable.\",\n",
1093
+ " \"Strong service quality and a very positive stay overall.\",\n",
1094
+ " \"The hotel offered a high-quality experience from start to finish.\"\n",
1095
+ " ],\n",
1096
+ " \"neutral\": [\n",
1097
+ " \"The stay was acceptable but not especially memorable.\",\n",
1098
+ " \"The hotel was average and the experience was adequate overall.\",\n",
1099
+ " \"Some aspects were good, while others could be improved.\",\n",
1100
+ " \"The stay was fine but quite standard.\",\n",
1101
+ " \"The hotel met expectations without standing out.\",\n",
1102
+ " \"The overall experience was balanced, with both positive and negative points.\",\n",
1103
+ " \"The stay was decent and the service was acceptable.\",\n",
1104
+ " \"Nothing was particularly bad, but nothing was exceptional either.\",\n",
1105
+ " \"The experience was normal and relatively satisfactory.\",\n",
1106
+ " \"The hotel was reasonable but could improve in some areas.\"\n",
1107
+ " ],\n",
1108
+ " \"negative\": [\n",
1109
+ " \"The experience was disappointing and the service could be improved.\",\n",
1110
+ " \"The hotel did not fully meet expectations.\",\n",
1111
+ " \"Several aspects of the stay were below standard.\",\n",
1112
+ " \"The service quality was disappointing during the stay.\",\n",
1113
+ " \"The overall hotel experience was less satisfying than expected.\",\n",
1114
+ " \"Some important elements of the stay need improvement.\",\n",
1115
+ " \"The hotel experience was not fully satisfactory.\",\n",
1116
+ " \"The service and comfort level were below expectations.\",\n",
1117
+ " \"The stay had several weak points and was disappointing overall.\",\n",
1118
+ " \"The customer experience should be improved in future.\"\n",
1119
+ " ]\n",
1120
+ "}"
1121
+ ]
1122
+ },
1123
+ {
1124
+ "cell_type": "markdown",
1125
+ "metadata": {
1126
+ "id": "fQhfVaDmuULT"
1127
+ },
1128
+ "source": [
1129
+ "### *b. Generate 10 reviews per hotel using real hotel reviews when available*"
1130
+ ]
1131
+ },
1132
+ {
1133
+ "cell_type": "code",
1134
+ "execution_count": 37,
1135
+ "metadata": {
1136
+ "id": "l2SRc3PjuTGM"
1137
+ },
1138
+ "outputs": [],
1139
+ "source": [
1140
+ "review_rows = []\n",
1141
+ "\n",
1142
+ "for _, row in df_books.iterrows():\n",
1143
+ " hotel_name = row[\"title\"]\n",
1144
+ " sentiment_label = row[\"sentiment_label\"]\n",
1145
+ "\n",
1146
+ " hotel_reviews = reviews_raw[reviews_raw[\"name\"] == hotel_name][\"reviews.text\"].dropna().tolist()\n",
1147
+ "\n",
1148
+ " if len(hotel_reviews) >= 10:\n",
1149
+ " sampled_reviews = random.sample(hotel_reviews, 10)\n",
1150
+ " elif len(hotel_reviews) > 0:\n",
1151
+ " sampled_reviews = hotel_reviews.copy()\n",
1152
+ " while len(sampled_reviews) < 10:\n",
1153
+ " sampled_reviews.append(random.choice(hotel_reviews))\n",
1154
+ " else:\n",
1155
+ " sampled_reviews = [random.choice(synthetic_reviews_by_sentiment[sentiment_label]) for _ in range(10)]\n",
1156
+ "\n",
1157
+ " for review_text in sampled_reviews[:10]:\n",
1158
+ " review_rows.append({\n",
1159
+ " \"title\": hotel_name,\n",
1160
+ " \"sentiment_label\": sentiment_label,\n",
1161
+ " \"review_text\": review_text,\n",
1162
+ " \"rating\": row[\"rating\"],\n",
1163
+ " \"popularity_score\": row[\"popularity_score\"]\n",
1164
+ " })"
1165
+ ]
1166
+ },
1167
+ {
1168
+ "cell_type": "markdown",
1169
+ "metadata": {
1170
+ "id": "bmJMXF-Bukdm"
1171
+ },
1172
+ "source": [
1173
+ "### *c. Create the final dataframe df_reviews & save it as synthetic_book_reviews.csv*"
1174
+ ]
1175
+ },
1176
+ {
1177
+ "cell_type": "code",
1178
+ "execution_count": 38,
1179
+ "metadata": {
1180
+ "id": "ZUKUqZsuumsp"
1181
+ },
1182
+ "outputs": [],
1183
+ "source": [
1184
+ "df_reviews = pd.DataFrame(review_rows)\n",
1185
+ "df_reviews.to_csv(\"synthetic_book_reviews.csv\", index=False)"
1186
+ ]
1187
+ },
1188
+ {
1189
+ "cell_type": "code",
1190
+ "execution_count": 39,
1191
+ "metadata": {
1192
+ "colab": {
1193
+ "base_uri": "https://localhost:8080/"
1194
+ },
1195
+ "id": "3946e521",
1196
+ "outputId": "89a60601-d358-4f6a-b789-5e81b04ca222"
1197
+ },
1198
+ "outputs": [
1199
+ {
1200
+ "output_type": "stream",
1201
+ "name": "stdout",
1202
+ "text": [
1203
+ "✅ Wrote synthetic_title_level_features.csv\n",
1204
+ "✅ Wrote synthetic_monthly_revenue_series.csv\n"
1205
+ ]
1206
+ }
1207
+ ],
1208
+ "source": [
1209
+ "\n",
1210
+ "# ============================================================\n",
1211
+ "# ✅ Create \"R-ready\" derived inputs (root-level files)\n",
1212
+ "# ============================================================\n",
1213
+ "# These two files make the R notebook robust and fast:\n",
1214
+ "# 1) synthetic_title_level_features.csv -> regression-ready, one row per title\n",
1215
+ "# 2) synthetic_monthly_revenue_series.csv -> forecasting-ready, one row per month\n",
1216
+ "\n",
1217
+ "import numpy as np\n",
1218
+ "\n",
1219
+ "def _safe_num(s):\n",
1220
+ " return pd.to_numeric(\n",
1221
+ " pd.Series(s).astype(str).str.replace(r\"[^0-9.]\", \"\", regex=True),\n",
1222
+ " errors=\"coerce\"\n",
1223
+ " )\n",
1224
+ "\n",
1225
+ "# --- Clean hotel metadata (price/rating) ---\n",
1226
+ "df_books_r = df_books.copy()\n",
1227
+ "if \"price\" in df_books_r.columns:\n",
1228
+ " df_books_r[\"price\"] = _safe_num(df_books_r[\"price\"])\n",
1229
+ "if \"rating\" in df_books_r.columns:\n",
1230
+ " df_books_r[\"rating\"] = _safe_num(df_books_r[\"rating\"])\n",
1231
+ "\n",
1232
+ "df_books_r[\"title\"] = df_books_r[\"title\"].astype(str).str.strip()\n",
1233
+ "\n",
1234
+ "# --- Clean sales ---\n",
1235
+ "df_sales_r = df_sales.copy()\n",
1236
+ "df_sales_r[\"title\"] = df_sales_r[\"title\"].astype(str).str.strip()\n",
1237
+ "df_sales_r[\"month\"] = pd.to_datetime(df_sales_r[\"month\"], errors=\"coerce\")\n",
1238
+ "df_sales_r[\"units_sold\"] = _safe_num(df_sales_r[\"units_sold\"])\n",
1239
+ "\n",
1240
+ "# --- Clean reviews ---\n",
1241
+ "df_reviews_r = df_reviews.copy()\n",
1242
+ "df_reviews_r[\"title\"] = df_reviews_r[\"title\"].astype(str).str.strip()\n",
1243
+ "df_reviews_r[\"sentiment_label\"] = df_reviews_r[\"sentiment_label\"].astype(str).str.lower().str.strip()\n",
1244
+ "if \"rating\" in df_reviews_r.columns:\n",
1245
+ " df_reviews_r[\"rating\"] = _safe_num(df_reviews_r[\"rating\"])\n",
1246
+ "if \"popularity_score\" in df_reviews_r.columns:\n",
1247
+ " df_reviews_r[\"popularity_score\"] = _safe_num(df_reviews_r[\"popularity_score\"])\n",
1248
+ "\n",
1249
+ "# --- Sentiment shares per title (from reviews) ---\n",
1250
+ "sent_counts = (\n",
1251
+ " df_reviews_r.groupby([\"title\", \"sentiment_label\"])\n",
1252
+ " .size()\n",
1253
+ " .unstack(fill_value=0)\n",
1254
+ ")\n",
1255
+ "for lab in [\"positive\", \"neutral\", \"negative\"]:\n",
1256
+ " if lab not in sent_counts.columns:\n",
1257
+ " sent_counts[lab] = 0\n",
1258
+ "\n",
1259
+ "sent_counts[\"total_reviews\"] = sent_counts[[\"positive\", \"neutral\", \"negative\"]].sum(axis=1)\n",
1260
+ "den = sent_counts[\"total_reviews\"].replace(0, np.nan)\n",
1261
+ "sent_counts[\"share_positive\"] = sent_counts[\"positive\"] / den\n",
1262
+ "sent_counts[\"share_neutral\"] = sent_counts[\"neutral\"] / den\n",
1263
+ "sent_counts[\"share_negative\"] = sent_counts[\"negative\"] / den\n",
1264
+ "sent_counts = sent_counts.reset_index()\n",
1265
+ "\n",
1266
+ "# --- Sales aggregation per title ---\n",
1267
+ "sales_by_title = (\n",
1268
+ " df_sales_r.dropna(subset=[\"title\"])\n",
1269
+ " .groupby(\"title\", as_index=False)\n",
1270
+ " .agg(\n",
1271
+ " months_observed=(\"month\", \"nunique\"),\n",
1272
+ " avg_units_sold=(\"units_sold\", \"mean\"),\n",
1273
+ " total_units_sold=(\"units_sold\", \"sum\"),\n",
1274
+ " )\n",
1275
+ ")\n",
1276
+ "\n",
1277
+ "# --- Hotel-level features (join sales + hotel metadata + sentiment) ---\n",
1278
+ "df_title = (\n",
1279
+ " sales_by_title\n",
1280
+ " .merge(df_books_r[[\"title\", \"price\", \"rating\"]], on=\"title\", how=\"left\")\n",
1281
+ " .merge(sent_counts[[\"title\", \"share_positive\", \"share_neutral\", \"share_negative\", \"total_reviews\"]],\n",
1282
+ " on=\"title\", how=\"left\")\n",
1283
+ ")\n",
1284
+ "\n",
1285
+ "df_title[\"avg_revenue\"] = df_title[\"avg_units_sold\"] * df_title[\"price\"]\n",
1286
+ "df_title[\"total_revenue\"] = df_title[\"total_units_sold\"] * df_title[\"price\"]\n",
1287
+ "\n",
1288
+ "df_title.to_csv(\"synthetic_title_level_features.csv\", index=False)\n",
1289
+ "print(\"✅ Wrote synthetic_title_level_features.csv\")\n",
1290
+ "\n",
1291
+ "# --- Monthly revenue series (proxy: units_sold * price) ---\n",
1292
+ "monthly_rev = (\n",
1293
+ " df_sales_r.merge(df_books_r[[\"title\", \"price\"]], on=\"title\", how=\"left\")\n",
1294
+ ")\n",
1295
+ "monthly_rev[\"revenue\"] = monthly_rev[\"units_sold\"] * monthly_rev[\"price\"]\n",
1296
+ "\n",
1297
+ "df_monthly = (\n",
1298
+ " monthly_rev.dropna(subset=[\"month\"])\n",
1299
+ " .groupby(\"month\", as_index=False)[\"revenue\"]\n",
1300
+ " .sum()\n",
1301
+ " .rename(columns={\"revenue\": \"total_revenue\"})\n",
1302
+ " .sort_values(\"month\")\n",
1303
+ ")\n",
1304
+ "# if revenue is all NA (e.g., missing price), fallback to units_sold as a teaching proxy\n",
1305
+ "if df_monthly[\"total_revenue\"].notna().sum() == 0:\n",
1306
+ " df_monthly = (\n",
1307
+ " df_sales_r.dropna(subset=[\"month\"])\n",
1308
+ " .groupby(\"month\", as_index=False)[\"units_sold\"]\n",
1309
+ " .sum()\n",
1310
+ " .rename(columns={\"units_sold\": \"total_revenue\"})\n",
1311
+ " .sort_values(\"month\")\n",
1312
+ " )\n",
1313
+ "\n",
1314
+ "df_monthly[\"month\"] = pd.to_datetime(df_monthly[\"month\"], errors=\"coerce\").dt.strftime(\"%Y-%m-%d\")\n",
1315
+ "df_monthly.to_csv(\"synthetic_monthly_revenue_series.csv\", index=False)\n",
1316
+ "print(\"✅ Wrote synthetic_monthly_revenue_series.csv\")\n"
1317
+ ]
1318
+ },
1319
+ {
1320
+ "cell_type": "markdown",
1321
+ "metadata": {
1322
+ "id": "RYvGyVfXuo54"
1323
+ },
1324
+ "source": [
1325
+ "### *d. ✋🏻🛑⛔️ View the first few lines*"
1326
+ ]
1327
+ },
1328
+ {
1329
+ "cell_type": "code",
1330
+ "execution_count": 40,
1331
+ "metadata": {
1332
+ "colab": {
1333
+ "base_uri": "https://localhost:8080/"
1334
+ },
1335
+ "id": "xfE8NMqOurKo",
1336
+ "outputId": "952335f7-1288-4af7-f32b-3f2e52e7060b"
1337
+ },
1338
+ "outputs": [
1339
+ {
1340
+ "output_type": "stream",
1341
+ "name": "stdout",
1342
+ "text": [
1343
+ " title sentiment_label \\\n",
1344
+ "0 1785 Inn negative \n",
1345
+ "1 1785 Inn negative \n",
1346
+ "2 1785 Inn negative \n",
1347
+ "3 1785 Inn negative \n",
1348
+ "4 1785 Inn negative \n",
1349
+ "\n",
1350
+ " review_text rating popularity_score \n",
1351
+ "0 I am shocked by how many good reviews this res... 2.625 2 \n",
1352
+ "1 Very Reasonably priced, Nice Pub, Great breakf... 2.625 2 \n",
1353
+ "2 to share your opinion of this businesswith YP ... 2.625 2 \n",
1354
+ "3 My wife and I ate dinner at the 1785 inn durin... 2.625 2 \n",
1355
+ "4 Billy the bartender is awesome - ask him about... 2.625 2 \n"
1356
+ ]
1357
+ }
1358
+ ],
1359
+ "source": [
1360
+ "print(df_reviews.head())"
1361
+ ]
1362
+ }
1363
+ ],
1364
+ "metadata": {
1365
+ "colab": {
1366
+ "collapsed_sections": [
1367
+ "jpASMyIQMaAq",
1368
+ "lquNYCbfL9IM",
1369
+ "0IWuNpxxYDJF",
1370
+ "oCdTsin2Yfp3",
1371
+ "T0TOeRC4Yrnn",
1372
+ "duI5dv3CZYvF",
1373
+ "qMjRKMBQZlJi",
1374
+ "p-1Pr2szaqLk",
1375
+ "SIaJUGIpaH4V",
1376
+ "pY4yCoIuaQqp",
1377
+ "n4-TaNTFgPak",
1378
+ "HnngRNTgacYt",
1379
+ "HF9F9HIzgT7Z",
1380
+ "T8AdKkmASq9a",
1381
+ "OhXbdGD5fH0c",
1382
+ "L2ak1HlcgoTe",
1383
+ "4IXZKcCSgxnq",
1384
+ "EhIjz9WohAmZ",
1385
+ "Gi4y9M9KuDWx",
1386
+ "fQhfVaDmuULT",
1387
+ "bmJMXF-Bukdm",
1388
+ "RYvGyVfXuo54"
1389
+ ],
1390
+ "provenance": []
1391
+ },
1392
+ "kernelspec": {
1393
+ "display_name": "Python 3",
1394
+ "name": "python3"
1395
+ },
1396
+ "language_info": {
1397
+ "name": "python"
1398
+ }
1399
+ },
1400
+ "nbformat": 4,
1401
+ "nbformat_minor": 0
1402
+ }
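A minimal standalone sketch of the scoring logic from sections 3b and 3d of the notebook above, to make the clamping and the noise term easier to inspect in isolation. It is an illustration under stated assumptions, not the notebook's own code: the names `popularity_score` and `sentiment` are hypothetical, a local `random.Random` instance replaces the notebook's global seed, plain `min`/`max` replaces `np.clip`, and the review count of 12 is an invented input.

```python
import random


def popularity_score(avg_rating, n_reviews, rng):
    """Rounded rating, plus a review-volume bonus, plus small noise, clamped to 1..5."""
    base = round(avg_rating)  # Python 3 rounds exact halves to the nearest even integer
    volume_bonus = 1 if n_reviews >= 20 else 0
    noise = rng.choice([-1, 0, 0, 1])  # 0 is listed twice, so it is drawn half the time
    return max(1, min(5, base + volume_bonus + noise))


def sentiment(score):
    """Map a 1..5 popularity score to a three-way label."""
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


# Local generator: an assumption made here; the notebook seeds the global RNG instead.
rng = random.Random(2025)

# Hypothetical hotel: average rating 2.625 with 12 reviews (no volume bonus).
scores = [popularity_score(2.625, 12, rng) for _ in range(1000)]
labels = sorted({sentiment(s) for s in scores})
print(labels)
```

With these inputs the base is 3 and the noise moves the score within 2 to 4, so the same hotel could plausibly receive any of the three sentiment labels depending on the draw. Ratings far outside the 1 to 5 range (the `rating` column summary shows a maximum above 8) are absorbed by the clamp rather than rejected.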
2a_Python_Analysis_GroupD5.ipynb ADDED
The diff for this file is too large to render. See raw diff