Mpavan45 committed on
Commit
85520f6
·
verified ·
1 Parent(s): 4ee666e

Rename pages/4_EDA and Feature Engineering.py to pages/3_EDA and Feature Engineering.py

pages/3_EDA and Feature Engineering.py ADDED
@@ -0,0 +1,574 @@
1
+ import streamlit as st
2
+ import numpy as np
3
+ import pandas as pd
4
+ import matplotlib.pyplot as plt
5
+ import seaborn as sns
6
+ from io import StringIO
7
+ import sys
8
+
9
+ st.markdown("<h1 style='text-align:center; color:lime;'>EDA and Feature Engineering</h1>",unsafe_allow_html=True)
10
+
11
+
12
+ # Title of the Streamlit app
13
+ st.title("Exploratory Data Analysis (EDA) on Agoda Hotel Dataset")
14
+
15
+ # Introduction and Aim
16
+ st.header("Aim of the EDA")
17
+ st.write("""
18
+ The main objective of this EDA is to analyze Agoda's hotel dataset to identify key factors influencing hotel pricing strategies and customer booking preferences.
19
+ The analysis will focus on uncovering patterns, trends, and relationships in hotel ratings, pricing structures, discounts, and free services.
20
+ By leveraging these insights, Agoda can optimize its pricing strategy, predict booking preferences, and enhance revenue generation while maintaining customer satisfaction.
21
+ """)
22
+
23
+ # Description of the Data
24
+ st.header("Description of the Data")
25
+ st.write("""
26
+ **Overall Summary:** We are analyzing the Agoda dataset by performing EDA and Statistical Tests on the data that has already been cleaned through data wrangling to address any messiness or missing information.
27
+
28
+ **Table - Agoda_df:** The cleaned dataset consists of over 3,500 hotel listings, which serve as the sample for the hotel pricing analysis.
29
+
30
+ **Dataset Details:**
31
+ The dataset contains information about 3,219 hotel room listings with 12 features, each detailing aspects of the listing. Below is the description of each column:
32
+
33
+ | Column Name | Description |
34
+ |-----------------|---------------------------------------------------------------------------|
35
+ | hotel_name | Name of the hotel. |
36
+ | rating | Average customer rating of the hotel (float, range 1-5). |
37
+ | location | Address or locality of the hotel. |
38
+ | review_text | Customer feedback or comments about the hotel. |
39
+ | reviews | Total number of customer reviews for the hotel. |
40
+ | cashback | Cashback amount offered for the booking. |
41
+ | discount | Discount percentage applied to the room price. |
42
+ | free_services | Free services provided (e.g., breakfast, Wi-Fi). |
43
+ | cancellation | Cancellation policy for the booking (e.g., free, non-refundable). |
44
+ | price | Price of the room after discounts and cashback (float). |
45
+ | state | The state where the hotel is located. |
46
+ | category | Target variable representing the room type or category (e.g., budget, luxury). |
47
+ """)
48
+
49
+ # Table-wise EDA & Necessary Tests
50
+ st.header("Table-wise EDA and Necessary Statistical Tests")
51
+ st.write("""
52
+ **Agoda_df:** Cleaned dataset with hotel details and key features like ratings, price, reviews, cashback, discounts, and free services.
53
+
54
+ The EDA will involve the following steps:
55
+ - **Summary Statistics:** Analyze the central tendency, spread, and shape of the distribution of each feature.
56
+ - **Data Distribution:** Visualize the distribution of key features like price, ratings, reviews, cashback, etc.
57
+ - **Correlation Analysis:** Analyze relationships between numeric features like price, ratings, reviews, cashback, etc.
58
+ - **Categorical Data Analysis:** Explore categorical variables like hotel category, cancellation policy, state, and location using frequency tables and visualizations.
59
+ - **Missing Value Analysis:** Ensure no missing values remain, and check the need for imputations.
60
+ - **Outlier Detection:** Identify any outliers that may skew the analysis or predictions.
61
+ - **Statistical Tests:** Apply appropriate statistical tests to identify significant differences or relationships (e.g., t-tests for comparing means, chi-squared for categorical variables).
62
+ """)
63
+
64
+ # Placeholder for further detailed code or visualizations
65
+ st.write("Further steps will include generating visualizations and statistical tests to explore relationships between features in more detail.")
66
+
67
+ # Access dataset from session state
68
+ data= st.session_state.get("dataset")
69
+
70
+ if data is not None:
71
+ st.subheader("Dataset Preview:")
72
+ st.write(data.head()) # Display the first 5 rows
73
+
74
+
75
+ st.subheader("Info of the Dataset:")
76
+ # Redirect the output of df.info() to a string buffer
77
+ buffer = StringIO()
78
+ data.info(buf=buffer)
79
+
80
+ # Display the content in Streamlit
81
+ st.write(buffer.getvalue())
82
+
83
+ st.subheader("Dataset Statistics:")
84
+ st.write(data.describe())
85
+
86
+ st.subheader("Dataset Shape (Rows, Columns):")
87
+ st.write(data.shape)
88
+
89
+ data= data[data["price"] <= 40000] # Keep rows where price is less than or equal to 40,000
90
+
91
+ pages = st.sidebar.selectbox(
92
+ "Select analysis type", ["Univariate Analysis", "Bivariate Analysis", "Multivariate Analysis"]
93
+ )
94
+ ### univariate_analysis.
95
+ st.success("Dataset successfully loaded from session state!")
96
+ if pages == 'Univariate Analysis':
97
+ st.subheader("Univariate Analysis")
98
+
99
+ # Rating and Review Text Distribution
100
+ st.subheader("Rating and Review Text Distribution")
101
+ fig, axs = plt.subplots(1, 2, figsize=(16, 6))
102
+
103
+ data["rating"].value_counts().plot(kind='pie', title='Distribution of Ratings', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[0])
104
+ axs[0].set_title("Distribution of Ratings")
105
+
106
+ data['review_text'].value_counts().plot(kind='pie', title='Distribution of Review Text', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[1])
107
+ axs[1].set_title("Distribution of Review Text")
108
+
109
+ st.pyplot(fig)
110
+
111
+ # Hotel Star Insights
112
+ st.write("""
113
+ **Insight:**
114
+ - Majority of hotels in this data are 3-star hotels.
115
+ - 4-star and 5-star hotels are also moderately frequent.
116
+ - 1-star and 2-star hotels are lower in frequency.
117
+ """)
118
+
119
+ # Price, Cashback, and Discount Distribution
120
+ st.subheader("Price, Cashback, and Discount Distribution")
121
+ fig, axs = plt.subplots(1, 3, figsize=(16, 6))
122
+
123
+ sns.histplot(data=data, x='price', color='green', kde=True, ax=axs[0])
124
+ axs[0].set_title("Count based on price")
125
+ axs[0].set_xlabel('Price')
126
+ axs[0].set_ylabel('Number of Hotels')
127
+
128
+ sns.histplot(data=data, x='cashback', color='violet', kde=True, ax=axs[1])
129
+ axs[1].set_title("Count based on cashback")
130
+ axs[1].set_xlabel('Cashback')
131
+ axs[1].set_ylabel('Number of Hotels')
132
+
133
+ sns.histplot(data=data, x='discount', color='orange', kde=True, ax=axs[2])
134
+ axs[2].set_title("Count based on discount")
135
+ axs[2].set_xlabel('Discount')
136
+
137
+ st.pyplot(fig)
138
+ # Histogram Insights
139
+ st.subheader("Plot-wise Analysis of Histograms")
140
+ st.write("""
141
+ **Price Distribution Insight:**
142
+ - The histogram is right-skewed, showing most properties are in the lower price range.
143
+ - A long tail indicates the presence of a few very expensive properties.
144
+
145
+ **Cashback Distribution Insight:**
146
+ - The histogram is right-skewed, with the majority of properties offering lower cashback amounts.
147
+ - Only a small number of properties provide higher cashback.
148
+
149
+ **Discount Distribution Insight:**
150
+ - The histogram is right-skewed, indicating that most properties offer lower discount percentages.
151
+ - A few properties stand out with higher discounts.
152
+
153
+ **Summary:**
154
+ The data suggest that Agoda properties are generally affordable, with lower cashback and discount offers being common. Further statistical analysis could help uncover more detailed insights.
155
+ """)
156
+
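The right-skew claims in the histogram insights above are read off the plots. A minimal sketch of how they might be checked numerically, using only the standard library; the `skewness` helper and the toy price list are hypothetical illustrations and are not part of this commit:

```python
def skewness(values):
    """Population skewness g1 = m3 / m2**1.5 (positive means a longer right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5


# Hypothetical prices: mostly low values plus a few expensive listings.
toy_prices = [1200, 1500, 1800, 2000, 2200, 2500, 3000, 12000, 35000]
print("right-skewed:", skewness(toy_prices) > 0)
```

On the DataFrame itself, `data['price'].skew()` should give a comparable figure, although pandas applies a bias adjustment, so the two values will not match exactly.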
157
+ # Cancellation and State Distribution
158
+ st.subheader("Cancellation and State Distribution")
159
+ fig, axs = plt.subplots(1, 2, figsize=(16, 6))
160
+
161
+ data["cancellation"].value_counts().plot(kind='bar', title='Distribution of Cancellation', color='red', ax=axs[0])
162
+ axs[0].set_title("Distribution of Cancellation")
163
+ axs[0].set_xlabel('Cancellation')
164
+ axs[0].set_ylabel('Number of Hotels')
165
+
166
+ data["state"].value_counts().plot(kind='bar', title='Distribution of State', color='black', ax=axs[1])
167
+ axs[1].set_title("Distribution of State")
168
+ axs[1].set_xlabel('State')
169
+ axs[1].set_ylabel('Number of Hotels')
170
+
171
+ st.pyplot(fig)
172
+ # Bar Chart Insights
173
+ st.subheader("Plot Wise Analysis of Bar Charts")
174
+ st.write("""
175
+ **Cancellations:**
176
+ - Most cancellations fall under category "1," indicating they occur within specific conditions or timeframes.
177
+
178
+ **State Distribution:**
179
+ - "Maharashtra" has the highest number of hotels, followed by "Madhya Pradesh."
180
+ - Other states like Gujarat, Karnataka, and Kerala also have notable hotel counts.
181
+ - The distribution is uneven, with some states having significantly more hotels.
182
+
183
+ **Summary:**
184
+ The charts highlight cancellation trends and the regional hotel distribution in India.
185
+ """)
186
+
187
+ # Category and Reviews Distribution
188
+ st.subheader("Category and Reviews Distribution")
189
+ fig, axs = plt.subplots(1, 2, figsize=(16, 6))
190
+
191
+ colors = sns.color_palette('Set2', n_colors=len(data["category"].value_counts()))
192
+ data["category"].value_counts().plot(kind='bar', ax=axs[0], color=colors)
193
+ axs[0].set_title("Distribution of Category")
194
+ axs[0].set_xlabel('Category')
195
+ axs[0].set_ylabel('Number of Hotels')
196
+
197
+ sns.histplot(data=data, x='reviews', color='violet', kde=True, ax=axs[1])
198
+ axs[1].set_title("Count based on Reviews")
199
+ axs[1].set_xlabel('Reviews')
200
+ axs[1].set_ylabel('Number of Hotels')
201
+
202
+ st.pyplot(fig)
203
+ # Hotel Categories and Reviews Insights
204
+ st.subheader("Plot Wise Analysis of Hotel Categories and Reviews")
205
+ st.write("""
206
+ **Category Distribution:**
207
+ - The bar chart shows "Low Budget" hotels are the most common, followed by "Budget Hotels," while "Luxury Hotels" are the least common.
208
+
209
+ **Review Count Distribution:**
210
+ - The histogram is right-skewed, with most hotels having a low number of reviews.
211
+ - A few hotels have a very high number of reviews, evident from the long tail.
212
+
213
+ **Summary:**
214
+ The data indicates a higher concentration of low-budget hotels and relatively low review counts for most hotels.
215
+ """)
216
+
217
+ # Top 10 Amenities
218
+ st.subheader("Top 10 Amenities")
219
+ amenity_counts = data['free_services'].str.split(',').explode().str.strip().value_counts().reset_index()
220
+ amenity_counts.columns = ['Amenity', 'Count']
221
+
222
+ fig, ax = plt.subplots(figsize=(10, 6))
223
+ sns.barplot(x='Count', y='Amenity', data=amenity_counts.head(10), palette='viridis', ax=ax)
224
+ ax.set_title('Top 10 Amenities')
225
+ ax.set_xlabel('Number of Hotels Offering')
226
+ ax.set_ylabel('Amenity')
227
+
228
+ st.pyplot(fig)
229
+ # Top Amenities Insights
230
+ st.subheader("Plot Wise Analysis of Top Amenities")
231
+ st.write("""
232
+ **Common Amenities:**
233
+ - Complimentary Parking is the most frequently offered amenity.
234
+ - Basic Toiletries and Hair Dryers are also widely available.
235
+
236
+ **Less Common Amenities:**
237
+ - Fitness Center Access, Welcome Drinks, and Turndown Service are less common.
238
+ - Shoe Shine Service is the least frequently offered amenity.
239
+
240
+ **Summary:**
241
+ Hotels tend to prioritize basic amenities like parking, toiletries, and hair dryers, while luxurious amenities are offered less frequently.
242
+ """)
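The amenity chart above depends on splitting the comma-separated `free_services` strings and counting each amenity once per listing. A rough standard-library sketch of that counting step, assuming the column holds strings such as "Parking, Breakfast"; `count_amenities` and the toy rows are hypothetical names used only for illustration:

```python
from collections import Counter


def count_amenities(free_services_column):
    """Count how many listings mention each amenity in a comma-separated column."""
    counts = Counter()
    for cell in free_services_column:
        if not cell:  # skip missing or empty cells
            continue
        for amenity in cell.split(','):
            amenity = amenity.strip()  # mirrors .str.strip() in the pandas chain
            if amenity:
                counts[amenity] += 1
    return counts


# Hypothetical rows, including inconsistent spacing and a missing value.
toy_column = ["Free Wi-Fi, Parking", "Parking,Breakfast", None, "Parking , Free Wi-Fi"]
print(count_amenities(toy_column).most_common(2))
```

Unlike the pandas chain in the page, this sketch drops empty entries explicitly, so a trailing comma does not produce a blank amenity.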
243
+ if pages == "Bivariate Analysis":
244
+ # Streamlit app
245
+ st.title("Bivariate Analysis")
246
+
247
+ # Price vs Rating scatter plot
248
+ st.subheader("Price vs Rating")
249
+ fig, ax = plt.subplots(figsize=(7, 5))
250
+ sns.scatterplot(x='rating', y='price', data=data, color='orange')
251
+ ax.set_title('Price vs Rating')
252
+ ax.set_xlabel('Rating')
253
+ ax.set_ylabel('Price')
254
+ st.pyplot(fig)
255
+
256
+ # Price vs Rating
257
+ st.subheader("Price vs Rating:")
258
+ st.write("""
259
+ - **Analysis:**
260
+ - Higher-priced hotels tend to have slightly better ratings, but ratings vary widely across price points.
261
+ - Hotels at various price points exhibit a large spread in ratings, meaning factors other than price (such as customer experience or amenities) contribute significantly to the rating.
262
+ """)
263
+
264
+ # Price vs Discount scatter plot
265
+ st.subheader("Price vs Discount")
266
+ fig, ax = plt.subplots(figsize=(7, 5))
267
+ sns.scatterplot(x='discount', y='price', data=data, color='green')
268
+ ax.set_title('Price vs Discount')
269
+ ax.set_xlabel('Discount')
270
+ ax.set_ylabel('Price')
271
+ st.pyplot(fig)
272
+ # Price vs Discount
273
+ st.subheader("Price vs Discount:")
274
+ st.write("""
275
+ - **Analysis:**
276
+ - Some high-priced hotels still provide discounts due to promotions or special deals.
277
+ - This observation suggests that while premium hotels may not always need to offer discounts to attract customers, they occasionally use them as a marketing strategy or for seasonal promotions.
278
+ """)
279
+
280
+ # Price vs Cashback scatter plot
281
+ st.subheader("Price vs Cashback")
282
+ fig, ax = plt.subplots(figsize=(7, 5))
283
+ sns.scatterplot(x='cashback', y='price', data=data, color='blue')
284
+ ax.set_title('Price vs Cashback')
285
+ ax.set_xlabel('Cashback')
286
+ ax.set_ylabel('Price')
287
+ st.pyplot(fig)
288
+
289
+ # Price vs Cashback
290
+ st.subheader("Price vs Cashback:")
291
+ st.write("""
292
+ - **Analysis:**
293
+ - Exceptions exist due to promotional campaigns.
294
+ - While higher-priced hotels generally offer fewer cashback incentives, some may offer cashback due to specific promotional campaigns aimed at increasing sales volume or attracting customers in a competitive market.
295
+ """)
296
+
297
+ # Price vs Category bar plot
298
+ st.subheader("Price vs Category")
299
+ fig, ax = plt.subplots(figsize=(7, 5))
300
+ sns.barplot(x='category', y='price', data=data, palette='Set2')
301
+ ax.set_title('Price vs Category')
302
+ ax.set_xlabel('Category')
303
+ ax.set_ylabel('Price')
304
+ st.pyplot(fig)
305
+ # Price vs Category
306
+ st.subheader("Price vs Category:")
307
+ st.write("""
308
+ - **Analysis:**
309
+ - "Luxury" hotels have the highest prices, followed by "Premium" and "Free & Easy."
310
+ - "Low Budget" and "Budget" hotels occupy the lower price range, showing that the category directly influences pricing strategy.
311
+ - Categories like "Luxury" and "Premium" aim to target a specific market willing to pay more for superior quality and services, while "Budget" and "Low Budget" cater to a price-sensitive segment.
312
+ """)
313
+ # Summary of Scatter Plot Analysis
314
+ st.subheader("Summary of Scatter Plot Analysis:")
315
+ st.write("""
316
+ - **Price vs Rating:** Higher-priced hotels generally offer better ratings, but the ratings vary widely across price points, indicating that factors such as service quality and amenities matter significantly.
317
+ - **Price vs Discount:** High-priced hotels may still offer discounts due to seasonal promotions or special offers.
318
+ - **Price vs Cashback:** Although high-priced hotels generally offer fewer cashback incentives, there are exceptions driven by promotional campaigns.
319
+ - **Price vs Category:** "Luxury" hotels are the most expensive, followed by "Premium" and "Free & Easy" categories. On the other hand, "Low Budget" and "Budget" hotels have lower prices.
320
+ - **Overall Insight:** The scatter plots reveal trends where higher-priced hotels tend to offer better ratings but fewer discounts and cashback incentives, while lower-priced categories tend to provide more promotional benefits such as discounts and cashback.
321
+ """)
322
+
323
+ # Rating vs Category bar plot
324
+ st.subheader("Rating vs Category")
325
+ fig, ax = plt.subplots(figsize=(7, 5))
326
+ sns.barplot(x='category', y='rating', data=data, palette='Set1')
327
+ ax.set_title('Rating vs Category')
328
+ ax.set_xlabel('Category')
329
+ ax.set_ylabel('Rating')
330
+ st.pyplot(fig)
331
+ # Rating vs Category
332
+ st.subheader("Rating vs Category:")
333
+ st.write("""
334
+ - **Analysis:**
335
+ - "Luxury" hotels lead in average ratings, followed by "Premium" hotels.
336
+ - "Budget" and "Low Budget" categories show lower average ratings, indicating that these hotels may focus on price rather than offering premium services or experiences.
337
+ - The disparity in ratings shows that customers tend to have higher expectations for luxury and premium accommodations, which are reflected in the ratings.
338
+ """)
339
+
340
+ # Discount vs Category box plot
341
+ st.subheader("Discount vs Category")
342
+ fig, ax = plt.subplots(figsize=(7, 5))
343
+ sns.boxplot(x='category', y='discount', data=data, palette='Set2')
344
+ ax.set_title('Discount vs Category')
345
+ ax.set_xlabel('Category')
346
+ ax.set_ylabel('Discount')
347
+ st.pyplot(fig)
348
+ # Discount vs Category
349
+ st.subheader("Discount vs Category:")
350
+ st.write("""
351
+ - **Analysis:**
352
+ - "Low Budget" hotels offer the highest discounts, while "Luxury" hotels provide the least discounts.
353
+ - This suggests that budget-friendly hotels use discounts as a key strategy to attract price-sensitive customers, whereas luxury hotels focus on providing premium experiences without relying on price reductions.
354
+ - The discount strategy varies by category, with lower-priced categories incentivizing customers with discounts to stay competitive.
355
+ """)
356
+ # Cashback vs Category violin plot
357
+ st.subheader("Cashback vs Category")
358
+ fig, ax = plt.subplots(figsize=(7, 5))
359
+ sns.violinplot(x='category', y='cashback', data=data, palette='Set3')
360
+ ax.set_title('Cashback vs Category')
361
+ ax.set_xlabel('Category')
362
+ ax.set_ylabel('Cashback')
363
+ st.pyplot(fig)
364
+ # Cashback vs Category
365
+ st.subheader("Cashback vs Category:")
366
+ st.write("""
367
+ - **Analysis:**
368
+ - Higher cashback offers are more common in "Low Budget" hotels, as these hotels rely on cashback incentives to attract customers looking for value deals.
369
+ - Luxury hotels rarely provide cashback, as their target market is less likely to be motivated by such offers.
370
+ - The trend highlights the different strategies employed by each category: budget options often provide financial incentives like cashback to drive bookings, while luxury options focus on premium services.
371
+ """)
372
+
373
+ # Reviews vs Category count plot
374
+ st.subheader("Reviews vs Category")
375
+ fig, ax = plt.subplots(figsize=(7, 5))
376
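+ sns.barplot(x='category', y='reviews', data=data, estimator=sum, palette='Set1', ax=ax)  # total reviews per category; countplot only counted listings
377
+ ax.set_title('Reviews vs Category (Total)')
378
+ ax.set_xlabel('Category')
379
+ ax.set_ylabel('Total Number of Reviews')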
380
+ st.pyplot(fig)
381
+ # Reviews vs Category
382
+ st.subheader("Reviews vs Category:")
383
+ st.write("""
384
+ - **Analysis:**
385
+ - "Luxury" hotels attract the most reviews, indicating that higher-quality accommodations often receive more feedback from customers.
386
+ - "Budget" and "Low Budget" hotels tend to have fewer reviews, which may be due to their more straightforward offerings and smaller customer base.
387
+ - This trend suggests that customers who opt for luxury hotels are more likely to share their experiences, whereas budget options may attract fewer repeat customers or have less word-of-mouth influence.
388
+ """)
389
+ # Summary of Bar and Box Plot Analysis
390
+ st.subheader("Summary of Bar and Box Plot Analysis:")
391
+ st.write("""
392
+ - **Rating vs Category:** "Luxury" and "Premium" hotels have higher ratings on average, while "Budget" and "Low Budget" hotels show lower ratings.
393
+ - **Discount vs Category:** Budget hotels, especially "Low Budget," offer higher discounts, while luxury hotels offer fewer discounts, relying more on their value proposition.
394
+ - **Cashback vs Category:** "Low Budget" hotels offer higher cashback incentives, while luxury hotels rarely provide cashback, highlighting the pricing strategies in different categories.
395
+ - **Reviews vs Category:** "Luxury" hotels attract the most reviews, while "Budget" and "Low Budget" hotels attract fewer.
396
+ - **Overall Insight:** Bar and box plots reveal that higher-rated and more reviewed hotels tend to offer fewer discounts or cashback, focusing on customer experience, while budget categories focus on providing customer incentives to compete in the market.
397
+ """)
398
+ # Regional price analysis by state
399
+ st.subheader("Price by State")
400
+ fig, ax = plt.subplots(figsize=(16, 6))
401
+ sns.barplot(data=data, x='state', y='price', ax=ax, color='green')
402
+ ax.set_title('Price by State')
403
+ ax.tick_params(axis='x', rotation=90)
404
+ sns.set_palette('magma')
405
+ plt.tight_layout()
406
+ st.pyplot(fig)
407
+ # Hotel Prices Across Indian States
408
+ st.subheader("Hotel Prices Across Indian States:")
409
+ st.write("""
410
+ - **Analysis:**
411
+ - Hotel prices vary significantly across Indian states, reflecting regional differences in demand and supply.
412
+ - States with popular tourist destinations, such as Goa and Rajasthan, tend to show higher hotel prices, as they attract more visitors and have a higher demand for accommodations.
413
+ - Conversely, states with less tourism or lower demand may exhibit more affordable pricing, catering to the local population or budget travelers.
414
+ - Price differences also reflect factors such as local economic conditions, infrastructure, and tourism policies in each state.
415
+ """)
416
+
417
+ # Regional category count by state
418
+ st.subheader("Category by State")
419
+ fig, ax = plt.subplots(figsize=(16, 6))
420
+ sns.countplot(data=data, x='state', hue='category', ax=ax, palette='Set1')
421
+ ax.set_title('Category by State')
422
+ ax.tick_params(axis='x', rotation=90)
423
+ plt.tight_layout()
424
+ st.pyplot(fig)
425
+ # Hotel Categories by State
426
+ st.subheader("Hotel Categories by State:")
427
+ st.write("""
428
+ - **Analysis:**
429
+ - States with a higher concentration of "Low Budget" and "Budget" hotels cater primarily to cost-conscious travelers, offering affordable accommodations for a wide range of customers.
430
+ - States with more "Luxury" hotels are likely to be major tourist hubs, such as Delhi, Mumbai, and Kerala, or regions that cater to premium audiences, offering high-end services for affluent customers.
431
+ - These states may also focus on attracting international tourists or business travelers who prefer premium amenities and luxury experiences.
432
+ """)
433
+
434
+ # Summary of Regional Price and Category Trends
435
+ st.subheader("Summary of Regional Price and Category Trends:")
436
+ st.write("""
437
+ - **Hotel Prices Across Indian States:** Prices vary significantly depending on the state, with tourist-heavy regions showing higher prices due to greater demand.
438
+ - **Hotel Categories by State:** States with more budget hotels focus on catering to price-sensitive travelers, while states with luxury hotels cater to premium audiences, often in tourist hotspots.
439
+ - **Overall Insight:** Regional trends indicate diverse pricing and category distributions, influenced by tourism, regional economics, and state-specific factors that shape hotel offerings across the country.
440
+ """)
441
+ if pages == "Multivariate Analysis":
442
+ st.title("Multivariate Analysis of Hotel Data")
443
+
444
+ # Create a subset of the data for the analysis
445
+ subset_data = data[['category', 'price', 'reviews', 'discount', 'cashback', 'rating']]
446
+
447
+ # Section 1: Price vs. Reviews by Category
448
+ st.header("Price vs. Reviews by Category")
449
+ fig1 = sns.catplot(data=data, x='reviews', y='price', hue='category', kind='strip', palette='Set2', height=6, aspect=1.5)
450
+ fig1.set_axis_labels("Reviews", "Price")
451
+ fig1.fig.suptitle('Price vs Reviews by Category', fontsize=16)
452
+ st.pyplot(fig1)
453
+ # Analysis Text for Price vs. Reviews by Category
454
+ st.write("""
455
+ - **Price Variation within Categories:**
456
+ - Wide price ranges exist within each category, with "Low Budget" hotels featuring both low- and high-priced options.
457
+ - This shows that pricing within categories isn't always uniform and may depend on other factors like location, amenities, and hotel size.
458
+
459
+ - **Price and Reviews Relationship:**
460
+ - There's a slight tendency for hotels with more reviews to have higher prices, possibly due to the influence of popularity, better marketing efforts, or higher quality services.
461
+
462
+ - **Summary:**
463
+ - The stripplot reveals a weak positive correlation between price and reviews, indicating that well-reviewed hotels tend to have higher prices.
464
+ """)
465
+
466
+ # Section 2: Price vs. Discount by Category
467
+ st.header("Price vs. Discount by Category")
468
+ fig2 = sns.catplot(data=data, x='discount', y='price', hue='category', kind='bar', palette='Set2', height=6, aspect=1.5)
469
+ fig2.set_axis_labels("Discount", "Price")
470
+ fig2.fig.suptitle('Price vs Discount by Category', fontsize=16)
471
+ st.pyplot(fig2)
472
+ # Analysis Text for Price vs. Discount by Category
473
+ st.write("""
474
+ - **Price and Discount Relationship:**
475
+ - The bar plot shows that as hotel prices increase, discounts tend to decrease, which is consistent with a negative correlation between price and discount.
476
+
477
+ - **Category-Specific Trends:**
478
+ - "Low Budget" and "Budget" hotels offer much higher discounts compared to "Premium" and "Luxury" hotels, which typically offer fewer or smaller discounts.
479
+ - This trend highlights that budget-conscious categories use higher discounts to attract customers, whereas premium and luxury categories rely on factors other than discounts (e.g., quality, exclusivity) to appeal to their clientele.
480
+
481
+ - **Summary:**
482
+ - The plot confirms that lower-priced hotels use higher discounts to attract customers, while premium and luxury hotels maintain lower discount rates, aligning with typical market behavior.
483
+ """)
484
+
485
+
486
+ # Section 3: Price vs Cashback and Rating by Category (Stripplot)
487
+ st.header("Price vs Cashback and Rating by Category")
488
+ fig3, axes2 = plt.subplots(1, 2, figsize=(16, 6))
489
+
490
+ sns.stripplot(data=data, x='cashback', y='price', hue='category', ax=axes2[0], palette='Set2', jitter=True, dodge=True)
491
+ axes2[0].set_title('Price vs Cashback by Category')
492
+
493
+ sns.stripplot(data=data, x='rating', y='price', hue='category', ax=axes2[1], palette='Set2', jitter=True, dodge=True)
494
+ axes2[1].set_title('Price vs Rating by Category')
495
+
496
+ st.pyplot(fig3)
497
+
498
+
499
+ # Analysis Text for Price vs Cashback by Category
500
+ st.write("""
501
+ - **Price vs Cashback by Category:**
502
+ - The plot shows that cashback incentives tend to decrease as hotel prices increase. However, variations exist due to promotional offers that may affect cashback amounts.
503
+ - Lower-priced categories, such as "Budget" and "Low Budget," offer higher cashback incentives to attract price-sensitive customers.
504
+
505
+ - **Summary:**
506
+ - The stripplot reveals a clear trend where lower-priced hotels use cashback as an incentive to boost bookings, whereas higher-priced hotels focus on other value propositions (e.g., premium services).
507
+ """)
508
+ # Analysis Text for Price vs Rating by Category
509
+ st.write("""
510
+ - **Price vs Rating by Category:**
511
+ - Higher-priced categories like "Premium" and "Luxury" tend to have better ratings, indicating that customers perceive these hotels as providing superior value.
512
+ - Interestingly, some lower-priced hotels achieve high ratings, suggesting that other factors such as service quality and customer experience may contribute to higher satisfaction despite the lower price point.
513
+
514
+ - **Summary:**
515
+ - The stripplot shows that while higher-priced hotels emphasize quality to achieve better ratings, lower-priced hotels still manage to deliver satisfactory experiences for customers through factors like service quality.
516
+ """)
517
+ # Section 4: Correlation Heatmap
518
+ st.header("Correlation Matrix Heatmap")
519
+ numeric_columns = ['price', 'reviews', 'discount', 'cashback', 'rating']
520
+
521
+ # Compute the correlation matrix
522
+ correlation_matrix = data[numeric_columns].corr()
523
+
524
+ # Create a heatmap to visualize the correlation matrix
525
+ fig4, ax4 = plt.subplots(figsize=(10, 8))
526
+ sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1, ax=ax4)
527
+
528
+ # Set title for the plot
529
+ ax4.set_title('Correlation Matrix Heatmap')
530
+
531
+ # Display the plot (pass the figure explicitly rather than the global pyplot module)
532
+ st.pyplot(fig4)
533
+
534
+ # Analysis Text for Correlation Heatmap
535
+ st.write("""
536
+ - **Price vs Discount/Cashback:**
537
+ - The heatmap shows a strong negative correlation between price and both discount and cashback. Higher-priced hotels tend to offer fewer discounts and cashback incentives, suggesting that premium offerings rely on value rather than incentives.
538
+
539
+ - **Price vs Rating:**
540
+ - A weak positive correlation is observed between price and rating. Higher-priced hotels generally have slightly better ratings, although this relationship is not very strong.
541
+
542
+ - **Reviews vs Rating:**
543
+ - A moderate positive correlation exists between reviews and ratings. Hotels that attract more reviews tend to have better ratings, likely due to a larger customer base providing feedback.
544
+
545
+ - **Reviews vs Price:**
546
+ - A weak positive correlation between reviews and price suggests that well-reviewed hotels are often priced higher, likely due to their popularity and perceived value.
547
+
548
+ - **Summary:**
549
+ - The heatmap provides insights into the relationships between key variables in the dataset. Higher-priced hotels tend to offer fewer discounts and cashback offers but generally have better ratings and more reviews, indicating that customers are willing to pay a premium for a better experience.
550
+ """)
551
+
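The heatmap is built from `data[numeric_columns].corr()`, which defaults to the Pearson coefficient. A small sketch of what each cell in that matrix measures, written with the standard library only; the `pearson` helper and the toy price/discount pairs are hypothetical and are not taken from the dataset:

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical listings where the discount falls steadily as the price rises.
toy_prices = [1000, 2000, 3000, 4000]
toy_discounts = [40, 30, 20, 10]
print(round(pearson(toy_prices, toy_discounts), 2))
```

A value near -1 or +1 indicates a nearly linear relationship and a value near 0 indicates little linear association, so labels such as "strong" or "weak" in the analysis text are best read against the actual numbers annotated on the heatmap.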
552
+ st.header("Overall Summary")
553
+ st.write("""
554
+ - Most properties are affordable, with lower prices, cashback, and discounts dominating the dataset.
555
+ - Regional distribution shows states like Maharashtra and Madhya Pradesh having more hotels.
556
+ - The data reflects a market focused on affordability and basic amenities, with regional and category-specific variations.
557
+ - Cancellation policies and review counts provide further insight into customer behavior, while the right-skewed distributions of price, cashback, discount, and reviews reveal long tails of high values in pricing and incentives.
558
+ """)
559
+
560
+ st.header("Why Right-Skewed Trends Are Normal, Not Outliers")
561
+ st.write("""
562
+ - Right-skewed distributions for price, cashback, discounts, and review counts are normal for this market.
563
+ - High values are infrequent but genuine observations, so they belong to the expected distribution and are not treated as outliers.
564
+ - The variations in cancellation patterns and review counts reflect typical customer behavior and industry dynamics.
565
+ """)
566
+ st.write("""
567
+ Since the high values are treated as genuine observations rather than outliers, we can proceed with model training and selection.
568
+ With clean data, we can now focus on choosing the best algorithm, tuning hyperparameters, and evaluating model performance.
569
+ """)
570
+
571
+
572
+ else:
573
+ st.warning("No dataset found in session state. Please load the dataset into `st.session_state['dataset']`.")
574
+
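Review note on the closing outlier statements: the page asserts that right-skewed values are not outliers, and one way to check such a claim is Tukey's 1.5 × IQR rule together with a skewness estimate. The sketch below is a minimal, dependency-free illustration; the helper name `iqr_outlier_report` and the sample prices are assumptions made for illustration and are not part of this commit. Inside the app, pandas `Series.quantile` and `Series.skew` could play the same role, although their quartile convention differs slightly from `statistics.quantiles`.

```python
import statistics


def iqr_outlier_report(values, k=1.5):
    """Summarise skew and list points beyond Tukey's fences (hypothetical helper)."""
    # Quartiles using the default 'exclusive' method of statistics.quantiles.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Population skewness: mean of cubed z-scores; a positive value means a right tail.
    skew = sum(((v - mean) / std) ** 3 for v in values) / len(values) if std else 0.0

    flagged = [v for v in values if v < lower or v > upper]
    return {"skew": skew, "lower": lower, "upper": upper, "flagged": flagged}


# Made-up prices for illustration only; not taken from the Agoda dataset.
sample_prices = [1200, 1500, 1800, 2000, 2200, 2500, 3000, 3500, 4200, 38000]
report = iqr_outlier_report(sample_prices)
print(report["flagged"], round(report["skew"], 2))
```

A value flagged by the fences is only a candidate: whether it is kept (a genuine luxury listing) or dropped (a data-entry error) remains a judgment call that the page's text should state explicitly.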
pages/4_EDA and Feature Engineering.py DELETED
@@ -1,572 +0,0 @@
1
- import streamlit as st
2
- import numpy as np
3
- import pandas as pd
4
- import matplotlib.pyplot as plt
5
- import seaborn as sns
6
- from io import StringIO
7
- import sys
8
-
9
- st.markdown("<h1 style='text-align:center; color:lime;'>EDA and Feature Engineering</h1>",unsafe_allow_html=True)
10
-
11
-
12
- # Title of the Streamlit app
13
- st.title("Exploratory Data Analysis (EDA) on Agoda Hotel Dataset")
14
-
15
- # Introduction and Aim
16
- st.header("Aim of the EDA")
17
- st.write("""
18
- The main objective of this EDA is to analyze Agoda's hotel dataset to identify key factors influencing hotel pricing strategies and customer booking preferences.
19
- The analysis will focus on uncovering patterns, trends, and relationships in hotel ratings, pricing structures, discounts, and free services.
20
- By leveraging these insights, Agoda can optimize its pricing strategy, predict booking preferences, and enhance revenue generation while maintaining customer satisfaction.
21
- """)
22
-
23
- # Description of the Data
24
- st.header("Description of the Data")
25
- st.write("""
26
- **Overall Summary:** We are analyzing the Agoda dataset by performing EDA and Statistical Tests on the data that has already been cleaned through data wrangling to address any messiness or missing information.
27
-
28
- **Table - Agoda_df:** The cleaned dataset consists of over 3,500 hotel listings, which will be used as test subjects for the hotel pricing period.
29
-
30
- **Dataset Details:**
31
- The dataset contains information about 3,219 hotel room listings with 12 features, each detailing aspects of the listing. Below is the description of each column:
32
-
33
- | Column Name | Description |
34
- |-----------------|---------------------------------------------------------------------------|
35
- | hotel_name | Name of the hotel. |
36
- | rating | Average customer rating of the hotel (float, range 1-5). |
37
- | location | Address or locality of the hotel. |
38
- | review_text | Customer feedback or comments about the hotel. |
39
- | reviews | Total number of customer reviews for the hotel. |
40
- | cashback | Cashback amount offered for the booking. |
41
- | discount | Discount percentage applied to the room price. |
42
- | free_services | Free services provided (e.g., breakfast, Wi-Fi). |
43
- | cancellation | Cancellation policy for the booking (e.g., free, non-refundable). |
44
- | price | Price of the room after discounts and cashback (float). |
45
- | state | The state where the hotel is located. |
46
- | category | Target variable representing the room type or category (e.g., budget, luxury). |
47
- """)
48
-
49
- # Table-wise EDA & Necessary Tests
50
- st.header("Table-wise EDA and Necessary Statistical Tests")
51
- st.write("""
52
- **Agoda_df:** Cleaned dataset with hotel details and key features like ratings, price, reviews, cashback, discounts, and free services.
53
-
54
- The EDA will involve the following steps:
55
- - **Summary Statistics:** Analyze the central tendency, spread, and shape of the distribution of each feature.
56
- - **Data Distribution:** Visualize the distribution of key features like price, ratings, reviews, cashback, etc.
57
- - **Correlation Analysis:** Analyze relationships between numeric features like price, ratings, reviews, cashback, etc.
58
- - **Categorical Data Analysis:** Explore categorical variables like hotel category, cancellation policy, state, and location using frequency tables and visualizations.
59
- - **Missing Value Analysis:** Ensure no missing values remain, and check the need for imputations.
60
- - **Outlier Detection:** Identify any outliers that may skew the analysis or predictions.
61
- - **Statistical Tests:** Apply appropriate statistical tests to identify significant differences or relationships (e.g., t-tests for comparing means, chi-squared for categorical variables).
62
- """)
63
-
64
- # Placeholder for further detailed code or visualizations
65
- st.write("Further steps will include generating visualizations and statistical tests to explore relationships between features in more detail.")
66
-
67
- # Access dataset from session state
68
- data= st.session_state.get("dataset")
69
-
70
- if data is not None:
71
- st.subheader("Dataset Preview:")
72
- st.write(data.head()) # Display the first 5 rows
73
-
74
-
75
- st.subheader("Info of the Dataset:")
76
- # Redirect the output of df.info() to a string buffer
77
- buffer = StringIO()
78
- data.info(buf=buffer)
79
-
80
- # Display the content in Streamlit
81
- st.text(buffer.getvalue())
82
-
83
- st.subheader("Dataset Statistics:")
84
- st.write(data.describe())
85
-
86
- st.subheader("Dataset Shape (Rows, Columns):")
87
- st.write(data.shape)
88
-
89
- data= data[data["price"] <= 40000] # Keep rows where price is less than or equal to 40,000
90
-
91
-
92
- ### univariate_analysis.
93
- st.success("Dataset successfully loaded from session state!")
94
-
95
- st.subheader("Univariate Analysis")
96
-
97
- # Rating and Review Text Distribution
98
- st.subheader("Rating and Review Text Distribution")
99
- fig, axs = plt.subplots(1, 2, figsize=(16, 6))
100
-
101
- data["rating"].value_counts().plot(kind='pie', title='Distribution of Ratings', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[0])
102
- axs[0].set_title("Distribution of Ratings")
103
-
104
- data['review_text'].value_counts().plot(kind='pie', title='Distribution of Review Text', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[1])
105
- axs[1].set_title("Distribution of Review Text")
106
-
107
- st.pyplot(fig)
108
-
109
- # Hotel Star Insights
110
- st.write("""
111
- **Insight:**
112
- - Majority of hotels in this data are 3-star hotels.
113
- - 4-star and 5-star hotels also occur with moderate frequency.
114
- - 1-star and 2-star hotels are lower in frequency.
115
- """)
116
-
117
- # Price, Cashback, and Discount Distribution
118
- st.subheader("Price, Cashback, and Discount Distribution")
119
- fig, axs = plt.subplots(1, 3, figsize=(16, 6))
120
-
121
- sns.histplot(data=data, x='price', color='green', kde=True, ax=axs[0])
122
- axs[0].set_title("Count based on price")
123
- axs[0].set_xlabel('Price')
124
- axs[0].set_ylabel('Number of Hotels')
125
-
126
- sns.histplot(data=data, x='cashback', color='violet', kde=True, ax=axs[1])
127
- axs[1].set_title("Count based on cashback")
128
- axs[1].set_xlabel('Cashback')
129
- axs[1].set_ylabel('Number of Hotels')
130
-
131
- sns.histplot(data=data, x='discount', color='orange', kde=True, ax=axs[2])
132
- axs[2].set_title("Count based on discount")
133
- axs[2].set_xlabel('Discount')
134
-
135
- st.pyplot(fig)
136
- # Histogram Insights
137
- st.subheader("Plot-wise Analysis of Histograms")
138
- st.write("""
139
- **Price Distribution Insight:**
140
- - The histogram is right-skewed, showing most properties are in the lower price range.
141
- - A long tail indicates the presence of a few very expensive properties.
142
-
143
- **Cashback Distribution Insight:**
144
- - The histogram is right-skewed, with the majority of properties offering lower cashback amounts.
145
- - Only a small number of properties provide higher cashback.
146
-
147
- **Discount Distribution Insight:**
148
- - The histogram is right-skewed, indicating that most properties offer lower discount percentages.
149
- - A few properties stand out with higher discounts.
150
-
151
- **Summary:**
152
- The data suggest that Agoda properties are generally affordable, with lower cashback and discount offers being common. Further statistical analysis could help uncover more detailed insights.
153
- """)
154
-
155
- # Cancellation and State Distribution
156
- st.subheader("Cancellation and State Distribution")
157
- fig, axs = plt.subplots(1, 2, figsize=(16, 6))
158
-
159
- data["cancellation"].value_counts().plot(kind='bar', title='Distribution of Cancellation', color='red', ax=axs[0])
160
- axs[0].set_title("Distribution of Cancellation")
161
- axs[0].set_xlabel('Cancellation')
162
- axs[0].set_ylabel('Number of Hotels')
163
-
164
- data["state"].value_counts().plot(kind='bar', title='Distribution of State', color='black', ax=axs[1])
165
- axs[1].set_title("Distribution of State")
166
- axs[1].set_xlabel('State')
167
- axs[1].set_ylabel('Number of Hotels')
168
-
169
- st.pyplot(fig)
170
- # Bar Chart Insights
171
- st.subheader("Plot Wise Analysis of Bar Charts")
172
- st.write("""
173
- **Cancellations:**
174
- - Most cancellations fall under category "1," indicating they occur within specific conditions or timeframes.
175
-
176
- **State Distribution:**
177
- - "Maharashtra" has the highest number of hotels, followed by "Madhya Pradesh."
178
- - Other states like Gujarat, Karnataka, and Kerala also have notable hotel counts.
179
- - The distribution is uneven, with some states having significantly more hotels.
180
-
181
- **Summary:**
182
- The charts highlight cancellation trends and the regional hotel distribution in India.
183
- """)
184
-
185
- # Category and Reviews Distribution
186
- st.subheader("Category and Reviews Distribution")
187
- fig, axs = plt.subplots(1, 2, figsize=(16, 6))
188
-
189
- colors = sns.color_palette('Set2', n_colors=len(data["category"].value_counts()))
190
- data["category"].value_counts().plot(kind='bar', ax=axs[0], color=colors)
191
- axs[0].set_title("Distribution of Category")
192
- axs[0].set_xlabel('Category')
193
- axs[0].set_ylabel('Number of Hotels')
194
-
195
- sns.histplot(data=data, x='reviews', color='violet', kde=True, ax=axs[1])
196
- axs[1].set_title("Count based on Reviews")
197
- axs[1].set_xlabel('Reviews')
198
- axs[1].set_ylabel('Number of Hotels')
199
-
200
- st.pyplot(fig)
201
- # Hotel Categories and Reviews Insights
202
- st.subheader("Plot Wise Analysis of Hotel Categories and Reviews")
203
- st.write("""
204
- **Category Distribution:**
205
- - The bar chart shows "Low Budget" hotels are the most common, followed by "Budget Hotels," while "Luxury Hotels" are the least common.
206
-
207
- **Review Count Distribution:**
208
- - The histogram is right-skewed, with most hotels having a low number of reviews.
209
- - A few hotels have a very high number of reviews, evident from the long tail.
210
-
211
- **Summary:**
212
- The data indicates a higher concentration of low-budget hotels and relatively low review counts for most hotels.
213
- """)
214
-
215
- # Top 10 Amenities
216
- st.subheader("Top 10 Amenities")
217
- amenity_counts = data['free_services'].str.split(',').explode().str.strip().value_counts().reset_index()
218
- amenity_counts.columns = ['Amenity', 'Count']
219
-
220
- fig, ax = plt.subplots(figsize=(10, 6))
221
- sns.barplot(x='Count', y='Amenity', data=amenity_counts.head(10), palette='viridis', ax=ax)
222
- ax.set_title('Top 10 Amenities')
223
- ax.set_xlabel('Number of Hotels Offering')
224
- ax.set_ylabel('Amenity')
225
-
226
- st.pyplot(fig)
227
- # Top Amenities Insights
228
- st.subheader("Plot Wise Analysis of Top Amenities")
229
- st.write("""
230
- **Common Amenities:**
231
- - Complimentary Parking is the most frequently offered amenity.
232
- - Basic Toiletries and Hair Dryers are also widely available.
233
-
234
- **Less Common Amenities:**
235
- - Fitness Center Access, Welcome Drinks, and Turndown Service are less common.
236
- - Shoe Shine Service is the least frequently offered amenity.
237
-
238
- **Summary:**
239
- Hotels tend to prioritize basic amenities like parking, toiletries, and hair dryers, while luxurious amenities are offered less frequently.
240
- """)
241
-
242
- # Streamlit app
243
- st.title("Bivariate Analysis")
244
-
245
- # Price vs Rating scatter plot
246
- st.subheader("Price vs Rating")
247
- fig, ax = plt.subplots(figsize=(7, 5))
248
- sns.scatterplot(x='rating', y='price', data=data, color='orange')
249
- ax.set_title('Price vs Rating')
250
- ax.set_xlabel('Rating')
251
- ax.set_ylabel('Price')
252
- st.pyplot(fig)
253
-
254
- # Price vs Rating
255
- st.subheader("Price vs Rating:")
256
- st.write("""
257
- - **Analysis:**
258
- - Higher-priced hotels tend to have slightly better ratings, but ratings vary widely across price points.
259
- - Hotels at various price points exhibit a large spread in ratings, meaning factors other than price (such as customer experience or amenities) contribute significantly to the rating.
260
- """)
261
-
262
- # Price vs Discount scatter plot
263
- st.subheader("Price vs Discount")
264
- fig, ax = plt.subplots(figsize=(7, 5))
265
- sns.scatterplot(x='discount', y='price', data=data, color='green')
266
- ax.set_title('Price vs Discount')
267
- ax.set_xlabel('Discount')
268
- ax.set_ylabel('Price')
269
- st.pyplot(fig)
270
- # Price vs Discount
271
- st.subheader("Price vs Discount:")
272
- st.write("""
273
- - **Analysis:**
274
- - Some high-priced hotels still provide discounts due to promotions or special deals.
275
- - This observation suggests that while premium hotels may not always need to offer discounts to attract customers, they occasionally use them as a marketing strategy or for seasonal promotions.
276
- """)
277
-
278
- # Price vs Cashback scatter plot
279
- st.subheader("Price vs Cashback")
280
- fig, ax = plt.subplots(figsize=(7, 5))
281
- sns.scatterplot(x='cashback', y='price', data=data, color='blue')
282
- ax.set_title('Price vs Cashback')
283
- ax.set_xlabel('Cashback')
284
- ax.set_ylabel('Price')
285
- st.pyplot(fig)
286
-
287
- # Price vs Cashback
288
- st.subheader("Price vs Cashback:")
289
- st.write("""
290
- - **Analysis:**
291
- - Exceptions exist due to promotional campaigns.
292
- - While higher-priced hotels generally offer fewer cashback incentives, some may offer cashback due to specific promotional campaigns aimed at increasing sales volume or attracting customers in a competitive market.
293
- """)
294
-
295
- # Price vs Category bar plot
296
- st.subheader("Price vs Category")
297
- fig, ax = plt.subplots(figsize=(7, 5))
298
- sns.barplot(x='category', y='price', data=data, palette='Set2')
299
- ax.set_title('Price vs Category')
300
- ax.set_xlabel('Category')
301
- ax.set_ylabel('Price')
302
- st.pyplot(fig)
303
- # Price vs Category
304
- st.subheader("Price vs Category:")
305
- st.write("""
306
- - **Analysis:**
307
- - "Luxury" hotels have the highest prices, followed by "Premium" and "Free & Easy."
308
- - "Low Budget" and "Budget" hotels occupy the lower price range, showing that the category directly influences pricing strategy.
309
- - Categories like "Luxury" and "Premium" aim to target a specific market willing to pay more for superior quality and services, while "Budget" and "Low Budget" cater to a price-sensitive segment.
310
- """)
311
- # Summary of Scatter Plot Analysis
312
- st.subheader("Summary of Scatter Plot Analysis:")
313
- st.write("""
314
- - **Price vs Rating:** Higher-priced hotels generally offer better ratings, but the ratings vary widely across price points, indicating that factors such as service quality and amenities matter significantly.
315
- - **Price vs Discount:** High-priced hotels may still offer discounts due to seasonal promotions or special offers.
316
- - **Price vs Cashback:** Although high-priced hotels generally offer fewer cashback incentives, there are exceptions driven by promotional campaigns.
317
- - **Price vs Category:** "Luxury" hotels are the most expensive, followed by "Premium" and "Free & Easy" categories. On the other hand, "Low Budget" and "Budget" hotels have lower prices.
318
- - **Overall Insight:** The scatter plots reveal trends where higher-priced hotels tend to offer better ratings but fewer discounts and cashback incentives, while lower-priced categories tend to provide more promotional benefits such as discounts and cashback.
319
- """)
320
-
321
- # Rating vs Category bar plot
322
- st.subheader("Rating vs Category")
323
- fig, ax = plt.subplots(figsize=(7, 5))
324
- sns.barplot(x='category', y='rating', data=data, palette='Set1')
325
- ax.set_title('Rating vs Category')
326
- ax.set_xlabel('Category')
327
- ax.set_ylabel('Rating')
328
- st.pyplot(fig)
329
- # Rating vs Category
330
- st.subheader("Rating vs Category:")
331
- st.write("""
332
- - **Analysis:**
333
- - "Luxury" hotels lead in average ratings, followed by "Premium" hotels.
334
- - "Budget" and "Low Budget" categories show lower average ratings, indicating that these hotels may focus on price rather than offering premium services or experiences.
335
- - The disparity in ratings shows that customers tend to have higher expectations for luxury and premium accommodations, which are reflected in the ratings.
336
- """)
337
-
338
- # Discount vs Category box plot
339
- st.subheader("Discount vs Category")
340
- fig, ax = plt.subplots(figsize=(7, 5))
341
- sns.boxplot(x='category', y='discount', data=data, palette='Set2')
342
- ax.set_title('Discount vs Category')
343
- ax.set_xlabel('Category')
344
- ax.set_ylabel('Discount')
345
- st.pyplot(fig)
346
- # Discount vs Category
347
- st.subheader("Discount vs Category:")
348
- st.write("""
349
- - **Analysis:**
350
- - "Low Budget" hotels offer the highest discounts, while "Luxury" hotels provide the least discounts.
351
- - This suggests that budget-friendly hotels use discounts as a key strategy to attract price-sensitive customers, whereas luxury hotels focus on providing premium experiences without relying on price reductions.
352
- - The discount strategy varies by category, with lower-priced categories incentivizing customers with discounts to stay competitive.
353
- """)
354
- # Cashback vs Category violin plot
355
- st.subheader("Cashback vs Category")
356
- fig, ax = plt.subplots(figsize=(7, 5))
357
- sns.violinplot(x='category', y='cashback', data=data, palette='Set3')
358
- ax.set_title('Cashback vs Category')
359
- ax.set_xlabel('Category')
360
- ax.set_ylabel('Cashback')
361
- st.pyplot(fig)
362
- # Cashback vs Category
363
- st.subheader("Cashback vs Category:")
364
- st.write("""
365
- - **Analysis:**
366
- - Higher cashback offers are more common in "Low Budget" hotels, as these hotels rely on cashback incentives to attract customers looking for value deals.
367
- - Luxury hotels rarely provide cashback, as their target market is less likely to be motivated by such offers.
368
- - The trend highlights the different strategies employed by each category: budget options often provide financial incentives like cashback to drive bookings, while luxury options focus on premium services.
369
- """)
370
-
371
- # Reviews vs Category count plot
372
- st.subheader("Reviews vs Category")
373
- fig, ax = plt.subplots(figsize=(7, 5))
374
- sns.barplot(x='category', y='reviews', data=data, palette='Set1', ax=ax)
375
- ax.set_title('Reviews vs Category (Count)')
376
- ax.set_xlabel('Category')
377
- ax.set_ylabel('Count of Reviews')
378
- st.pyplot(fig)
379
- # Reviews vs Category
380
- st.subheader("Reviews vs Category:")
381
- st.write("""
382
- - **Analysis:**
383
- - "Luxury" hotels attract the most reviews, indicating that higher-quality accommodations often receive more feedback from customers.
384
- - "Budget" and "Low Budget" hotels tend to have fewer reviews, which may be due to their more straightforward offerings and smaller customer base.
385
- - This trend suggests that customers who opt for luxury hotels are more likely to share their experiences, whereas budget options may attract fewer repeat customers or have less word-of-mouth influence.
386
- """)
387
- # Summary of Bar and Box Plot Analysis
388
- st.subheader("Summary of Bar and Box Plot Analysis:")
389
- st.write("""
390
- - **Rating vs Category:** "Luxury" and "Premium" hotels have higher ratings on average, while "Budget" and "Low Budget" hotels show lower ratings.
391
- - **Discount vs Category:** Budget hotels, especially "Low Budget," offer higher discounts, while luxury hotels offer fewer discounts, relying more on their value proposition.
392
- - **Cashback vs Category:** "Low Budget" hotels offer higher cashback incentives, while luxury hotels rarely provide cashback, highlighting the pricing strategies in different categories.
393
- - **Reviews vs Category:** "Luxury" hotels attract the most reviews, while "Budget" and "Low Budget" hotels attract fewer.
394
- - **Overall Insight:** Bar and box plots reveal that higher-rated and more reviewed hotels tend to offer fewer discounts or cashback, focusing on customer experience, while budget categories focus on providing customer incentives to compete in the market.
395
- """)
396
- # Regional price analysis by state
397
- st.subheader("Price by State")
398
- fig, ax = plt.subplots(figsize=(16, 6))
399
- sns.barplot(data=data, x='state', y='price', ax=ax, color='green')
400
- ax.set_title('Price by State')
401
- ax.tick_params(axis='x', rotation=90)
402
- sns.set_palette('magma')
403
- plt.tight_layout()
404
- st.pyplot(fig)
405
- # Hotel Prices Across Indian States
406
- st.subheader("Hotel Prices Across Indian States:")
407
- st.write("""
408
- - **Analysis:**
409
- - Hotel prices vary significantly across Indian states, reflecting regional differences in demand and supply.
410
- - States with popular tourist destinations, such as Goa and Rajasthan, tend to show higher hotel prices, as they attract more visitors and have a higher demand for accommodations.
411
- - Conversely, states with less tourism or lower demand may exhibit more affordable pricing, catering to the local population or budget travelers.
412
- - Price differences also reflect factors such as local economic conditions, infrastructure, and tourism policies in each state.
413
- """)
414
-
415
- # Regional category count by state
416
- st.subheader("Category by State")
417
- fig, ax = plt.subplots(figsize=(16, 6))
418
- sns.countplot(data=data, x='state', hue='category', ax=ax, palette='Set1')
419
- ax.set_title('Category by State')
420
- ax.tick_params(axis='x', rotation=90)
421
- plt.tight_layout()
422
- st.pyplot(fig)
423
- # Hotel Categories by State
424
- st.subheader("Hotel Categories by State:")
425
- st.write("""
426
- - **Analysis:**
427
- - States with a higher concentration of "Low Budget" and "Budget" hotels cater primarily to cost-conscious travelers, offering affordable accommodations for a wide range of customers.
428
- - States with more "Luxury" hotels are likely to be major tourist hubs, such as Delhi, Mumbai, and Kerala, or regions that cater to premium audiences, offering high-end services for affluent customers.
429
- - These states may also focus on attracting international tourists or business travelers who prefer premium amenities and luxury experiences.
430
- """)
431
-
432
- # Summary of Regional Price and Category Trends
433
- st.subheader("Summary of Regional Price and Category Trends:")
434
- st.write("""
435
- - **Hotel Prices Across Indian States:** Prices vary significantly depending on the state, with tourist-heavy regions showing higher prices due to greater demand.
436
- - **Hotel Categories by State:** States with more budget hotels focus on catering to price-sensitive travelers, while states with luxury hotels cater to premium audiences, often in tourist hotspots.
437
- - **Overall Insight:** Regional trends indicate diverse pricing and category distributions, influenced by tourism, regional economics, and state-specific factors that shape hotel offerings across the country.
438
- """)
439
-
440
- st.title("Multivariate Analysis of Hotel Data")
441
-
442
- # Create a subset of the data for the analysis
443
- subset_data = data[['category', 'price', 'reviews', 'discount', 'cashback', 'rating']]
444
-
445
- # Section 1: Price vs. Reviews by Category
446
- st.header("Price vs. Reviews by Category")
447
- fig1 = sns.catplot(data=data, x='reviews', y='price', hue='category', kind='strip', palette='Set2', height=6, aspect=1.5)
448
- fig1.set_axis_labels("Reviews", "Price")
449
- fig1.fig.suptitle('Price vs Reviews by Category', fontsize=16)
450
- st.pyplot(fig1)
451
- # Analysis Text for Price vs. Reviews by Category
452
- st.write("""
453
- - **Price Variation within Categories:**
454
- - Wide price ranges exist within each category, with "Low Budget" hotels featuring both low- and high-priced options.
455
- - This shows that pricing within categories isn't always uniform and may depend on other factors like location, amenities, and hotel size.
456
-
457
- - **Price and Reviews Relationship:**
458
- - There's a slight tendency for hotels with more reviews to have higher prices, possibly due to the influence of popularity, better marketing efforts, or higher quality services.
459
-
460
- - **Summary:**
461
- - The stripplot reveals a weak positive correlation between price and reviews, indicating that well-reviewed hotels tend to have higher prices.
462
- """)
463
-
464
- # Section 2: Price vs. Discount by Category
465
- st.header("Price vs. Discount by Category")
466
- fig2 = sns.catplot(data=data, x='discount', y='price', hue='category', kind='bar', palette='Set2', height=6, aspect=1.5)
467
- fig2.set_axis_labels("Discount", "Price")
468
- fig2.fig.suptitle('Price vs Discount by Category', fontsize=16)
469
- st.pyplot(fig2)
470
- # Analysis Text for Price vs. Discount by Category
471
- st.write("""
472
- - **Price and Discount Relationship:**
473
- - The bar plot shows that as hotel prices increase, discounts tend to decrease. This is consistent with a negative correlation between price and discount.
474
-
475
- - **Category-Specific Trends:**
476
- - "Low Budget" and "Budget" hotels offer much higher discounts compared to "Premium" and "Luxury" hotels, which typically offer fewer or smaller discounts.
477
- - This trend highlights that budget-conscious categories use higher discounts to attract customers, whereas premium and luxury categories rely on factors other than discounts (e.g., quality, exclusivity) to appeal to their clientele.
478
-
479
- - **Summary:**
480
- - The plot confirms that lower-priced hotels use higher discounts to attract customers, while premium and luxury hotels maintain lower discount rates, aligning with typical market behavior.
481
- """)
482
-
483
-
484
- # Section 3: Price vs Cashback and Rating by Category (Stripplot)
485
- st.header("Price vs Cashback and Rating by Category")
486
- fig3, axes2 = plt.subplots(1, 2, figsize=(16, 6))
487
-
488
- sns.stripplot(data=data, x='cashback', y='price', hue='category', ax=axes2[0], palette='Set2', jitter=True, dodge=True)
489
- axes2[0].set_title('Price vs Cashback by Category')
490
-
491
- sns.stripplot(data=data, x='rating', y='price', hue='category', ax=axes2[1], palette='Set2', jitter=True, dodge=True)
492
- axes2[1].set_title('Price vs Rating by Category')
493
-
494
- st.pyplot(fig3)
495
-
496
-
497
- # Analysis Text for Price vs Cashback by Category
498
- st.write("""
499
- - **Price vs Cashback by Category:**
500
- - The plot shows that cashback incentives tend to decrease as hotel prices increase. However, variations exist due to promotional offers that may affect cashback amounts.
501
- - Lower-priced categories, such as "Budget" and "Low Budget," offer higher cashback incentives to attract price-sensitive customers.
502
-
503
- - **Summary:**
504
- - The stripplot reveals a clear trend where lower-priced hotels use cashback as an incentive to boost bookings, whereas higher-priced hotels focus on other value propositions (e.g., premium services).
505
- """)
506
- # Analysis Text for Price vs Rating by Category
507
- st.write("""
508
- - **Price vs Rating by Category:**
509
- - Higher-priced categories like "Premium" and "Luxury" tend to have better ratings, indicating that customers perceive these hotels as providing superior value.
510
- - Interestingly, some lower-priced hotels achieve high ratings, suggesting that other factors such as service quality and customer experience may contribute to higher satisfaction despite the lower price point.
511
-
512
- - **Summary:**
513
- - The stripplot shows that while higher-priced hotels emphasize quality to achieve better ratings, lower-priced hotels still manage to deliver satisfactory experiences for customers through factors like service quality.
514
- """)
515
- # Section 4: Correlation Heatmap
516
- st.header("Correlation Matrix Heatmap")
517
- numeric_columns = ['price', 'reviews', 'discount', 'cashback', 'rating']
518
-
519
- # Compute the correlation matrix
520
- correlation_matrix = data[numeric_columns].corr()
521
-
522
- # Create a heatmap to visualize the correlation matrix
523
- plt.figure(figsize=(10, 8))
524
- sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
525
-
526
- # Set title for the plot
527
- plt.title('Correlation Matrix Heatmap')
528
-
529
- # Display the plot
530
- st.pyplot(plt.gcf())
531
-
532
- # Analysis Text for Correlation Heatmap
533
- st.write("""
534
- - **Price vs Discount/Cashback:**
535
- - The heatmap shows a strong negative correlation between price and both discount and cashback. Higher-priced hotels tend to offer fewer discounts and cashback incentives, suggesting that premium offerings rely on value rather than incentives.
536
-
537
- - **Price vs Rating:**
538
- - A weak positive correlation is observed between price and rating. Higher-priced hotels generally have slightly better ratings, although this relationship is not very strong.
539
-
540
- - **Reviews vs Rating:**
541
- - A moderate positive correlation exists between reviews and ratings. Hotels that attract more reviews tend to have better ratings, likely due to a larger customer base providing feedback.
542
-
543
- - **Reviews vs Price:**
544
- - A weak positive correlation between reviews and price suggests that well-reviewed hotels are often priced higher, likely due to their popularity and perceived value.
545
-
546
- - **Summary:**
547
- - The heatmap provides insights into the relationships between key variables in the dataset. Higher-priced hotels tend to offer fewer discounts and cashback offers but generally have better ratings and more reviews, indicating that customers are willing to pay a premium for a better experience.
548
- """)
549
-
550
- st.header("Overall Summary")
551
- st.write("""
552
- - Most properties are affordable, with lower prices, cashback, and discounts dominating the dataset.
553
- - Regional distribution shows states like Maharashtra and Madhya Pradesh having more hotels.
554
- - The data reflects a market focused on affordability and basic amenities, with regional and category-specific variations.
555
- - Cancellation policies and review counts provide further insight into customer behavior, while the right-skewed distributions of price, cashback, discount, and reviews reveal long tails of high values in pricing and incentives.
556
- """)
557
-
558
- st.header("Why Right-Skewed Trends Are Normal, Not Outliers")
559
- st.write("""
560
- - Right-skewed distributions for price, cashback, discounts, and review counts are normal for this market.
561
- - High values are infrequent but genuine observations, so they belong to the expected distribution and are not treated as outliers.
562
- - The variations in cancellation patterns and review counts reflect typical customer behavior and industry dynamics.
563
- """)
564
- st.write("""
565
- Since the high values are treated as genuine observations rather than outliers, we can proceed with model training and selection.
566
- With clean data, we can now focus on choosing the best algorithm, tuning hyperparameters, and evaluating model performance.
567
- """)
568
-
569
-
570
- else:
571
- st.warning("No dataset found in session state. Please load the dataset into `st.session_state['dataset']`.")
572
-
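Review note on the `else` branch: the page reads the dataset with `st.session_state.get("dataset")` and names a session-state key again in the warning text, so keeping that key in a single constant prevents the two from drifting apart. The sketch below is one possible arrangement, not the author's implementation; `DATASET_KEY` and `load_dataset` are hypothetical names, and the function accepts any mapping with a `.get` method so it can be exercised with a plain dict as well as with `st.session_state`.

```python
DATASET_KEY = "dataset"  # hypothetical constant; must match the key the upload page writes


def load_dataset(state, key=DATASET_KEY):
    """Return (data, message) so lookup and user-facing text share one key."""
    data = state.get(key)
    if data is None:
        message = (
            "No dataset found in session state. "
            f"Please load the dataset into `st.session_state['{key}']`."
        )
        return None, message
    return data, "Dataset successfully loaded from session state!"


# Exercised with plain dicts; in the app, `state` would be st.session_state.
missing, warning_text = load_dataset({})
found, success_text = load_dataset({"dataset": [{"price": 2500}]})
print(warning_text)
```

In the page itself the returned message would be passed to `st.warning` or `st.success` depending on whether the data is `None`.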