# Hotel_Data_Analysis/pages/3_EDA and Feature Engineering.py
import streamlit as st
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
import sys
from imblearn.over_sampling import SMOTE
st.markdown("<h1 style='text-align:center; color:lime;'>EDA and Feature Engineering</h1>",unsafe_allow_html=True)
# Define the URL of the background image (use your own image URL)
background_image_url = "https://cdn-uploads.huggingface.co/production/uploads/675fab3a2d0851e23d23cad3/vulm4WwHmmA14tsVXYaTM.jpeg"
# Apply custom CSS for the background image and overlay
st.markdown(
f"""
<style>
.stApp {{
background-image: url("{background_image_url}");
background-size: auto; /* Keep the image at its intrinsic size */
background-repeat: repeat-y; /* Repeat only vertically */
background-position: top center; /* Start repeating from the top center */
background-attachment: fixed; /* Keeps the background fixed as you scroll */
}}
/* Semi-transparent overlay */
.stApp::before {{
content: "";
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
background: rgba(0, 0, 0, 0.4); /* Adjust opacity here (0.4 makes the overlay 40% opaque) */
z-index: -1;
}}
/* Container to center elements and limit width */
.content-container {{
max-width: 70%; /* Limit content width to 70% */
margin: 0 auto; /* Center the container */
padding: 50px; /* Add some padding for spacing */
}}
/* Styling the markdown content */
.stMarkdown {{
color: white; /* White text to ensure visibility */
font-size: 100px; /* Adjust font size for readability */
}}
</style>
""",
unsafe_allow_html=True
)
# Title of the Streamlit app
st.title("Exploratory Data Analysis (EDA) on Agoda Hotel Dataset")
# Introduction and Aim
st.header("Aim of the EDA")
st.write("""
The main objective of this EDA is to analyze Agoda's hotel dataset to identify key factors influencing hotel pricing strategies and customer booking preferences.
The analysis will focus on uncovering patterns, trends, and relationships in hotel ratings, pricing structures, discounts, and free services.
By leveraging these insights, Agoda can optimize its pricing strategy, predict booking preferences, and enhance revenue generation while maintaining customer satisfaction.
""")
# Description of the Data
st.header("Description of the Data")
st.write("""
**Overall Summary:** We are analyzing the Agoda dataset by performing EDA and Statistical Tests on the data that has already been cleaned through data wrangling to address any messiness or missing information.
**Table - Agoda_df:** The cleaned dataset of hotel listings, which is the basis for the pricing analysis.
**Dataset Details:**
The dataset contains information about 3,219 hotel room listings with 12 features, each detailing aspects of the listing. Below is the description of each column:
| Column Name | Description |
|-----------------|---------------------------------------------------------------------------|
| hotel_name | Name of the hotel. |
| rating | Average customer rating of the hotel (float, range 1-5). |
| location | Address or locality of the hotel. |
| review_text | Customer feedback or comments about the hotel. |
| reviews | Total number of customer reviews for the hotel. |
| cashback | Cashback amount offered for the booking. |
| discount | Discount percentage applied to the room price. |
| free_services | Free services provided (e.g., breakfast, Wi-Fi). |
| cancellation | Cancellation policy for the booking (e.g., free, non-refundable). |
| price | Price of the room after discounts and cashback (float). |
| state | The state where the hotel is located. |
| category | Target variable representing the room type or category (e.g., budget, luxury). |
""")
# Table-wise EDA & Necessary Tests
st.header("Table-wise EDA and Necessary Statistical Tests")
st.write("""
**Agoda_df:** Cleaned dataset with hotel details and key features like ratings, price, reviews, cashback, discounts, and free services.
The EDA will involve the following steps:
- **Summary Statistics:** Analyze the central tendency, spread, and shape of the distribution of each feature.
- **Data Distribution:** Visualize the distribution of key features like price, ratings, reviews, cashback, etc.
- **Correlation Analysis:** Analyze relationships between numeric features like price, ratings, reviews, cashback, etc.
- **Categorical Data Analysis:** Explore categorical variables like hotel category, cancellation policy, state, and location using frequency tables and visualizations.
- **Missing Value Analysis:** Ensure no missing values remain, and check the need for imputations.
- **Outlier Detection:** Identify any outliers that may skew the analysis or predictions.
- **Statistical Tests:** Apply appropriate statistical tests to identify significant differences or relationships (e.g., t-tests for comparing means, chi-squared for categorical variables).
""")
# Placeholder for further detailed code or visualizations
st.write("Further steps will include generating visualizations and statistical tests to explore relationships between features in more detail.")
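# A minimal sketch of the chi-squared test listed above, assuming two categorical
# columns such as `category` and `cancellation`. The helper name
# `chi_squared_statistic` is hypothetical (not part of the original app). It
# returns the Pearson statistic and degrees of freedom only, without a p-value
# or continuity correction.
import pandas as pd  # already imported at the top; repeated so the sketch stands alone


def chi_squared_statistic(frame, col_a, col_b):
    """Return (chi2, dof) for the contingency table of two categorical columns."""
    observed = pd.crosstab(frame[col_a], frame[col_b]).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: (row total * column total) / grand total
    expected = row_totals @ col_totals / observed.sum()
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof


# Possible usage once the data is loaded (column names taken from the table above):
# chi2, dof = chi_squared_statistic(data, "category", "cancellation")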
# Check if the cleaned data exists in session state
if 'cleaned_data' in st.session_state:
data = st.session_state.cleaned_data
# Display the cleaned data passed on from the data-cleaning page
st.subheader("Cleaned Data from the Data Cleaning Page:")
st.write(data)
st.write("""
### **Dataset Information**
The cleaned dataset has been successfully accessed from the session state.
""")
st.subheader("Dataset Preview:")
st.write(data.head())  # Display the first 5 rows
st.subheader("Info of the Dataset:")
# Redirect the output of data.info() to a string buffer
buffer = StringIO()
data.info(buf=buffer)
# Display the captured text in a monospaced block
st.text(buffer.getvalue())
st.subheader("Dataset Statistics:")
st.write(data.describe())
st.subheader("Dataset Shape (Rows, Columns):")
st.write(data.shape)
data = data[data["price"] <= 40000].copy()  # Keep rows priced at or below 40,000; copy so later column assignments do not act on a view
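# The fixed 40,000 cutoff above is one way to trim the price tail. As a hedged
# alternative, Tukey's IQR rule flags values outside [Q1 - k*IQR, Q3 + k*IQR].
# `iqr_outlier_mask` is a hypothetical helper, and k=1.5 is the conventional
# default rather than a value taken from this project.
def iqr_outlier_mask(series, k=1.5):
    """Return a boolean Series that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Possible usage: count flagged prices without dropping them
# n_flagged = int(iqr_outlier_mask(data["price"]).sum())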
### univariate_analysis.
st.success("Dataset successfully loaded from session state!")
st.subheader("Univariate Analysis")
# Rating and Review Text Distribution
st.subheader("Rating and Review Text Distribution")
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
data["rating"].value_counts().plot(kind='pie', title='Distribution of Ratings', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[0])
axs[0].set_title("Distribution of Ratings")
data['review_text'].value_counts().plot(kind='pie', title='Distribution of Review Text', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[1])
axs[1].set_title("Distribution of Review Text")
st.pyplot(fig)
# Hotel Star Insights
st.write("""
**Insight:**
- The majority of hotels in this data are 3-star hotels.
- 4-star and 5-star hotels are also fairly common.
- 1-star and 2-star hotels are less frequent.
""")
# Price, Cashback, and Discount Distribution
st.subheader("Price, Cashback, and Discount Distribution")
fig, axs = plt.subplots(1, 3, figsize=(16, 6))
sns.histplot(data=data, x='price', color='green', kde=True, ax=axs[0])
axs[0].set_title("Count based on price")
axs[0].set_xlabel('Price')
axs[0].set_ylabel('Number of Hotels')
sns.histplot(data=data, x='cashback', color='violet', kde=True, ax=axs[1])
axs[1].set_title("Count based on cashback")
axs[1].set_xlabel('Cashback')
axs[1].set_ylabel('Number of Hotels')
sns.histplot(data=data, x='discount', color='orange', kde=True, ax=axs[2])
axs[2].set_title("Count based on discount")
axs[2].set_xlabel('Discount')
axs[2].set_ylabel('Number of Hotels')
st.pyplot(fig)
# Histogram Insights
st.subheader("Plot-wise Analysis of Histograms")
st.write("""
**Price Distribution Insight:**
- The histogram is right-skewed, showing most properties are in the lower price range.
- A long tail indicates the presence of a few very expensive properties.
**Cashback Distribution Insight:**
- The histogram is right-skewed, with the majority of properties offering lower cashback amounts.
- Only a small number of properties provide higher cashback.
**Discount Distribution Insight:**
- The histogram is right-skewed, indicating that most properties offer lower discount percentages.
- A few properties stand out with higher discounts.
**Summary:**
The data suggest that Agoda properties are generally affordable, with lower cashback and discount offers being common. Further statistical analysis could help uncover more detailed insights.
""")
# Cancellation and State Distribution
st.subheader("Cancellation and State Distribution")
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
data["cancellation"].value_counts().plot(kind='bar', title='Distribution of Cancellation', color='red', ax=axs[0])
axs[0].set_title("Distribution of Cancellation")
axs[0].set_xlabel('Cancellation')
axs[0].set_ylabel('Number of Hotels')
data["state"].value_counts().plot(kind='bar', title='Distribution of State', color='black', ax=axs[1])
axs[1].set_title("Distribution of State")
axs[1].set_xlabel('State')
axs[1].set_ylabel('Number of Hotels')
st.pyplot(fig)
# Bar Chart Insights
st.subheader("Plot Wise Analysis of Bar Charts")
st.write("""
**Cancellations:**
- Most listings fall under cancellation category "1," so a single cancellation policy dominates the dataset.
**State Distribution:**
- "Maharashtra" has the highest number of hotels, followed by "Madhya Pradesh."
- Other states like Gujarat, Karnataka, and Kerala also have notable hotel counts.
- The distribution is uneven, with some states having significantly more hotels.
**Summary:**
The charts highlight cancellation trends and the regional hotel distribution in India.
""")
# Category and Reviews Distribution
st.subheader("Category and Reviews Distribution")
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
colors = sns.color_palette('Set2', n_colors=len(data["category"].value_counts()))
data["category"].value_counts().plot(kind='bar', ax=axs[0], color=colors)
axs[0].set_title("Distribution of Category")
axs[0].set_xlabel('Category')
axs[0].set_ylabel('Number of Hotels')
sns.histplot(data=data, x='reviews', color='violet', kde=True, ax=axs[1])
axs[1].set_title("Count based on Reviews")
axs[1].set_xlabel('Reviews')
axs[1].set_ylabel('Number of Hotels')
st.pyplot(fig)
# Hotel Categories and Reviews Insights
st.subheader("Plot Wise Analysis of Hotel Categories and Reviews")
st.write("""
**Category Distribution:**
- The bar chart shows "Low Budget" hotels are the most common, followed by "Budget Hotels," while "Luxury Hotels" are the least common.
**Review Count Distribution:**
- The histogram is right-skewed, with most hotels having a low number of reviews.
- A few hotels have a very high number of reviews, evident from the long tail.
**Summary:**
The data indicates a higher concentration of low-budget hotels and relatively low review counts for most hotels.
""")
if 'free_services' in data.columns:
# Convert non-string values to strings and handle missing values
data['free_services'] = data['free_services'].fillna('').astype(str)
# Perform the string operations
amenity_counts = (
data['free_services']
.str.split(',') # Split the strings by commas
.explode() # Flatten the lists into individual rows
.str.strip() # Remove leading/trailing spaces
.value_counts() # Count occurrences
.reset_index() # Convert to a DataFrame
)
amenity_counts.columns = ['Amenity', 'Count'] # Rename columns for clarity
st.write(amenity_counts)
else:
st.write("The 'free_services' column does not exist in the dataset.")
# Top Amenities Insights
st.subheader("Plot Wise Analysis of Top Amenities")
st.write("""
**Common Amenities:**
- Complimentary Parking is the most frequently offered amenity.
- Basic Toiletries and Hair Dryers are also widely available.
**Less Common Amenities:**
- Fitness Center Access, Welcome Drinks, and Turndown Service are less common.
- Shoe Shine Service is the least frequently offered amenity.
**Summary:**
Hotels tend to prioritize basic amenities like parking, toiletries, and hair dryers, while luxurious amenities are offered less frequently.
""")
st.title("Bivariate Analysis")
# Price vs Rating scatter plot
st.subheader("Price vs Rating")
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x='rating', y='price', data=data, color='orange')
ax.set_title('Price vs Rating')
ax.set_xlabel('Rating')
ax.set_ylabel('Price')
st.pyplot(fig)
# Price vs Rating
st.subheader("Price vs Rating:")
st.write("""
- **Analysis:**
- Higher-priced hotels tend to have slightly better ratings, but ratings vary widely across price points.
- Hotels at various price points exhibit a large spread in ratings, meaning factors other than price (such as customer experience or amenities) contribute significantly to the rating.
""")
# Price vs Discount scatter plot
st.subheader("Price vs Discount")
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x='discount', y='price', data=data, color='green')
ax.set_title('Price vs Discount')
ax.set_xlabel('Discount')
ax.set_ylabel('Price')
st.pyplot(fig)
# Price vs Discount
st.subheader("Price vs Discount:")
st.write("""
- **Analysis:**
- Some high-priced hotels still provide discounts due to promotions or special deals.
- This observation suggests that while premium hotels may not always need to offer discounts to attract customers, they occasionally use them as a marketing strategy or for seasonal promotions.
""")
# Price vs Cashback scatter plot
st.subheader("Price vs Cashback")
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x='cashback', y='price', data=data, color='blue')
ax.set_title('Price vs Cashback')
ax.set_xlabel('Cashback')
ax.set_ylabel('Price')
st.pyplot(fig)
# Price vs Cashback
st.subheader("Price vs Cashback:")
st.write("""
- **Analysis:**
- Higher-priced hotels generally offer fewer cashback incentives.
- Exceptions exist: some higher-priced hotels offer cashback through promotional campaigns aimed at increasing sales volume or attracting customers in a competitive market.
""")
# Price vs Category bar plot
st.subheader("Price vs Category")
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x='category', y='price', data=data, palette='Set2')
ax.set_title('Price vs Category')
ax.set_xlabel('Category')
ax.set_ylabel('Price')
st.pyplot(fig)
# Price vs Category
st.subheader("Price vs Category:")
st.write("""
- **Analysis:**
- "Luxury" hotels have the highest prices, followed by "Premium" and "Free & Easy."
- "Low Budget" and "Budget" hotels occupy the lower price range, showing that the category directly influences pricing strategy.
- Categories like "Luxury" and "Premium" aim to target a specific market willing to pay more for superior quality and services, while "Budget" and "Low Budget" cater to a price-sensitive segment.
""")
# Summary of Scatter Plot Analysis
st.subheader("Summary of Scatter Plot Analysis:")
st.write("""
- **Price vs Rating:** Higher-priced hotels generally offer better ratings, but the ratings vary widely across price points, indicating that factors such as service quality and amenities matter significantly.
- **Price vs Discount:** High-priced hotels may still offer discounts due to seasonal promotions or special offers.
- **Price vs Cashback:** Although high-priced hotels generally offer fewer cashback incentives, there are exceptions driven by promotional campaigns.
- **Price vs Category:** "Luxury" hotels are the most expensive, followed by "Premium" and "Free & Easy" categories. On the other hand, "Low Budget" and "Budget" hotels have lower prices.
- **Overall Insight:** The scatter plots reveal trends where higher-priced hotels tend to offer better ratings but fewer discounts and cashback incentives, while lower-priced categories tend to provide more promotional benefits such as discounts and cashback.
""")
# Rating vs Category bar plot
st.subheader("Rating vs Category")
fig, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x='category', y='rating', data=data, palette='Set1')
ax.set_title('Rating vs Category')
ax.set_xlabel('Category')
ax.set_ylabel('Rating')
st.pyplot(fig)
# Rating vs Category
st.subheader("Rating vs Category:")
st.write("""
- **Analysis:**
- "Luxury" hotels lead in average ratings, followed by "Premium" hotels.
- "Budget" and "Low Budget" categories show lower average ratings, indicating that these hotels may focus on price rather than offering premium services or experiences.
- The disparity in ratings shows that customers tend to have higher expectations for luxury and premium accommodations, which are reflected in the ratings.
""")
# Discount vs Category box plot
st.subheader("Discount vs Category")
fig, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(x='category', y='discount', data=data, palette='Set2')
ax.set_title('Discount vs Category')
ax.set_xlabel('Category')
ax.set_ylabel('Discount')
st.pyplot(fig)
# Discount vs Category
st.subheader("Discount vs Category:")
st.write("""
- **Analysis:**
- "Low Budget" hotels offer the highest discounts, while "Luxury" hotels provide the least discounts.
- This suggests that budget-friendly hotels use discounts as a key strategy to attract price-sensitive customers, whereas luxury hotels focus on providing premium experiences without relying on price reductions.
- The discount strategy varies by category, with lower-priced categories incentivizing customers with discounts to stay competitive.
""")
# Cashback vs Category violin plot
st.subheader("Cashback vs Category")
fig, ax = plt.subplots(figsize=(7, 5))
sns.violinplot(x='category', y='cashback', data=data, palette='Set3')
ax.set_title('Cashback vs Category')
ax.set_xlabel('Category')
ax.set_ylabel('Cashback')
st.pyplot(fig)
# Cashback vs Category
st.subheader("Cashback vs Category:")
st.write("""
- **Analysis:**
- Higher cashback offers are more common in "Low Budget" hotels, as these hotels rely on cashback incentives to attract customers looking for value deals.
- Luxury hotels rarely provide cashback, as their target market is less likely to be motivated by such offers.
- The trend highlights the different strategies employed by each category: budget options often provide financial incentives like cashback to drive bookings, while luxury options focus on premium services.
""")
# Reviews vs Category count plot
st.subheader("Reviews vs Category")
fig, ax = plt.subplots(figsize=(7, 5))
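# Sum review counts per category; a countplot would count hotels, not reviews
reviews_by_category = data.groupby('category', as_index=False)['reviews'].sum()
sns.barplot(x='category', y='reviews', data=reviews_by_category, palette='Set1', ax=ax)
ax.set_title('Reviews vs Category (Total)')
ax.set_xlabel('Category')
ax.set_ylabel('Total Number of Reviews')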
st.pyplot(fig)
# Reviews vs Category
st.subheader("Reviews vs Category:")
st.write("""
- **Analysis:**
- "Luxury" hotels attract the most reviews, indicating that higher-quality accommodations often receive more feedback from customers.
- "Budget" and "Low Budget" hotels tend to have fewer reviews, which may be due to their more straightforward offerings and smaller customer base.
- This trend suggests that customers who opt for luxury hotels are more likely to share their experiences, whereas budget options may attract fewer repeat customers or have less word-of-mouth influence.
""")
# Summary of Bar and Box Plot Analysis
st.subheader("Summary of Bar and Box Plot Analysis:")
st.write("""
- **Rating vs Category:** "Luxury" and "Premium" hotels have higher ratings on average, while "Budget" and "Low Budget" hotels show lower ratings.
- **Discount vs Category:** Budget hotels, especially "Low Budget," offer higher discounts, while luxury hotels offer fewer discounts, relying more on their value proposition.
- **Cashback vs Category:** "Low Budget" hotels offer higher cashback incentives, while luxury hotels rarely provide cashback, highlighting the pricing strategies in different categories.
- **Reviews vs Category:** "Luxury" hotels attract the most reviews, while "Budget" and "Low Budget" hotels attract fewer.
- **Overall Insight:** Bar and box plots reveal that higher-rated and more reviewed hotels tend to offer fewer discounts or cashback, focusing on customer experience, while budget categories focus on providing customer incentives to compete in the market.
""")
# Regional price analysis by state
st.subheader("Price by State")
fig, ax = plt.subplots(figsize=(16, 6))
sns.barplot(data=data, x='state', y='price', ax=ax, color='green')
ax.set_title('Price by State')
ax.tick_params(axis='x', rotation=90)
sns.set_palette('magma')
plt.tight_layout()
st.pyplot(fig)
# Hotel Prices Across Indian States
st.subheader("Hotel Prices Across Indian States:")
st.write("""
- **Analysis:**
- Hotel prices vary significantly across Indian states, reflecting regional differences in demand and supply.
- States with popular tourist destinations, such as Goa and Rajasthan, tend to show higher hotel prices, as they attract more visitors and have a higher demand for accommodations.
- Conversely, states with less tourism or lower demand may exhibit more affordable pricing, catering to the local population or budget travelers.
- Price differences also reflect factors such as local economic conditions, infrastructure, and tourism policies in each state.
""")
# Regional category count by state
st.subheader("Category by State")
fig, ax = plt.subplots(figsize=(16, 6))
sns.countplot(data=data, x='state', hue='category', ax=ax, palette='Set1')
ax.set_title('Category by State')
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
st.pyplot(fig)
# Hotel Categories by State
st.subheader("Hotel Categories by State:")
st.write("""
- **Analysis:**
- States with a higher concentration of "Low Budget" and "Budget" hotels cater primarily to cost-conscious travelers, offering affordable accommodations for a wide range of customers.
- States with more "Luxury" hotels are likely to be major tourist hubs, such as Delhi, Mumbai, and Kerala, or regions that cater to premium audiences, offering high-end services for affluent customers.
- These states may also focus on attracting international tourists or business travelers who prefer premium amenities and luxury experiences.
""")
# Summary of Regional Price and Category Trends
st.subheader("Summary of Regional Price and Category Trends:")
st.write("""
- **Hotel Prices Across Indian States:** Prices vary significantly depending on the state, with tourist-heavy regions showing higher prices due to greater demand.
- **Hotel Categories by State:** States with more budget hotels focus on catering to price-sensitive travelers, while states with luxury hotels cater to premium audiences, often in tourist hotspots.
- **Overall Insight:** Regional trends indicate diverse pricing and category distributions, influenced by tourism, regional economics, and state-specific factors that shape hotel offerings across the country.
""")
st.title("Multivariate Analysis of Hotel Data")
# Create a subset of the data for the analysis
subset_data = data[['category', 'price', 'reviews', 'discount', 'cashback', 'rating']]
# Section 1: Price vs. Reviews by Category
st.header("Price vs. Reviews by Category")
fig1 = sns.catplot(data=data, x='reviews', y='price', hue='category', kind='strip', palette='Set2', height=6, aspect=1.5)
fig1.set_axis_labels("Reviews", "Price")
fig1.fig.suptitle('Price vs Reviews by Category', fontsize=16)
st.pyplot(fig1.fig)  # catplot returns a FacetGrid; pass its underlying Figure
# Analysis Text for Price vs. Reviews by Category
st.write("""
- **Price Variation within Categories:**
- Wide price ranges exist within each category, with "Low Budget" hotels featuring both low- and high-priced options.
- This shows that pricing within categories isn't always uniform and may depend on other factors like location, amenities, and hotel size.
- **Price and Reviews Relationship:**
- There's a slight tendency for hotels with more reviews to have higher prices, possibly due to the influence of popularity, better marketing efforts, or higher quality services.
- **Summary:**
- The stripplot reveals a weak positive correlation between price and reviews, indicating that well-reviewed hotels tend to have higher prices.
""")
# Section 2: Price vs. Discount by Category
st.header("Price vs. Discount by Category")
fig2 = sns.catplot(data=data, x='discount', y='price', hue='category', kind='bar', palette='Set2', height=6, aspect=1.5)
fig2.set_axis_labels("Discount", "Price")
fig2.fig.suptitle('Price vs Discount by Category', fontsize=16)
st.pyplot(fig2.fig)  # catplot returns a FacetGrid; pass its underlying Figure
# Analysis Text for Price vs. Discount by Category
st.write("""
- **Price and Discount Relationship:**
- The bar plot shows that as hotel prices increase, discounts tend to decrease, which points to a negative relationship between price and discount.
- **Category-Specific Trends:**
- "Low Budget" and "Budget" hotels offer much higher discounts compared to "Premium" and "Luxury" hotels, which typically offer fewer or smaller discounts.
- This trend highlights that budget-conscious categories use higher discounts to attract customers, whereas premium and luxury categories rely on factors other than discounts (e.g., quality, exclusivity) to appeal to their clientele.
- **Summary:**
- The plot confirms that lower-priced hotels use higher discounts to attract customers, while premium and luxury hotels maintain lower discount rates, aligning with typical market behavior.
""")
# Section 3: Price vs Cashback and Rating by Category (Stripplot)
st.header("Price vs Cashback and Rating by Category")
fig3, axes2 = plt.subplots(1, 2, figsize=(16, 6))
sns.stripplot(data=data, x='cashback', y='price', hue='category', ax=axes2[0], palette='Set2', jitter=True, dodge=True)
axes2[0].set_title('Price vs Cashback by Category')
sns.stripplot(data=data, x='rating', y='price', hue='category', ax=axes2[1], palette='Set2', jitter=True, dodge=True)
axes2[1].set_title('Price vs Rating by Category')
st.pyplot(fig3)
# Analysis Text for Price vs Cashback by Category
st.write("""
- **Price vs Cashback by Category:**
- The plot shows that cashback incentives tend to decrease as hotel prices increase. However, variations exist due to promotional offers that may affect cashback amounts.
- Lower-priced categories, such as "Budget" and "Low Budget," offer higher cashback incentives to attract price-sensitive customers.
- **Summary:**
- The stripplot reveals a clear trend where lower-priced hotels use cashback as an incentive to boost bookings, whereas higher-priced hotels focus on other value propositions (e.g., premium services).
""")
# Analysis Text for Price vs Rating by Category
st.write("""
- **Price vs Rating by Category:**
- Higher-priced categories like "Premium" and "Luxury" tend to have better ratings, indicating that customers perceive these hotels as providing superior value.
- Interestingly, some lower-priced hotels achieve high ratings, suggesting that other factors such as service quality and customer experience may contribute to higher satisfaction despite the lower price point.
- **Summary:**
- The stripplot shows that while higher-priced hotels emphasize quality to achieve better ratings, lower-priced hotels still manage to deliver satisfactory experiences for customers through factors like service quality.
""")
# Section 4: Correlation Heatmap
st.header("Correlation Matrix Heatmap")
numeric_columns = ['price', 'reviews', 'discount', 'cashback', 'rating']
# Compute the correlation matrix
correlation_matrix = data[numeric_columns].corr()
# Create a heatmap to visualize the correlation matrix
fig4, ax4 = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1, ax=ax4)
# Set title for the plot
ax4.set_title('Correlation Matrix Heatmap')
# Display the plot (pass the Figure, not the pyplot module)
st.pyplot(fig4)
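# To read the heatmap numerically, the sketch below lists each pair of numeric
# columns once, ordered by absolute correlation. `ranked_correlations` is a
# hypothetical helper name, not part of the original app.
import numpy as np  # already imported at the top; repeated so the sketch stands alone


def ranked_correlations(corr):
    """Return the upper-triangle pairs of a correlation matrix, strongest first."""
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Possible usage:
# st.write(ranked_correlations(correlation_matrix))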
# Analysis Text for Correlation Heatmap
st.write("""
- **Price vs Discount/Cashback:**
- The heatmap shows a strong negative correlation between price and both discount and cashback. Higher-priced hotels tend to offer fewer discounts and cashback incentives, suggesting that premium offerings rely on value rather than incentives.
- **Price vs Rating:**
- A weak positive correlation is observed between price and rating. Higher-priced hotels generally have slightly better ratings, although this relationship is not very strong.
- **Reviews vs Rating:**
- A moderate positive correlation exists between reviews and ratings. Hotels that attract more reviews tend to have better ratings, likely due to a larger customer base providing feedback.
- **Reviews vs Price:**
- A weak positive correlation between reviews and price suggests that well-reviewed hotels are often priced higher, likely due to their popularity and perceived value.
- **Summary:**
- The heatmap provides insights into the relationships between key variables in the dataset. Higher-priced hotels tend to offer fewer discounts and cashback offers but generally have better ratings and more reviews, indicating that customers are willing to pay a premium for a better experience.
""")
st.header("Overall Summary")
st.write("""
- Most properties are affordable, with lower prices, cashback, and discounts dominating the dataset.
- Regional distribution shows states like Maharashtra and Madhya Pradesh having more hotels.
- The data reflects a market focused on affordability and basic amenities, with regional and category-specific variations.
- Cancellations and reviews provide further insights into customer behavior, while skewed distributions highlight potential outliers and trends in pricing and service offerings.
""")
st.header("Why Right-Skewed Trends Are Normal, Not Outliers")
st.write("""
- Right-skewed distributions for price, cashback, discounts, cancellations, reviews, and amenities are normal trends in the market.
- These trends represent the expected distribution of data where higher values are less frequent but are not considered outliers.
- The variations in cancellation patterns and review counts reflect typical customer behavior and industry dynamics.
""")
st.write("""
Since prices above 40,000 were already filtered out and the remaining skew is treated as normal market behavior, we can proceed with model training and selection.
With clean data, we can now focus on choosing the best algorithm, tuning hyperparameters, and evaluating model performance.
""")
# Title for Streamlit App
st.title("Feature Engineering on Dataset")
# Feature Engineering Explanation
st.markdown("""
### What is Feature Engineering?
Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models.
It involves techniques such as:
- Encoding categorical variables into numerical values
- Handling class imbalances
- Selecting and transforming features to enhance model accuracy
In this app, we will apply feature engineering techniques to prepare the dataset for analysis and modeling.
""")
# Load Dataset (you can replace this with actual file upload)
# Read the dataset
# Creating a working copy of the dataset
st.subheader("Creating a Working Copy of the Dataset")
st.markdown("""
To ensure that the original dataset remains intact, we create a working copy of the dataset named `df`.
This allows us to make transformations and modifications without altering the uploaded data.
""")
df = data.copy()
st.subheader("Dataset Preview:")
st.write(df.head())  # Display the first 5 rows
st.subheader("Info of the Dataset:")
# Redirect the output of df.info() to a string buffer
buffer = StringIO()
df.info(buf=buffer)
# Display the content in Streamlit
st.text(buffer.getvalue())  # st.text keeps the fixed-width layout of df.info()
st.subheader("Dataset Statistics:")
st.write(df.describe())
st.subheader("Dataset Shape (Rows, Columns):")
st.write(df.shape)
# Checking the number of categories in the 'category' column
st.subheader("Category Distribution")
st.markdown("""
**Step 1: Mapping Categories**
- The `category` column contains categorical data representing different types of hotels (e.g., Low Budget, Luxury).
- These categories are mapped to numerical values to make them suitable for machine learning models.
""")
st.write("Category Value Counts (Before Mapping):")
st.write(df["category"].value_counts())
# Mapping Agoda hotel categories to numerical values
category_mapping = {
"Low Budget": 0,
"Budget Hotels": 1,
"Mid-Range Hotels": 2,
"Premium Hotels": 3,
"Luxury Hotels": 4,
}
df["category"] = df["category"].map(category_mapping)
st.write("Category Value Counts (After Mapping):")
st.write(df["category"].value_counts())
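# `Series.map` turns any label missing from the dictionary into NaN, which
# would typically break the SMOTE step later. A small check, assuming string
# labels; `unmapped_labels` is a hypothetical helper name.
def unmapped_labels(series, mapping):
    """Return the sorted labels present in `series` that are not keys of `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))


# Possible usage against the original labels still held in `data`:
# missing = unmapped_labels(data["category"], category_mapping)
# if missing:
#     st.warning(f"Categories without a numeric code: {missing}")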
# Encoding 'state' column
st.subheader("State Encoding")
st.markdown("""
**Step 2: Encoding States**
- The `state` column contains categorical location data.
- We encode it into numerical values using `astype('category').cat.codes`, where each unique state is assigned a unique integer.
""")
st.write("State Value Counts (Before Encoding):")
st.write(df["state"].value_counts())
df["state"] = df["state"].astype("category").cat.codes
st.write("State Value Counts (After Encoding):")
st.write(df["state"].value_counts())
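# `cat.codes` discards the state names, so it can help to keep a lookup from
# code to name for interpreting model output. A sketch; the helper name
# `category_code_lookup` is hypothetical.
def category_code_lookup(series):
    """Return {code: label} matching what `astype('category').cat.codes` assigns."""
    return dict(enumerate(series.astype("category").cat.categories))


# Possible usage, built from the unencoded column still held in `data`:
# state_lookup = category_code_lookup(data["state"])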
# Splitting the dataset into feature and target variables
st.subheader("Splitting Features and Target")
st.markdown("""
**Step 3: Splitting the Dataset**
- The dataset is split into two parts:
- **Feature Variables (X):** Attributes used for prediction.
- **Target Variable (y):** The value we want to predict (here, `category`).
""")
feature_variables = df.drop(columns="category")  # Feature variables: every column except the target
target_variable = df["category"]  # Target variable, selected by name rather than position
st.write("Feature Variables (from Dataset):", feature_variables.head())
st.write("Target Variable (from Dataset):", target_variable.head())
# Selecting specific features for analysis
X = feature_variables[["rating", "reviews", "cashback", "discount", "state", "price"]]
y = target_variable
st.write("Selected Features (X) from Dataset:", X.head())
st.write("Target Variable (y) from Dataset:", y.head())
# Balancing the Dataset with SMOTE
st.subheader("Balancing Dataset with SMOTE")
st.markdown("""
**Step 4: Handling Imbalanced Classes**
- Imbalanced datasets can bias models towards majority classes.
- We use **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for underrepresented classes, ensuring a balanced dataset.
""")
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
st.write("Balanced Dataset (X_res):", X_res.head())
st.write("Balanced Target Variable (y_res) Distribution:")
st.write(y_res.value_counts())
# Save data in session state
st.session_state['X_res'] = X_res
st.session_state['y_res'] = y_res
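# If the balanced data is later split into train and test sets, synthetic rows
# can end up in the test set. A hedged alternative is to hold out test rows
# first and oversample only the rest. `resample_training_split` is a
# hypothetical helper; it assumes a unique index, uses a plain random
# (non-stratified) hold-out, and accepts any object with `fit_resample`,
# such as the `smote` instance above.
def resample_training_split(X, y, resampler, test_fraction=0.2, seed=42):
    """Hold out a random test set, then resample only the training rows."""
    test_index = X.sample(frac=test_fraction, random_state=seed).index
    train_mask = ~X.index.isin(test_index)
    X_train, y_train = X[train_mask], y[train_mask]
    X_test, y_test = X.loc[test_index], y.loc[test_index]
    X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)
    return X_train_res, y_train_res, X_test, y_test


# Possible usage:
# X_tr, y_tr, X_te, y_te = resample_training_split(X, y, SMOTE(random_state=42))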
else:
st.warning("No dataset found in session state. Please run the data-cleaning page first so that `st.session_state['cleaned_data']` is set.")
st.markdown(
"""
<style>
.custom-button {
display: inline-block;
padding: 5px 10px;
font-size: 14px;
color: #ffffff;
background-color: #4CAF50;
border: none;
border-radius: 5px;
text-align: center;
text-decoration: none;
transition: background-color 0.3s ease, transform 0.2s ease;
cursor: pointer;
}
.custom-button:hover {
background-color: #45a049;
transform: scale(1.05);
}
.button-container {
display: flex;
justify-content: space-between;
margin-top: 20px;
}
</style>
""",
unsafe_allow_html=True,
)
# Navigation Buttons
st.markdown(
"""
<div class="button-container">
<a href="pages/2_Data_Cleaning_and_Processing" target="_self" class="custom-button">Previous ⏮️</a>
<a href="pages/4_Model_Creation_and_Evaluation" target="_self" class="custom-button">Next ⏭️</a>
</div>
""",
unsafe_allow_html=True,
)