| import streamlit as st | |
| import numpy as np | |
| import pandas as pd | |
| import matplotlib.pyplot as plt | |
| import seaborn as sns | |
| from io import StringIO | |
| import sys | |
| from imblearn.over_sampling import SMOTE | |
| st.markdown("<h1 style='text-align:center; color:lime;'>EDA and Feature Engineering</h1>",unsafe_allow_html=True) | |
| # Define the URL of the background image (use your own image URL) | |
| background_image_url = "https://cdn-uploads.huggingface.co/production/uploads/675fab3a2d0851e23d23cad3/vulm4WwHmmA14tsVXYaTM.jpeg" | |
| # Apply custom CSS for the background image and overlay | |
| st.markdown( | |
| f""" | |
| <style> | |
| .stApp {{ | |
| background-image: url("{background_image_url}"); | |
| background-size: auto; /* Use the image's intrinsic size (no scaling) */ | |
| background-repeat: repeat-y; /* Repeat only vertically */ | |
| background-position: top center; /* Start repeating from the top center */ | |
| background-attachment: fixed; /* Keeps the background fixed as you scroll */ | |
| }} | |
| /* Semi-transparent overlay */ | |
| .stApp::before {{ | |
| content: ""; | |
| position: absolute; | |
| top: 0; | |
| left: 0; | |
| width: 100%; | |
| height: 100%; | |
| background: rgba(0, 0, 0, 0.4); /* Adjust overlay opacity here (0.4 = 40% opaque black) */ | |
| z-index: -1; | |
| }} | |
| /* Container to center elements and limit width */ | |
| .content-container {{ | |
| max-width: 70%; /* Limit content width to 70% */ | |
| margin: 0 auto; /* Center the container */ | |
| padding: 50px; /* Add some padding for spacing */ | |
| }} | |
| /* Styling the markdown content */ | |
| .stMarkdown {{ | |
| color: white; /* White text to ensure visibility */ | |
| font-size: 100px; /* Adjust font size for readability */ | |
| }} | |
| </style> | |
| """, | |
| unsafe_allow_html=True | |
| ) | |
| # Title of the Streamlit app | |
| st.title("Exploratory Data Analysis (EDA) on Agoda Hotel Dataset") | |
| # Introduction and Aim | |
| st.header("Aim of the EDA") | |
| st.write(""" | |
| The main objective of this EDA is to analyze Agoda's hotel dataset to identify key factors influencing hotel pricing strategies and customer booking preferences. | |
| The analysis will focus on uncovering patterns, trends, and relationships in hotel ratings, pricing structures, discounts, and free services. | |
| By leveraging these insights, Agoda can optimize its pricing strategy, predict booking preferences, and enhance revenue generation while maintaining customer satisfaction. | |
| """) | |
| # Description of the Data | |
| st.header("Description of the Data") | |
| st.write(""" | |
| **Overall Summary:** We are analyzing the Agoda dataset by performing EDA and Statistical Tests on the data that has already been cleaned through data wrangling to address any messiness or missing information. | |
| **Table - Agoda_df:** The cleaned dataset consists of over 3,500 hotel listings, which form the basis of the hotel pricing analysis. | |
| **Dataset Details:** | |
| The dataset contains information about 3,219 hotel room listings with 12 features, each detailing aspects of the listing. Below is the description of each column: | |
| | Column Name | Description | | |
| |-----------------|---------------------------------------------------------------------------| | |
| | hotel_name | Name of the hotel. | | |
| | rating | Average customer rating of the hotel (float, range 1-5). | | |
| | location | Address or locality of the hotel. | | |
| | review_text | Customer feedback or comments about the hotel. | | |
| | reviews | Total number of customer reviews for the hotel. | | |
| | cashback | Cashback amount offered for the booking. | | |
| | discount | Discount percentage applied to the room price. | | |
| | free_services | Free services provided (e.g., breakfast, Wi-Fi). | | |
| | cancellation | Cancellation policy for the booking (e.g., free, non-refundable). | | |
| | price | Price of the room after discounts and cashback (float). | | |
| | state | The state where the hotel is located. | | |
| | category | Target variable representing the room type or category (e.g., budget, luxury). | | |
| """) | |
| # Table-wise EDA & Necessary Tests | |
| st.header("Table-wise EDA and Necessary Statistical Tests") | |
| st.write(""" | |
| **Agoda_df:** Cleaned dataset with hotel details and key features like ratings, price, reviews, cashback, discounts, and free services. | |
| The EDA will involve the following steps: | |
| - **Summary Statistics:** Analyze the central tendency, spread, and shape of the distribution of each feature. | |
| - **Data Distribution:** Visualize the distribution of key features like price, ratings, reviews, cashback, etc. | |
| - **Correlation Analysis:** Analyze relationships between numeric features like price, ratings, reviews, cashback, etc. | |
| - **Categorical Data Analysis:** Explore categorical variables like hotel category, cancellation policy, state, and location using frequency tables and visualizations. | |
| - **Missing Value Analysis:** Ensure no missing values remain, and check the need for imputations. | |
| - **Outlier Detection:** Identify any outliers that may skew the analysis or predictions. | |
| - **Statistical Tests:** Apply appropriate statistical tests to identify significant differences or relationships (e.g., t-tests for comparing means, chi-squared for categorical variables). | |
| """) | |
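A minimal sketch of the t-test and chi-squared test mentioned in the list above, using SciPy. The `demo` frame, its values, and the helper names `compare_mean_price` and `chi2_independence` are illustrative assumptions, not the Agoda data or the app's own code.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the cleaned dataset (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "category": rng.choice(["Budget", "Luxury"], size=200),
    "cancellation": rng.choice([0, 1], size=200),
})
# Give Luxury rows a higher mean price so the t-test has an effect to detect.
demo["price"] = np.where(
    demo["category"] == "Luxury",
    rng.normal(9000, 1500, size=200),
    rng.normal(3000, 800, size=200),
)


def compare_mean_price(df, group_col, a, b):
    """Welch's t-test on price between two groups (no equal-variance assumption)."""
    x = df.loc[df[group_col] == a, "price"]
    y = df.loc[df[group_col] == b, "price"]
    return stats.ttest_ind(x, y, equal_var=False)


def chi2_independence(df, col_a, col_b):
    """Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2, p_value, dof, _expected = stats.chi2_contingency(table)
    return chi2, p_value, dof


t_stat, t_p = compare_mean_price(demo, "category", "Luxury", "Budget")
chi2, chi_p, dof = chi2_independence(demo, "category", "cancellation")
```

In the app, the same helpers could be called on `data` and the results shown with `st.write`; which pairs of columns to test is a modelling decision left open here.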
| # Placeholder for further detailed code or visualizations | |
| st.write("Further steps will include generating visualizations and statistical tests to explore relationships between features in more detail.") | |
| # Check if the cleaned data exists in session state | |
| if 'cleaned_data' in st.session_state: | |
| data = st.session_state.cleaned_data | |
| # Display the cleaned data on the second page | |
| st.subheader("Cleaned Data from Page 1:") | |
| st.write(data) | |
| st.write(""" | |
| ### **Dataset Information** | |
| The cleaned dataset has been successfully accessed from the session state. | |
| """) | |
| st.subheader("Dataset Preview:") | |
| st.write(data.head()) # Display the first 5 rows | |
| st.subheader("Info of the Dataset:") | |
| # Redirect the output of df.info() to a string buffer | |
| buffer = StringIO() | |
| data.info(buf=buffer) | |
| # Display the captured info() output as preformatted text | |
| st.text(buffer.getvalue()) | |
| st.subheader("Dataset Statistics:") | |
| st.write(data.describe()) | |
| st.subheader("Dataset Shape (Rows, Columns):") | |
| st.write(data.shape) | |
| data = data[data["price"] <= 40000].copy() # Keep rows where price is at most 40,000; copy so later column assignments do not modify a view | |
| # Univariate analysis | |
| st.success("Dataset successfully loaded from session state!") | |
| st.subheader("Univariate Analysis") | |
| # Rating and Review Text Distribution | |
| st.subheader("Rating and Review Text Distribution") | |
| fig, axs = plt.subplots(1, 2, figsize=(16, 6)) | |
| data["rating"].value_counts().plot(kind='pie', title='Distribution of Ratings', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[0]) | |
| axs[0].set_title("Distribution of Ratings") | |
| data['review_text'].value_counts().plot(kind='pie', title='Distribution of Review Text', autopct='%1.1f%%', shadow=True, startangle=45, textprops={'size': 'x-large'}, ax=axs[1]) | |
| axs[1].set_title("Distribution of Review Text") | |
| st.pyplot(fig) | |
| # Hotel Star Insights | |
| st.write(""" | |
| **Insight:** | |
| - Majority of hotels in this data are 3-star hotels. | |
| - 4-star and 5-star hotels are also moderately frequent. | |
| - 1-star and 2-star hotels are lower in frequency. | |
| """) | |
| # Price, Cashback, and Discount Distribution | |
| st.subheader("Price, Cashback, and Discount Distribution") | |
| fig, axs = plt.subplots(1, 3, figsize=(16, 6)) | |
| sns.histplot(data=data, x='price', color='green', kde=True, ax=axs[0]) | |
| axs[0].set_title("Count based on price") | |
| axs[0].set_xlabel('Price') | |
| axs[0].set_ylabel('Number of Hotels') | |
| sns.histplot(data=data, x='cashback', color='violet', kde=True, ax=axs[1]) | |
| axs[1].set_title("Count based on cashback") | |
| axs[1].set_xlabel('Cashback') | |
| axs[1].set_ylabel('Number of Hotels') | |
| sns.histplot(data=data, x='discount', color='orange', kde=True, ax=axs[2]) | |
| axs[2].set_title("Count based on discount") | |
| axs[2].set_xlabel('Discount') | |
| axs[2].set_ylabel('Number of Hotels') | |
| st.pyplot(fig) | |
| # Histogram Insights | |
| st.subheader("Plot-wise Analysis of Histograms") | |
| st.write(""" | |
| **Price Distribution Insight:** | |
| - The histogram is right-skewed, showing most properties are in the lower price range. | |
| - A long tail indicates the presence of a few very expensive properties. | |
| **Cashback Distribution Insight:** | |
| - The histogram is right-skewed, with the majority of properties offering lower cashback amounts. | |
| - Only a small number of properties provide higher cashback. | |
| **Discount Distribution Insight:** | |
| - The histogram is right-skewed, indicating that most properties offer lower discount percentages. | |
| - A few properties stand out with higher discounts. | |
| **Summary:** | |
| The data suggest that Agoda properties are generally affordable, with lower cashback and discount offers being common. Further statistical analysis could help uncover more detailed insights. | |
| """) | |
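The right-skew described above is read off the histograms. A sketch of a numeric check with `DataFrame.skew()`, where positive values indicate a longer right tail. The lognormal sample standing in for a price-like column and the 0.5 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one price-like (lognormal) and one roughly symmetric.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "price": rng.lognormal(mean=8.0, sigma=0.6, size=1000),
    "symmetric": rng.normal(0.0, 1.0, size=1000),
})

# Sample skewness per column; values well above 0 suggest a right-skewed shape.
skewness = demo.skew()

# Rule-of-thumb threshold (assumed): treat skewness above 0.5 as right-skewed.
right_skewed = skewness[skewness > 0.5].index.tolist()
```

Applied to the app's `data`, the equivalent call would be `data[['price', 'cashback', 'discount']].skew()`.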
| # Cancellation and State Distribution | |
| st.subheader("Cancellation and State Distribution") | |
| fig, axs = plt.subplots(1, 2, figsize=(16, 6)) | |
| data["cancellation"].value_counts().plot(kind='bar', title='Distribution of Cancellation', color='red', ax=axs[0]) | |
| axs[0].set_title("Distribution of Cancellation") | |
| axs[0].set_xlabel('Cancellation') | |
| axs[0].set_ylabel('Number of Hotels') | |
| data["state"].value_counts().plot(kind='bar', title='Distribution of State', color='black', ax=axs[1]) | |
| axs[1].set_title("Distribution of State") | |
| axs[1].set_xlabel('State') | |
| axs[1].set_ylabel('Number of Hotels') | |
| st.pyplot(fig) | |
| # Bar Chart Insights | |
| st.subheader("Plot Wise Analysis of Bar Charts") | |
| st.write(""" | |
| **Cancellations:** | |
| - Most cancellations fall under category "1," indicating they occur within specific conditions or timeframes. | |
| **State Distribution:** | |
| - "Maharashtra" has the highest number of hotels, followed by "Madhya Pradesh." | |
| - Other states like Gujarat, Karnataka, and Kerala also have notable hotel counts. | |
| - The distribution is uneven, with some states having significantly more hotels. | |
| **Summary:** | |
| The charts highlight cancellation trends and the regional hotel distribution in India. | |
| """) | |
| # Category and Reviews Distribution | |
| st.subheader("Category and Reviews Distribution") | |
| fig, axs = plt.subplots(1, 2, figsize=(16, 6)) | |
| colors = sns.color_palette('Set2', n_colors=len(data["category"].value_counts())) | |
| data["category"].value_counts().plot(kind='bar', ax=axs[0], color=colors) | |
| axs[0].set_title("Distribution of Category") | |
| axs[0].set_xlabel('Category') | |
| axs[0].set_ylabel('Number of Hotels') | |
| sns.histplot(data=data, x='reviews', color='violet', kde=True, ax=axs[1]) | |
| axs[1].set_title("Count based on Reviews") | |
| axs[1].set_xlabel('Reviews') | |
| axs[1].set_ylabel('Number of Hotels') | |
| st.pyplot(fig) | |
| # Hotel Categories and Reviews Insights | |
| st.subheader("Plot Wise Analysis of Hotel Categories and Reviews") | |
| st.write(""" | |
| **Category Distribution:** | |
| - The bar chart shows "Low Budget" hotels are the most common, followed by "Budget Hotels," while "Luxury Hotels" are the least common. | |
| **Review Count Distribution:** | |
| - The histogram is right-skewed, with most hotels having a low number of reviews. | |
| - A few hotels have a very high number of reviews, evident from the long tail. | |
| **Summary:** | |
| The data indicates a higher concentration of low-budget hotels and relatively low review counts for most hotels. | |
| """) | |
| if 'free_services' in data.columns: | |
|     # Convert non-string values to strings and handle missing values | |
|     data['free_services'] = data['free_services'].fillna('').astype(str) | |
|     # Split, flatten, trim, and count the individual amenities | |
|     amenity_counts = ( | |
|         data['free_services'] | |
|         .str.split(',') # Split the strings by commas | |
|         .explode() # Flatten the lists into individual rows | |
|         .str.strip() # Remove leading/trailing spaces | |
|         .loc[lambda s: s != ''] # Drop empty entries created by missing values | |
|         .value_counts() # Count occurrences | |
|         .reset_index() # Convert to a DataFrame | |
|     ) | |
|     amenity_counts.columns = ['Amenity', 'Count'] # Rename columns for clarity | |
|     st.write(amenity_counts) | |
| else: | |
|     st.write("The 'free_services' column does not exist in the dataset.") | |
| # Top Amenities Insights | |
| st.subheader("Plot Wise Analysis of Top Amenities") | |
| st.write(""" | |
| **Common Amenities:** | |
| - Complimentary Parking is the most frequently offered amenity. | |
| - Basic Toiletries and Hair Dryers are also widely available. | |
| **Less Common Amenities:** | |
| - Fitness Center Access, Welcome Drinks, and Turndown Service are less common. | |
| - Shoe Shine Service is the least frequently offered amenity. | |
| **Summary:** | |
| Hotels tend to prioritize basic amenities like parking, toiletries, and hair dryers, while luxurious amenities are offered less frequently. | |
| """) | |
| st.title("Bivariate Analysis") | |
| # Price vs Rating scatter plot | |
| st.subheader("Price vs Rating") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.scatterplot(x='rating', y='price', data=data, color='orange', ax=ax) | |
| ax.set_title('Price vs Rating') | |
| ax.set_xlabel('Rating') | |
| ax.set_ylabel('Price') | |
| st.pyplot(fig) | |
| # Price vs Rating | |
| st.subheader("Price vs Rating:") | |
| st.write(""" | |
| - **Analysis:** | |
| - Higher-priced hotels tend to have slightly better ratings, but ratings vary widely across price points. | |
| - Hotels at various price points exhibit a large spread in ratings, meaning factors other than price (such as customer experience or amenities) contribute significantly to the rating. | |
| """) | |
| # Price vs Discount scatter plot | |
| st.subheader("Price vs Discount") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.scatterplot(x='discount', y='price', data=data, color='green', ax=ax) | |
| ax.set_title('Price vs Discount') | |
| ax.set_xlabel('Discount') | |
| ax.set_ylabel('Price') | |
| st.pyplot(fig) | |
| # Price vs Discount | |
| st.subheader("Price vs Discount:") | |
| st.write(""" | |
| - **Analysis:** | |
| - Some high-priced hotels still provide discounts due to promotions or special deals. | |
| - This observation suggests that while premium hotels may not always need to offer discounts to attract customers, they occasionally use them as a marketing strategy or for seasonal promotions. | |
| """) | |
| # Price vs Cashback scatter plot | |
| st.subheader("Price vs Cashback") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.scatterplot(x='cashback', y='price', data=data, color='blue', ax=ax) | |
| ax.set_title('Price vs Cashback') | |
| ax.set_xlabel('Cashback') | |
| ax.set_ylabel('Price') | |
| st.pyplot(fig) | |
| # Price vs Cashback | |
| st.subheader("Price vs Cashback:") | |
| st.write(""" | |
| - **Analysis:** | |
| - Exceptions exist due to promotional campaigns. | |
| - While higher-priced hotels generally offer fewer cashback incentives, some may offer cashback due to specific promotional campaigns aimed at increasing sales volume or attracting customers in a competitive market. | |
| """) | |
| # Price vs Category bar plot | |
| st.subheader("Price vs Category") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.barplot(x='category', y='price', data=data, palette='Set2', ax=ax) | |
| ax.set_title('Price vs Category') | |
| ax.set_xlabel('Category') | |
| ax.set_ylabel('Price') | |
| st.pyplot(fig) | |
| # Price vs Category | |
| st.subheader("Price vs Category:") | |
| st.write(""" | |
| - **Analysis:** | |
| - "Luxury" hotels have the highest prices, followed by "Premium" and "Free & Easy." | |
| - "Low Budget" and "Budget" hotels occupy the lower price range, showing that the category directly influences pricing strategy. | |
| - Categories like "Luxury" and "Premium" aim to target a specific market willing to pay more for superior quality and services, while "Budget" and "Low Budget" cater to a price-sensitive segment. | |
| """) | |
| # Summary of Scatter Plot Analysis | |
| st.subheader("Summary of Scatter Plot Analysis:") | |
| st.write(""" | |
| - **Price vs Rating:** Higher-priced hotels generally offer better ratings, but the ratings vary widely across price points, indicating that factors such as service quality and amenities matter significantly. | |
| - **Price vs Discount:** High-priced hotels may still offer discounts due to seasonal promotions or special offers. | |
| - **Price vs Cashback:** Although high-priced hotels generally offer fewer cashback incentives, there are exceptions driven by promotional campaigns. | |
| - **Price vs Category:** "Luxury" hotels are the most expensive, followed by "Premium" and "Free & Easy" categories. On the other hand, "Low Budget" and "Budget" hotels have lower prices. | |
| - **Overall Insight:** The scatter plots reveal trends where higher-priced hotels tend to offer better ratings but fewer discounts and cashback incentives, while lower-priced categories tend to provide more promotional benefits such as discounts and cashback. | |
| """) | |
| # Rating vs Category bar plot | |
| st.subheader("Rating vs Category") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.barplot(x='category', y='rating', data=data, palette='Set1', ax=ax) | |
| ax.set_title('Rating vs Category') | |
| ax.set_xlabel('Category') | |
| ax.set_ylabel('Rating') | |
| st.pyplot(fig) | |
| # Rating vs Category | |
| st.subheader("Rating vs Category:") | |
| st.write(""" | |
| - **Analysis:** | |
| - "Luxury" hotels lead in average ratings, followed by "Premium" hotels. | |
| - "Budget" and "Low Budget" categories show lower average ratings, indicating that these hotels may focus on price rather than offering premium services or experiences. | |
| - The disparity in ratings shows that customers tend to have higher expectations for luxury and premium accommodations, which are reflected in the ratings. | |
| """) | |
| # Discount vs Category box plot | |
| st.subheader("Discount vs Category") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.boxplot(x='category', y='discount', data=data, palette='Set2', ax=ax) | |
| ax.set_title('Discount vs Category') | |
| ax.set_xlabel('Category') | |
| ax.set_ylabel('Discount') | |
| st.pyplot(fig) | |
| # Discount vs Category | |
| st.subheader("Discount vs Category:") | |
| st.write(""" | |
| - **Analysis:** | |
| - "Low Budget" hotels offer the highest discounts, while "Luxury" hotels provide the least discounts. | |
| - This suggests that budget-friendly hotels use discounts as a key strategy to attract price-sensitive customers, whereas luxury hotels focus on providing premium experiences without relying on price reductions. | |
| - The discount strategy varies by category, with lower-priced categories incentivizing customers with discounts to stay competitive. | |
| """) | |
| # Cashback vs Category violin plot | |
| st.subheader("Cashback vs Category") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.violinplot(x='category', y='cashback', data=data, palette='Set3', ax=ax) | |
| ax.set_title('Cashback vs Category') | |
| ax.set_xlabel('Category') | |
| ax.set_ylabel('Cashback') | |
| st.pyplot(fig) | |
| # Cashback vs Category | |
| st.subheader("Cashback vs Category:") | |
| st.write(""" | |
| - **Analysis:** | |
| - Higher cashback offers are more common in "Low Budget" hotels, as these hotels rely on cashback incentives to attract customers looking for value deals. | |
| - Luxury hotels rarely provide cashback, as their target market is less likely to be motivated by such offers. | |
| - The trend highlights the different strategies employed by each category: budget options often provide financial incentives like cashback to drive bookings, while luxury options focus on premium services. | |
| """) | |
| # Reviews vs Category bar plot (mean number of reviews per category) | |
| st.subheader("Reviews vs Category") | |
| fig, ax = plt.subplots(figsize=(7, 5)) | |
| sns.barplot(x='category', y='reviews', data=data, palette='Set1', ax=ax) | |
| ax.set_title('Reviews vs Category (Average)') | |
| ax.set_xlabel('Category') | |
| ax.set_ylabel('Average Number of Reviews') | |
| st.pyplot(fig) | |
| # Reviews vs Category | |
| st.subheader("Reviews vs Category:") | |
| st.write(""" | |
| - **Analysis:** | |
| - "Luxury" hotels attract the most reviews, indicating that higher-quality accommodations often receive more feedback from customers. | |
| - "Budget" and "Low Budget" hotels tend to have fewer reviews, which may be due to their more straightforward offerings and smaller customer base. | |
| - This trend suggests that customers who opt for luxury hotels are more likely to share their experiences, whereas budget options may attract fewer repeat customers or have less word-of-mouth influence. | |
| """) | |
| # Summary of Bar and Box Plot Analysis | |
| st.subheader("Summary of Bar and Box Plot Analysis:") | |
| st.write(""" | |
| - **Rating vs Category:** "Luxury" and "Premium" hotels have higher ratings on average, while "Budget" and "Low Budget" hotels show lower ratings. | |
| - **Discount vs Category:** Budget hotels, especially "Low Budget," offer higher discounts, while luxury hotels offer fewer discounts, relying more on their value proposition. | |
| - **Cashback vs Category:** "Low Budget" hotels offer higher cashback incentives, while luxury hotels rarely provide cashback, highlighting the pricing strategies in different categories. | |
| - **Reviews vs Category:** "Luxury" hotels attract the most reviews, while "Budget" and "Low Budget" hotels attract fewer. | |
| - **Overall Insight:** Bar and box plots reveal that higher-rated and more reviewed hotels tend to offer fewer discounts or cashback, focusing on customer experience, while budget categories focus on providing customer incentives to compete in the market. | |
| """) | |
| # Regional price analysis by state | |
| st.subheader("Price by State") | |
| fig, ax = plt.subplots(figsize=(16, 6)) | |
| sns.barplot(data=data, x='state', y='price', ax=ax, color='green') | |
| ax.set_title('Price by State') | |
| ax.tick_params(axis='x', rotation=90) | |
| sns.set_palette('magma') | |
| plt.tight_layout() | |
| st.pyplot(fig) | |
| # Hotel Prices Across Indian States | |
| st.subheader("Hotel Prices Across Indian States:") | |
| st.write(""" | |
| - **Analysis:** | |
| - Hotel prices vary significantly across Indian states, reflecting regional differences in demand and supply. | |
| - States with popular tourist destinations, such as Goa and Rajasthan, tend to show higher hotel prices, as they attract more visitors and have a higher demand for accommodations. | |
| - Conversely, states with less tourism or lower demand may exhibit more affordable pricing, catering to the local population or budget travelers. | |
| - Price differences also reflect factors such as local economic conditions, infrastructure, and tourism policies in each state. | |
| """) | |
| # Regional category count by state | |
| st.subheader("Category by State") | |
| fig, ax = plt.subplots(figsize=(16, 6)) | |
| sns.countplot(data=data, x='state', hue='category', ax=ax, palette='Set1') | |
| ax.set_title('Category by State') | |
| ax.tick_params(axis='x', rotation=90) | |
| plt.tight_layout() | |
| st.pyplot(fig) | |
| # Hotel Categories by State | |
| st.subheader("Hotel Categories by State:") | |
| st.write(""" | |
| - **Analysis:** | |
| - States with a higher concentration of "Low Budget" and "Budget" hotels cater primarily to cost-conscious travelers, offering affordable accommodations for a wide range of customers. | |
| - States with more "Luxury" hotels are likely to be major tourist hubs, such as Delhi, Mumbai, and Kerala, or regions that cater to premium audiences, offering high-end services for affluent customers. | |
| - These states may also focus on attracting international tourists or business travelers who prefer premium amenities and luxury experiences. | |
| """) | |
| # Summary of Regional Price and Category Trends | |
| st.subheader("Summary of Regional Price and Category Trends:") | |
| st.write(""" | |
| - **Hotel Prices Across Indian States:** Prices vary significantly depending on the state, with tourist-heavy regions showing higher prices due to greater demand. | |
| - **Hotel Categories by State:** States with more budget hotels focus on catering to price-sensitive travelers, while states with luxury hotels cater to premium audiences, often in tourist hotspots. | |
| - **Overall Insight:** Regional trends indicate diverse pricing and category distributions, influenced by tourism, regional economics, and state-specific factors that shape hotel offerings across the country. | |
| """) | |
| st.title("Multivariate Analysis of Hotel Data") | |
| # Create a subset of the data for the analysis | |
| subset_data = data[['category', 'price', 'reviews', 'discount', 'cashback', 'rating']] | |
| # Section 1: Price vs. Reviews by Category | |
| st.header("Price vs. Reviews by Category") | |
| fig1 = sns.catplot(data=data, x='reviews', y='price', hue='category', kind='strip', palette='Set2', height=6, aspect=1.5) | |
| fig1.set_axis_labels("Reviews", "Price") | |
| fig1.figure.suptitle('Price vs Reviews by Category', fontsize=16) | |
| st.pyplot(fig1.figure) # catplot returns a FacetGrid; pass its underlying Figure | |
| # Analysis Text for Price vs. Reviews by Category | |
| st.write(""" | |
| - **Price Variation within Categories:** | |
| - Wide price ranges exist within each category, with "Low Budget" hotels featuring both low- and high-priced options. | |
| - This shows that pricing within categories isn't always uniform and may depend on other factors like location, amenities, and hotel size. | |
| - **Price and Reviews Relationship:** | |
| - There's a slight tendency for hotels with more reviews to have higher prices, possibly due to the influence of popularity, better marketing efforts, or higher quality services. | |
| - **Summary:** | |
| - The stripplot reveals a weak positive correlation between price and reviews, indicating that well-reviewed hotels tend to have higher prices. | |
| """) | |
| # Section 2: Price vs. Discount by Category | |
| st.header("Price vs. Discount by Category") | |
| fig2 = sns.catplot(data=data, x='discount', y='price', hue='category', kind='bar', palette='Set2', height=6, aspect=1.5) | |
| fig2.set_axis_labels("Discount", "Price") | |
| fig2.figure.suptitle('Price vs Discount by Category', fontsize=16) | |
| st.pyplot(fig2.figure) # catplot returns a FacetGrid; pass its underlying Figure | |
| # Analysis Text for Price vs. Discount by Category | |
| st.write(""" | |
| - **Price and Discount Relationship:** | |
| - The bar plot shows that as hotel prices increase, discounts tend to decrease, which points to a negative correlation between price and discount. | |
| - **Category-Specific Trends:** | |
| - "Low Budget" and "Budget" hotels offer much higher discounts compared to "Premium" and "Luxury" hotels, which typically offer fewer or smaller discounts. | |
| - This trend highlights that budget-conscious categories use higher discounts to attract customers, whereas premium and luxury categories rely on factors other than discounts (e.g., quality, exclusivity) to appeal to their clientele. | |
| - **Summary:** | |
| - The plot confirms that lower-priced hotels use higher discounts to attract customers, while premium and luxury hotels maintain lower discount rates, aligning with typical market behavior. | |
| """) | |
| # Section 3: Price vs Cashback and Rating by Category (Stripplot) | |
| st.header("Price vs Cashback and Rating by Category") | |
| fig3, axes2 = plt.subplots(1, 2, figsize=(16, 6)) | |
| sns.stripplot(data=data, x='cashback', y='price', hue='category', ax=axes2[0], palette='Set2', jitter=True, dodge=True) | |
| axes2[0].set_title('Price vs Cashback by Category') | |
| sns.stripplot(data=data, x='rating', y='price', hue='category', ax=axes2[1], palette='Set2', jitter=True, dodge=True) | |
| axes2[1].set_title('Price vs Rating by Category') | |
| st.pyplot(fig3) | |
| # Analysis Text for Price vs Cashback by Category | |
| st.write(""" | |
| - **Price vs Cashback by Category:** | |
| - The plot shows that cashback incentives tend to decrease as hotel prices increase. However, variations exist due to promotional offers that may affect cashback amounts. | |
| - Lower-priced categories, such as "Budget" and "Low Budget," offer higher cashback incentives to attract price-sensitive customers. | |
| - **Summary:** | |
| - The stripplot reveals a clear trend where lower-priced hotels use cashback as an incentive to boost bookings, whereas higher-priced hotels focus on other value propositions (e.g., premium services). | |
| """) | |
| # Analysis Text for Price vs Rating by Category | |
| st.write(""" | |
| - **Price vs Rating by Category:** | |
| - Higher-priced categories like "Premium" and "Luxury" tend to have better ratings, indicating that customers perceive these hotels as providing superior value. | |
| - Interestingly, some lower-priced hotels achieve high ratings, suggesting that other factors such as service quality and customer experience may contribute to higher satisfaction despite the lower price point. | |
| - **Summary:** | |
| - The stripplot shows that while higher-priced hotels emphasize quality to achieve better ratings, lower-priced hotels still manage to deliver satisfactory experiences for customers through factors like service quality. | |
| """) | |
| # Section 4: Correlation Heatmap | |
| st.header("Correlation Matrix Heatmap") | |
| numeric_columns = ['price', 'reviews', 'discount', 'cashback', 'rating'] | |
| # Compute the correlation matrix | |
| correlation_matrix = data[numeric_columns].corr() | |
| # Create a heatmap to visualize the correlation matrix | |
| fig4, ax4 = plt.subplots(figsize=(10, 8)) | |
| sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1, ax=ax4) | |
| # Set title for the plot | |
| ax4.set_title('Correlation Matrix Heatmap') | |
| # Display the plot (pass the Figure object, not the pyplot module) | |
| st.pyplot(fig4) | |
| # Analysis Text for Correlation Heatmap | |
| st.write(""" | |
| - **Price vs Discount/Cashback:** | |
| - The heatmap shows a strong negative correlation between price and both discount and cashback. Higher-priced hotels tend to offer fewer discounts and cashback incentives, suggesting that premium offerings rely on value rather than incentives. | |
| - **Price vs Rating:** | |
| - A weak positive correlation is observed between price and rating. Higher-priced hotels generally have slightly better ratings, although this relationship is not very strong. | |
| - **Reviews vs Rating:** | |
| - A moderate positive correlation exists between reviews and ratings. Hotels that attract more reviews tend to have better ratings, likely due to a larger customer base providing feedback. | |
| - **Reviews vs Price:** | |
| - A weak positive correlation between reviews and price suggests that well-reviewed hotels are often priced higher, likely due to their popularity and perceived value. | |
| - **Summary:** | |
| - The heatmap provides insights into the relationships between key variables in the dataset. Higher-priced hotels tend to offer fewer discounts and cashback offers but generally have better ratings and more reviews, indicating that customers are willing to pay a premium for a better experience. | |
| """) | |
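`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association. For right-skewed columns such as price and reviews, the Spearman rank correlation is a common complement because it only assumes a monotonic relationship. The sketch below uses a synthetic, deliberately non-linear pair of columns (an assumption, not the Agoda data) to show how the two can differ.

```python
import numpy as np
import pandas as pd

# Synthetic monotonic but strongly non-linear relationship (hypothetical values).
x = np.arange(1, 51)
demo = pd.DataFrame({"reviews": x, "price": np.exp(x / 5.0)})

# Pearson captures linear association; Spearman works on ranks.
pearson = demo.corr(method="pearson").loc["reviews", "price"]
spearman = demo.corr(method="spearman").loc["reviews", "price"]
```

In the app, `data[numeric_columns].corr(method='spearman')` could be drawn with the same `sns.heatmap` call for a side-by-side comparison.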
| st.header("Overall Summary") | |
| st.write(""" | |
| - Most properties are affordable: low prices, low cashback, and small discounts dominate the dataset. | |
| - Regional distribution shows states like Maharashtra and Madhya Pradesh having more hotels. | |
| - The data reflects a market focused on affordability and basic amenities, with regional and category-specific variations. | |
| - Cancellations and reviews provide further insights into customer behavior, while skewed distributions highlight potential outliers and trends in pricing and service offerings. | |
| """) | |
| st.header("Why Right-Skewed Trends Are Normal, Not Outliers") | |
| st.write(""" | |
| - Right-skewed distributions for price, cashback, discounts, cancellations, reviews, and amenities are normal trends in the market. | |
| - These trends represent the expected distribution of data where higher values are less frequent but are not considered outliers. | |
| - The variations in cancellation patterns and review counts reflect typical customer behavior and industry dynamics. | |
| """) | |
| st.write(""" | |
| Because the right-skewed values above are treated as normal market behavior rather than outliers, no rows are removed. | |
| With the data cleaned, the next steps are feature engineering, followed by model selection, hyperparameter tuning, and evaluation. | |
| """) | |
| # Title for Streamlit App | |
| st.title("Feature Engineering on Dataset") | |
| # Feature Engineering Explanation | |
| st.markdown(""" | |
| ### What is Feature Engineering? | |
| Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models. | |
| It involves techniques such as: | |
| - Encoding categorical variables into numerical values | |
| - Handling class imbalances | |
| - Selecting and transforming features to enhance model accuracy | |
| In this app, we will apply feature engineering techniques to prepare the dataset for analysis and modeling. | |
| """) | |
| # Create a working copy of the dataset already loaded into `data` | |
| st.subheader("Creating a Working Copy of the Dataset") | |
| st.markdown(""" | |
| To ensure that the original dataset remains intact, we create a working copy of the dataset named `df`. | |
| This allows us to make transformations and modifications without altering the uploaded data. | |
| """) | |
| df = data.copy() | |
| st.subheader("Dataset Preview:") | |
| st.write(df.head()) # Display the first 5 rows | |
| st.subheader("Info of the Dataset:") | |
| # Redirect the output of df.info() to a string buffer | |
| buffer = StringIO() | |
| df.info(buf=buffer) | |
| # Display as preformatted text so the column alignment of df.info() is preserved | |
| st.text(buffer.getvalue()) | |
| st.subheader("Dataset Statistics:") | |
| st.write(df.describe()) | |
| st.subheader("Dataset Shape (Rows, Columns):") | |
| st.write(df.shape) | |
| # Checking the number of categories in the 'category' column | |
| st.subheader("Category Distribution") | |
| st.markdown(""" | |
| **Step 1: Mapping Categories** | |
| - The `category` column contains categorical data representing different types of hotels (e.g., Low Budget, Luxury Hotels). | |
| - These categories are mapped to ordered integers (0 for the cheapest tier up to 4 for the most expensive) to make them suitable for machine learning models. | |
| """) | |
| st.write("Category Value Counts (Before Mapping):") | |
| st.write(df["category"].value_counts()) | |
| # Mapping Agoda hotel categories to numerical values | |
| category_mapping = { | |
| "Low Budget": 0, | |
| "Budget Hotels": 1, | |
| "Mid-Range Hotels": 2, | |
| "Premium Hotels": 3, | |
| "Luxury Hotels": 4, | |
| } | |
| df["category"] = df["category"].map(category_mapping) | |
| st.write("Category Value Counts (After Mapping):") | |
| st.write(df["category"].value_counts()) | |
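# A minimal sketch, assuming a hypothetical sample Series: `Series.map` with a dict
# returns NaN for any label missing from the dict, so it can help to list such labels
# after mapping. The label "Boutique" below is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample; "Boutique" is deliberately absent from the mapping.
sample = pd.Series(["Low Budget", "Luxury Hotels", "Boutique", "Budget Hotels"])

category_mapping = {
    "Low Budget": 0,
    "Budget Hotels": 1,
    "Mid-Range Hotels": 2,
    "Premium Hotels": 3,
    "Luxury Hotels": 4,
}

mapped = sample.map(category_mapping)

# Labels that were present before mapping but became NaN afterwards.
unmapped = sorted(sample[mapped.isna() & sample.notna()].unique())
print(unmapped)
```

# If `unmapped` is non-empty, the mapping dict is probably missing a label or a label is spelled differently in the data.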
| # Encoding 'state' column | |
| st.subheader("State Encoding") | |
| st.markdown(""" | |
| **Step 2: Encoding States** | |
| - The `state` column contains categorical location data. | |
| - We encode it into numerical values using `astype('category').cat.codes`, where each unique state is assigned a unique integer. | |
| """) | |
| st.write("State Value Counts (Before Encoding):") | |
| st.write(df["state"].value_counts()) | |
| df["state"] = df["state"].astype("category").cat.codes | |
| st.write("State Value Counts (After Encoding):") | |
| st.write(df["state"].value_counts()) | |
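# A small sketch of how `astype("category").cat.codes` assigns integers, using
# hypothetical state values. The categories are sorted, so codes follow alphabetical
# order, and keeping the code-to-name lookup lets the encoding be reversed later.

```python
import pandas as pd

# Hypothetical values standing in for the `state` column.
states = pd.Series(["Maharashtra", "Goa", "Maharashtra", "Kerala"])

as_category = states.astype("category")
codes = as_category.cat.codes

# Lookup from integer code back to the original state name.
code_to_state = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_state)
```

# Note that missing values receive the code -1, and that the integers carry no meaningful order even though some models may treat them as ordered.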
| # Splitting the dataset into feature and target variables | |
| st.subheader("Splitting Features and Target") | |
| st.markdown(""" | |
| **Step 3: Splitting the Dataset** | |
| - The dataset is split into two parts: | |
| - **Feature Variables (X):** Attributes used for prediction. | |
| - **Target Variable (y):** The value we want to predict, taken here from the last column of the dataset. | |
| """) | |
| feature_variables = df.iloc[:, 0:-1] # Feature variables | |
| target_variable = df.iloc[:, -1] # Target variable | |
| st.write("Feature Variables (from Dataset):", feature_variables.head()) | |
| st.write("Target Variable (from Dataset):", target_variable.head()) | |
| # Selecting specific features for analysis | |
| X = feature_variables[["rating", "reviews", "cashback", "discount", "state", "price"]] | |
| y = target_variable | |
| st.write("Selected Features (X) from Dataset:", X.head()) | |
| st.write("Target Variable (y) from Dataset:", y.head()) | |
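# A hedged sketch of a pre-check that could run before the SMOTE step that follows.
# It assumes SMOTE's default of 5 nearest neighbours, which needs more than 5 samples
# in every class, and it assumes missing values in X or y should be reported first.
# `smote_precheck` and the demo data are hypothetical names, not part of imbalanced-learn.

```python
import pandas as pd


def smote_precheck(X, y, k_neighbors=5):
    """Return a list of problems likely to make SMOTE raise on (X, y)."""
    problems = []
    if X.isna().any().any():
        problems.append("X contains missing values")
    if y.isna().any():
        problems.append("y contains missing values")
    # value_counts() ignores NaN, so only real class labels are counted here.
    counts = y.value_counts()
    for label, n in counts[counts <= k_neighbors].items():
        problems.append(f"class {label} has {n} samples, needs more than {k_neighbors}")
    return problems


# Hypothetical demo data: class 1 has only two samples.
X_demo = pd.DataFrame({"rating": [4.1, 3.9, 4.5, 3.2, 4.0, 3.8, 4.4, 2.9]})
y_demo = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

problems = smote_precheck(X_demo, y_demo)
print(problems)
```

# An empty list means none of these particular problems were found; it does not guarantee that resampling will succeed.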
| # Balancing the Dataset with SMOTE | |
| st.subheader("Balancing Dataset with SMOTE") | |
| st.markdown(""" | |
| **Step 4: Handling Imbalanced Classes** | |
| - Imbalanced datasets can bias models towards majority classes. | |
| - We use **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for underrepresented classes, ensuring a balanced dataset. | |
| """) | |
| smote = SMOTE(random_state=42) | |
| X_res, y_res = smote.fit_resample(X, y) | |
| st.write("Balanced Dataset (X_res):", X_res.head()) | |
| st.write("Balanced Target Variable (y_res) Distribution:") | |
| st.write(y_res.value_counts()) | |
| # Save data in session state | |
| st.session_state['X_res'] = X_res | |
| st.session_state['y_res'] = y_res | |
| else: | |
| st.warning("No dataset found in session state. Please load the dataset into `st.session_state['data']`.") | |
| st.markdown( | |
| """ | |
| <style> | |
| .custom-button { | |
| display: inline-block; | |
| padding: 5px 10px; | |
| font-size: 14px; | |
| color: #ffffff; | |
| background-color: #4CAF50; | |
| border: none; | |
| border-radius: 5px; | |
| text-align: center; | |
| text-decoration: none; | |
| transition: background-color 0.3s ease, transform 0.2s ease; | |
| cursor: pointer; | |
| } | |
| .custom-button:hover { | |
| background-color: #45a049; | |
| transform: scale(1.05); | |
| } | |
| .button-container { | |
| display: flex; | |
| justify-content: space-between; | |
| margin-top: 20px; | |
| } | |
| </style> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| # Navigation Buttons | |
| st.markdown( | |
| """ | |
| <div class="button-container"> | |
| <a href="pages/2_Data_Cleaning_and_Processing" target="_self" class="custom-button">Previous ⏮️</a> | |
| <a href="pages/4_Model_Creation_and_Evaluation" target="_self" class="custom-button">Next ⏭️</a> | |
| </div> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |