Mpavan45 committed on
Commit 536bd2c · verified · 1 Parent(s): c07bfea

Update pages/3_EDA and Feature Engineering.py

pages/3_EDA and Feature Engineering.py CHANGED
@@ -565,6 +565,116 @@ if data is not None:
     Since no outliers were detected, we can proceed with model training and selection.
     With clean data, we can now focus on choosing the best algorithm, tuning hyperparameters, and evaluating model performance.
     """)
+    # Title for Streamlit App
+    st.title("Feature Engineering on Dataset")
+
+    # Feature Engineering Explanation
+    st.markdown("""
+    ### What is Feature Engineering?
+    Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models.
+    It involves techniques such as:
+    - Encoding categorical variables into numerical values
+    - Handling class imbalances
+    - Selecting and transforming features to enhance model accuracy
+
+    In this app, we will apply feature engineering techniques to prepare the dataset for analysis and modeling.
+    """)
+
+    # Load Dataset (you can replace this with actual file upload)
+    # Read the dataset
+    # Creating a working copy of the dataset
+    st.subheader("Creating a Working Copy of the Dataset")
+    st.markdown("""
+    To ensure that the original dataset remains intact, we create a working copy of the dataset named `df`.
+    This allows us to make transformations and modifications without altering the uploaded data.
+    """)
+
+    df = data.copy()
+
+    st.subheader("Dataset Preview:")
+    st.write(df)  # Display the dataset
+
+
+    st.subheader("Info of the Dataset:")
+    # Redirect the output of df.info() to a string buffer
+    buffer = StringIO()
+    df.info(buf=buffer)
+
+    # Display the content in Streamlit
+    st.write(buffer.getvalue())
+
+    st.subheader("Dataset Statistics:")
+    st.write(df.describe())
+
+    st.subheader("Dataset Shape (Rows, Columns):")
+    st.write(df.shape)
+
+    # Checking the number of categories in the 'category' column
+    st.subheader("Category Distribution")
+    st.markdown("""
+    **Step 1: Mapping Categories**
+    - The `category` column contains categorical data representing different types of hotels (e.g., Low Budget, Luxury).
+    - These categories are mapped to numerical values to make them suitable for machine learning models.
+    """)
+    st.write("Category Value Counts (Before Mapping):")
+    st.write(df["category"].value_counts())
+
+    # Mapping Agoda hotel categories to numerical values
+    category_mapping = {
+        "Low Budget": 0,
+        "Budget Hotels": 1,
+        "Mid-Range Hotels": 2,
+        "Premium Hotels": 3,
+        "Luxury Hotels": 4,
+    }
+    df["category"] = df["category"].map(category_mapping)
+    st.write("Category Value Counts (After Mapping):")
+    st.write(df["category"].value_counts())
+
+    # Encoding 'state' column
+    st.subheader("State Encoding")
+    st.markdown("""
+    **Step 2: Encoding States**
+    - The `state` column contains categorical location data.
+    - We encode it into numerical values using `astype('category').cat.codes`, where each unique state is assigned a unique integer.
+    """)
+    st.write("State Value Counts (Before Encoding):")
+    st.write(df["state"].value_counts())
+    df["state"] = df["state"].astype("category").cat.codes
+    st.write("State Value Counts (After Encoding):")
+    st.write(df["state"].value_counts())
+
+    # Splitting the dataset into feature and target variables
+    st.subheader("Splitting Features and Target")
+    st.markdown("""
+    **Step 3: Splitting the Dataset**
+    - The dataset is split into two parts:
+      - **Feature Variables (X):** Attributes used for prediction.
+      - **Target Variable (y):** The value we want to predict (e.g., `price` or `category`).
+    """)
+    feature_variables = df.iloc[:, 0:-1]  # Feature variables (all columns except the last)
+    target_variable = df.iloc[:, -1]  # Target variable (last column)
+    st.write("Feature Variables (from Dataset):", feature_variables.head())
+    st.write("Target Variable (from Dataset):", target_variable.head())
+
+    # Selecting specific features for analysis
+    X = feature_variables[["rating", "reviews", "cashback", "discount", "state", "price"]]
+    y = target_variable
+    st.write("Selected Features (X) from Dataset:", X.head())
+    st.write("Target Variable (y) from Dataset:", y.head())
+
+    # Balancing the Dataset with SMOTE
+    st.subheader("Balancing Dataset with SMOTE")
+    st.markdown("""
+    **Step 4: Handling Imbalanced Classes**
+    - Imbalanced datasets can bias models towards majority classes.
+    - We use **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for underrepresented classes, ensuring a balanced dataset.
+    """)
+    smote = SMOTE(random_state=42)
+    X_res, y_res = smote.fit_resample(X, y)
+    st.write("Balanced Dataset (X_res):", X_res.head())
+    st.write("Balanced Target Variable (y_res) Distribution:")
+    st.write(y_res.value_counts())
 
 else:
     st.warning("No dataset found in session state. Please load the dataset into `st.session_state['data']`.")
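
The category mapping in Step 1 uses `Series.map`, which silently returns `NaN` for any label missing from the dict, so a quick `NaN` check afterwards catches typos or unexpected categories. A minimal sketch of this step, using made-up sample rows in place of the real hotel dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the dataset's 'category' column.
df = pd.DataFrame({"category": ["Low Budget", "Luxury Hotels", "Budget Hotels"]})

# Same mapping as in the commit. Series.map returns NaN for unmapped labels,
# so asserting no NaNs verifies every category was covered.
category_mapping = {
    "Low Budget": 0,
    "Budget Hotels": 1,
    "Mid-Range Hotels": 2,
    "Premium Hotels": 3,
    "Luxury Hotels": 4,
}
df["category"] = df["category"].map(category_mapping)
assert df["category"].isna().sum() == 0  # all labels were mapped
print(df["category"].tolist())  # [0, 4, 1]
```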
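
One caveat worth knowing about the Step 2 encoding: `astype('category').cat.codes` numbers the categories in sorted order of the values present in that particular Series, so the same state can receive a different code on a different dataset. A small sketch with hypothetical state names:

```python
import pandas as pd

# Hypothetical state values; the real column comes from the uploaded dataset.
states = pd.Series(["Goa", "Kerala", "Goa", "Delhi"])

# cat.codes numbers the sorted categories: Delhi=0, Goa=1, Kerala=2.
codes = states.astype("category").cat.codes
print(codes.tolist())  # [1, 2, 1, 0]
```

Because the codes depend on the observed values, any train/test split should be encoded together (or via a fitted encoder) to keep the mapping consistent.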
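
The Step 3 split is purely positional: `iloc[:, 0:-1]` and `iloc[:, -1]` assume the target sits in the last column of the frame. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical frame where, as in the commit, the last column is the target.
df = pd.DataFrame({
    "rating": [4.1, 3.5, 4.8],
    "price": [120, 80, 300],
    "category": [2, 1, 4],
})
feature_variables = df.iloc[:, 0:-1]  # every column except the last
target_variable = df.iloc[:, -1]      # the last column only
print(list(feature_variables.columns))  # ['rating', 'price']
print(target_variable.tolist())         # [2, 1, 4]
```

If the column order ever changes, a name-based split such as `df.drop(columns=["category"])` is more robust than the positional one.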