Mpavan45 committed on
Commit 536bd2c · verified · 1 Parent(s): c07bfea

Update pages/3_EDA and Feature Engineering.py

pages/3_EDA and Feature Engineering.py CHANGED
@@ -565,6 +565,116 @@ if data is not None:
     Since no outliers were detected, we can proceed with model training and selection.
     With clean data, we can now focus on choosing the best algorithm, tuning hyperparameters, and evaluating model performance.
     """)
+    # Title for Streamlit App
+    st.title("Feature Engineering on Dataset")
+
+    # Feature Engineering Explanation
+    st.markdown("""
+    ### What is Feature Engineering?
+    Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models.
+    It involves techniques such as:
+    - Encoding categorical variables into numerical values
+    - Handling class imbalances
+    - Selecting and transforming features to enhance model accuracy
+
+    In this app, we will apply feature engineering techniques to prepare the dataset for analysis and modeling.
+    """)
+
+    # Load Dataset (you can replace this with actual file upload)
+    # Read the dataset
+    # Creating a working copy of the dataset
+    st.subheader("Creating a Working Copy of the Dataset")
+    st.markdown("""
+    To ensure that the original dataset remains intact, we create a working copy of the dataset named `df`.
+    This allows us to make transformations and modifications without altering the uploaded data.
+    """)
+
+    df = data.copy()
+
+    st.subheader("Dataset Preview:")
+    st.write(df)  # Display the dataset
+
+
+    st.subheader("Info of the Dataset:")
+    # Redirect the output of df.info() to a string buffer
+    buffer = StringIO()
+    df.info(buf=buffer)
+
+    # Display the content in Streamlit
+    st.write(buffer.getvalue())
+
+    st.subheader("Dataset Statistics:")
+    st.write(df.describe())
+
+    st.subheader("Dataset Shape (Rows, Columns):")
+    st.write(df.shape)
+
+    # Checking the number of categories in the 'category' column
+    st.subheader("Category Distribution")
+    st.markdown("""
+    **Step 1: Mapping Categories**
+    - The `category` column contains categorical data representing different types of hotels (e.g., Low Budget, Luxury).
+    - These categories are mapped to numerical values to make them suitable for machine learning models.
+    """)
+    st.write("Category Value Counts (Before Mapping):")
+    st.write(df["category"].value_counts())
+
+    # Mapping Agoda hotel categories to numerical values
+    category_mapping = {
+        "Low Budget": 0,
+        "Budget Hotels": 1,
+        "Mid-Range Hotels": 2,
+        "Premium Hotels": 3,
+        "Luxury Hotels": 4,
+    }
+    df["category"] = df["category"].map(category_mapping)
+    st.write("Category Value Counts (After Mapping):")
+    st.write(df["category"].value_counts())
+
+    # Encoding 'state' column
+    st.subheader("State Encoding")
+    st.markdown("""
+    **Step 2: Encoding States**
+    - The `state` column contains categorical location data.
+    - We encode it into numerical values using `astype('category').cat.codes`, where each unique state is assigned a unique integer.
+    """)
+    st.write("State Value Counts (Before Encoding):")
+    st.write(df["state"].value_counts())
+    df["state"] = df["state"].astype("category").cat.codes
+    st.write("State Value Counts (After Encoding):")
+    st.write(df["state"].value_counts())
+
+    # Splitting the dataset into feature and target variables
+    st.subheader("Splitting Features and Target")
+    st.markdown("""
+    **Step 3: Splitting the Dataset**
+    - The dataset is split into two parts:
+      - **Feature Variables (X):** Attributes used for prediction.
+      - **Target Variable (y):** The value we want to predict (e.g., `price` or `category`).
+    """)
+    feature_variables = df.iloc[:, 0:-1]  # Feature variables (all columns except the last)
+    target_variable = df.iloc[:, -1]  # Target variable (last column)
+    st.write("Feature Variables (from Dataset):", feature_variables.head())
+    st.write("Target Variable (from Dataset):", target_variable.head())
+
+    # Selecting specific features for analysis
+    X = feature_variables[["rating", "reviews", "cashback", "discount", "state", "price"]]
+    y = target_variable
+    st.write("Selected Features (X) from Dataset:", X.head())
+    st.write("Target Variable (y) from Dataset:", y.head())
+
+    # Balancing the Dataset with SMOTE
+    st.subheader("Balancing Dataset with SMOTE")
+    st.markdown("""
+    **Step 4: Handling Imbalanced Classes**
+    - Imbalanced datasets can bias models towards majority classes.
+    - We use **SMOTE (Synthetic Minority Over-sampling Technique)** to generate synthetic samples for underrepresented classes, ensuring a balanced dataset.
+    """)
+    smote = SMOTE(random_state=42)
+    X_res, y_res = smote.fit_resample(X, y)
+    st.write("Balanced Dataset (X_res):", X_res.head())
+    st.write("Balanced Target Variable (y_res) Distribution:")
+    st.write(y_res.value_counts())
 
 else:
     st.warning("No dataset found in session state. Please load the dataset into `st.session_state['data']`.")
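
The category mapping in Step 1 uses `Series.map`, which silently returns `NaN` for any label missing from the dict, so a quick `NaN` check afterwards catches typos or unexpected categories. A minimal sketch of this step, using made-up sample rows in place of the real hotel dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the dataset's 'category' column.
df = pd.DataFrame({"category": ["Low Budget", "Luxury Hotels", "Budget Hotels"]})

# Same mapping as in the commit. Series.map returns NaN for unmapped labels,
# so asserting no NaNs verifies every category was covered.
category_mapping = {
    "Low Budget": 0,
    "Budget Hotels": 1,
    "Mid-Range Hotels": 2,
    "Premium Hotels": 3,
    "Luxury Hotels": 4,
}
df["category"] = df["category"].map(category_mapping)
assert df["category"].isna().sum() == 0  # all labels were mapped
print(df["category"].tolist())  # [0, 4, 1]
```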
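
One caveat worth knowing about the Step 2 encoding: `astype('category').cat.codes` numbers the categories in sorted order of the values present in that particular Series, so the same state can receive a different code on a different dataset. A small sketch with hypothetical state names:

```python
import pandas as pd

# Hypothetical state values; the real column comes from the uploaded dataset.
states = pd.Series(["Goa", "Kerala", "Goa", "Delhi"])

# cat.codes numbers the sorted categories: Delhi=0, Goa=1, Kerala=2.
codes = states.astype("category").cat.codes
print(codes.tolist())  # [1, 2, 1, 0]
```

Because the codes depend on the observed values, any train/test split should be encoded together (or via a fitted encoder) to keep the mapping consistent.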
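
The Step 3 split is purely positional: `iloc[:, 0:-1]` and `iloc[:, -1]` assume the target sits in the last column of the frame. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical frame where, as in the commit, the last column is the target.
df = pd.DataFrame({
    "rating": [4.1, 3.5, 4.8],
    "price": [120, 80, 300],
    "category": [2, 1, 4],
})
feature_variables = df.iloc[:, 0:-1]  # every column except the last
target_variable = df.iloc[:, -1]      # the last column only
print(list(feature_variables.columns))  # ['rating', 'price']
print(target_variable.tolist())         # [2, 1, 4]
```

If the column order ever changes, a name-based split such as `df.drop(columns=["category"])` is more robust than the positional one.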