Yoel125
/

Assignment_2_data_science

Yoel125 committed on May 1

Commit

e5548db

verified ·

1 Parent(s): 28e48e2

Update README.md

Files changed (1)

README.md +30 -32


@@ -35,17 +35,17 @@ https://huggingface.co/datasets/drukeroni/airline-satisfaction-analysis
 In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
 #### 1. Data Cleaning
-selected the 15 features that are most relevant to my research question.
-checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
-verified that there are no duplicate rows in the dataset.
-standardized all text columns by converting them to lowercase and removing extra spaces.
 #### 2. Descriptive Statistics & Data Structure
-checked the data types and the final shape of the table after cleaning.
-calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
-checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.
 #### 3. Outlier Detection
-used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
-calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
 #### Data Exploration: Answering Key Research Questions through Visualization
 #### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oIXebKJ0KqMjckAlGjYn-.png)
@@ -74,14 +74,14 @@ Removing the extreme anomalies allows us to visualize the core data patterns tha
 # Part 3: Baseline Regression Modeling
 In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
 ### 1. Data Preparation & Feature Selection
-defined Arrival Delay in Minutes as the target variable.
-selected 8 key numerical features (like Age and Flight Distance) as predictors.
-dropped missing values again just to be 100% sure the data is completely clean for the model.
 #### 2. Model Training
-splited the data into 80% training and 20% testing sets (using random_state=42 for consistency).
-trained a basic Linear Regression model to learn the relationship between the features and delays.
 #### 3. Performance Evaluation
-evaluated the model's accuracy using MAE, MSE, RMSE, and R-squared.
 #### Baseline Model: Actual vs Predicted Arrival Delays
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/_swJewXEptmPng7GTQgY_.png)
@@ -95,17 +95,17 @@ Other features, such as passenger age or flight distance, show very low importan
 # Part 4: Advanced Feature Engineering & Preprocessing
 In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
 #### 1. Creating the Classification Target
-converted the regression problem into a classification task by creating a binary feature, is_delayed.
-A flight is labeled as "Delayed" (1) if the arrival delay is greater than 15 minutes, otherwise it is labeled (0).
 #### 2. Categorical Encoding
-transformed all text-based features (like Gender, Class, and Customer Type) into numeric format using One-Hot Encoding.
-used the drop_first=True logic to prevent multi-collinearity, ensuring the model remains statistically stable.
 #### 3. Feature Scaling
-applied StandardScaler to numerical columns such as Age and Flight Distance.
-This normalizes the data so that features with larger scales do not unfairly dominate the model's learning process.
 #### 4. Feature Engineering with Unsupervised Learning (K-Means)
-used K-Means Clustering to group passengers into 3 distinct "Service Profiles" based on their ratings of inflight services (Wi-Fi, Cleanliness, etc.).
-This new feature, passenger_profile, allows the model to understand complex patterns of passenger satisfaction.
 #### 5. Cluster Visualization (PCA)
 To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
 This plot serves as a visual check that the groups created by the K-Means algorithm are distinct and meaningful.
@@ -116,23 +116,21 @@ This shows that passengers are grouped into three distinct "service profiles" ba
 # Part 5: Model Training & Evaluation
 In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
 #### 1. Updated Linear Regression (Refined Baseline)
-re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables.
 This allowed me to see how much the additional feature engineering improved the initial baseline performance.
 ###  2. Decision Tree Regressor
-implemented a Decision Tree model to capture non-linear relationships between the features.
 To limit overfitting, I set max_depth=5, which caps the tree's complexity and helps the model generalize to unseen data.
 ### 3. Random Forest Regressor (Ensemble Method)
-trained a Random Forest model consisting of 100 individual trees.
 By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
-score compared to a single decision tree.
 ### 4. Performance Comparison & Visualization
-created a Comparison Bar Chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/rPEwtAabowCwz5O5_FH7O.png)
 We can see that the Random Forest model performs best. Because it aggregates multiple decision trees, it reduces variance and limits overfitting, leading to more stable and accurate predictions for arrival delays.
-Part 7: Regression-to-Classification
-In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
-# Part 7
 ### 7.1 Creating Classes from Numeric Target
 Conversion Strategy: Applied a business-rule threshold to convert the continuous target into discrete categories.
 Threshold Selection: Defined the cutoff point at 0 minutes.

 In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
 #### 1. Data Cleaning
+Selected the 15 features that are most relevant to my research question.
+Checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
+Verified that there are no duplicate rows in the dataset.
+Standardized all text columns by converting them to lowercase and removing extra spaces.
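The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `clean_flights`, its parameters, and the tiny sample table are assumptions made for the example. Only the target column name comes from the README.

```python
import pandas as pd

TARGET = "Arrival Delay in Minutes"  # target column named in the README


def clean_flights(df, feature_cols, text_cols):
    """Hypothetical helper mirroring the four cleaning steps listed above."""
    # 1. Keep only the selected features plus the target.
    out = df[feature_cols + [TARGET]].copy()
    # 2. Drop rows where the target is missing.
    out = out.dropna(subset=[TARGET])
    # 3. Count duplicate rows (the README reports that none were found).
    n_duplicates = int(out.duplicated().sum())
    # 4. Standardize text: trim, lowercase, collapse repeated spaces.
    for col in text_cols:
        out[col] = (
            out[col].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
        )
    return out.reset_index(drop=True), n_duplicates


# Made-up sample rows, for illustration only.
raw = pd.DataFrame(
    {
        "Gender": [" Male ", "female", "FEMALE  "],
        "Age": [30, 41, 25],
        TARGET: [5.0, None, 0.0],
    }
)
cleaned, n_duplicates = clean_flights(raw, ["Gender", "Age"], ["Gender"])
```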
 #### 2. Descriptive Statistics & Data Structure
+Checked the data types and the final shape of the table after cleaning.
+Calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
+Checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.
 #### 3. Outlier Detection
+Used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
+Calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
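As a rough sketch of the IQR rule described above: a value counts as an outlier when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The function name, the multiplier default of 1.5, and the sample values are assumptions for illustration. The README does not state which multiplier or quantile method was used.

```python
from statistics import quantiles


def iqr_outlier_share(values, k=1.5):
    """Return (lower_fence, upper_fence, percent_outliers) for a list of numbers."""
    # "inclusive" uses linear interpolation between the sorted data points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_outliers = sum(1 for v in values if v < lower or v > upper)
    return lower, upper, 100.0 * n_outliers / len(values)


# Made-up delay values with one extreme entry.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 400]
lower, upper, pct = iqr_outlier_share(sample)
```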
 #### Data Exploration: Answering Key Research Questions through Visualization
 #### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oIXebKJ0KqMjckAlGjYn-.png)
 # Part 3: Baseline Regression Modeling
 In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
 ### 1. Data Preparation & Feature Selection
+Defined Arrival Delay in Minutes as the target variable.
+Selected 8 key numerical features (like Age and Flight Distance) as predictors.
+Dropped any remaining rows with missing values so the model trains only on complete data.
 #### 2. Model Training
+Split the data into 80% training and 20% testing sets (using random_state=42 for consistency).
+Trained a basic Linear Regression model to learn the relationship between the features and delays.
 #### 3. Performance Evaluation
+Evaluated the model's performance using MAE, MSE, RMSE, and R-squared.
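For reference, the four metrics can be computed by hand as in the sketch below. The function name and the sample numbers are made up for illustration. In practice, `sklearn.metrics` provides equivalent functions.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = sqrt(mse)                               # same units as the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot                     # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Illustrative values only, not results from the dataset.
metrics = regression_metrics([0, 10, 20, 30], [2, 8, 22, 28])
```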
 #### Baseline Model: Actual vs Predicted Arrival Delays
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/_swJewXEptmPng7GTQgY_.png)
 # Part 4: Advanced Feature Engineering & Preprocessing
 In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
 #### 1. Creating the Classification Target
+Created a binary target for the classification task.
+A flight is marked as 'delayed' (1) if the arrival delay exceeds 15 minutes, and as 0 otherwise.
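A minimal sketch of this labeling rule, assuming a strict "greater than" comparison as the wording suggests. The function name `label_delayed` is hypothetical.

```python
def label_delayed(delay_minutes, threshold=15):
    """Return 1 for delays strictly above the threshold, otherwise 0."""
    return [1 if d > threshold else 0 for d in delay_minutes]


# A delay of exactly 15 minutes is not counted as delayed under this rule.
is_delayed = label_delayed([0, 15, 16, 120])
```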
 #### 2. Categorical Encoding
+Converted categorical text features into numerical values.
+This process enables the machine learning models to process non-numeric data columns.
 #### 3. Feature Scaling
+Scaled the numerical features using the StandardScaler tool from sklearn.
+This ensures all features have a mean of 0 and a standard deviation of 1.
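What the scaling does to a single column can be sketched without sklearn: subtract the mean and divide by the population standard deviation, which is the convention `StandardScaler` follows. The function name and the sample ages are assumptions for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a list of numbers: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled_age = standardize([20, 30, 40, 50, 60])
```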
 #### 4. Feature Engineering with Unsupervised Learning (K-Means)
+Used K-Means clustering to group passengers based on their service ratings.
+The resulting cluster ID is added as a new feature to represent a 'passenger profile'.
 #### 5. Cluster Visualization (PCA)
 To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
 This plot serves as a visual check that the groups created by the K-Means algorithm are distinct and meaningful.
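The clustering and projection steps might look roughly like the sketch below. It runs on synthetic ratings, not the airline data. The number of rating columns, the random seed, and the `n_init` value are assumptions. The three clusters follow the "three service profiles" described in the README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic 1-5 ratings for four hypothetical service columns, three loose groups.
centers = np.array([[1.5] * 4, [3.0] * 4, [4.5] * 4])
ratings = np.clip(
    np.vstack([rng.normal(c, 0.4, size=(50, 4)) for c in centers]), 1, 5
)

# Assign each passenger to one of three clusters; the ID becomes a new feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
passenger_profile = kmeans.fit_predict(ratings)

# Project the ratings onto two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(ratings)
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by `passenger_profile`, gives the kind of cluster view described above.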
 # Part 5: Model Training & Evaluation
 In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
 #### 1. Updated Linear Regression (Refined Baseline)
+Re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables.
 This allowed me to see how much the additional feature engineering improved the initial baseline performance.
 ###  2. Decision Tree Regressor
+Implemented a Decision Tree model to capture non-linear relationships between the features.
 To limit overfitting, I set max_depth=5, which caps the tree's complexity and helps the model generalize to unseen data.
 ### 3. Random Forest Regressor (Ensemble Method)
+Trained a Random Forest model consisting of 100 individual trees.
 By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
+score compared to a single decision tree.
 ### 4. Performance Comparison & Visualization
+Created a Comparison Bar Chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/rPEwtAabowCwz5O5_FH7O.png)
 We can see that the Random Forest model performs best. Because it aggregates multiple decision trees, it reduces variance and limits overfitting, leading to more stable and accurate predictions for arrival delays.
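A comparison like the one above could be set up along these lines. This sketch uses synthetic data, so the resulting scores say nothing about the airline dataset. The hyperparameters `max_depth=5`, `n_estimators=100`, the 80/20 split, and `random_state=42` come from the README. The feature matrix and target are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic features and target with one linear and one non-linear effect.
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))
```

The `r2_scores` dictionary can then be passed to a bar chart to compare the models side by side.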
+# Part 7: Regression-to-Classification
+In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
 ### 7.1 Creating Classes from Numeric Target
 Conversion Strategy: Applied a business-rule threshold to convert the continuous target into discrete categories.
 Threshold Selection: Defined the cutoff point at 0 minutes.