Yoel125 committed on
Commit
e5548db
·
verified ·
1 Parent(s): 28e48e2

Update README.md

  1. README.md +30 -32
README.md CHANGED
@@ -35,17 +35,17 @@ https://huggingface.co/datasets/drukeroni/airline-satisfaction-analysis
35
  In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
36
 
37
  #### 1. Data Cleaning
38
- selected the 15 features that are most relevant to my research question.
39
- checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
40
- verified that there are no duplicate rows in the dataset.
41
- standardized all text columns by converting them to lowercase and removing extra spaces.
42
  #### 2. Descriptive Statistics & Data Structure
43
- checked the data types and the final shape of the table after cleaning.
44
- calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
45
- checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.
46
  #### 3. Outlier Detection
47
- used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
48
- calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
49
  #### Data Exploration: Answering Key Research Questions through Visualization
50
  #### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
51
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oIXebKJ0KqMjckAlGjYn-.png)
@@ -74,14 +74,14 @@ Removing the extreme anomalies allows us to visualize the core data patterns tha
74
  # Part 3: Baseline Regression Modeling
75
  In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
76
  ### 1. Data Preparation & Feature Selection
77
- defined Arrival Delay in Minutes as the target variable.
78
- selected 8 key numerical features (like Age and Flight Distance) as predictors.
79
- dropped missing values again just to be 100% sure the data is completely clean for the model.
80
  #### 2. Model Training
81
- split the data into 80% training and 20% testing sets (using random_state=42 for reproducibility).
82
- trained a basic Linear Regression model to learn the relationship between the features and delays.
83
  #### 3. Performance Evaluation
84
- evaluated the model's accuracy using MAE, MSE, RMSE, and R-squared.
85
 
86
  #### Baseline Model: Actual vs Predicted Arrival Delays
87
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/_swJewXEptmPng7GTQgY_.png)
@@ -95,17 +95,17 @@ Other features, such as passenger age or flight distance, show very low importan
95
  # Part 4: Advanced Feature Engineering & Preprocessing
96
  In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
97
  #### 1. Creating the Classification Target
98
- converted the regression problem into a classification task by creating a binary feature, is_delayed.
99
- A flight is labeled as "Delayed" (1) if the arrival delay is greater than 15 minutes, otherwise it is labeled (0).
100
  #### 2. Categorical Encoding
101
- transformed all text-based features (like Gender, Class, and Customer Type) into numeric format using One-Hot Encoding.
102
- used the drop_first=True logic to prevent multi-collinearity, ensuring the model remains statistically stable.
103
  #### 3. Feature Scaling
104
- applied StandardScaler to numerical columns such as Age and Flight Distance.
105
- This normalizes the data so that features with larger scales do not unfairly dominate the model's learning process.
106
  #### 4. Feature Engineering with Unsupervised Learning (K-Means)
107
- used K-Means Clustering to group passengers into 3 distinct "Service Profiles" based on their ratings of inflight services (Wi-Fi, Cleanliness, etc.).
108
- This new feature, passenger_profile, allows the model to understand complex patterns of passenger satisfaction.
109
  #### 5. Cluster Visualization (PCA)
110
  To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
111
  This visual confirmation ensures that the groups created by the K-Means algorithm are distinct and meaningful.
@@ -116,23 +116,21 @@ This shows that passengers are grouped into three distinct "service profiles" ba
116
  # Part 5: Model Training & Evaluation
117
  In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
118
  #### 1. Updated Linear Regression (Refined Baseline)
119
- re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables.
120
  This allowed me to see how much the additional feature engineering improved the initial baseline performance.
121
  ### 2. Decision Tree Regressor
122
- implemented a Decision Tree model to capture non-linear relationships between the features.
123
  To prevent overfitting, I set a max_depth=5, ensuring the model remains generalized and performs well on unseen data.
124
  ### 3. Random Forest Regressor (Ensemble Method)
125
- trained a Random Forest model consisting of 100 individual trees.
126
  By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
127
- score compared to a single decision tree.
128
  ### 4. Performance Comparison & Visualization
129
- created a Comparison Bar Chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
130
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/rPEwtAabowCwz5O5_FH7O.png)
131
  We can see that the Random Forest model performs best: it aggregates multiple decision trees, which reduces variance and limits overfitting, leading to more stable and accurate predictions of arrival delays.
132
- Part 7: Regression-to-Classification
133
- In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
134
-
135
- # Part 7
136
  ### 7.1 Creating Classes from Numeric Target
137
  Conversion Strategy: applied a Business Rule Threshold to convert the continuous target into discrete categories.
138
  Threshold Selection: defined the cutoff point at 0 minutes.
 
35
  In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
36
 
37
  #### 1. Data Cleaning
38
+ Selected the 15 features that are most relevant to my research question.
39
+ Checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
40
+ Verified that there are no duplicate rows in the dataset.
41
+ Standardized all text columns by converting them to lowercase and removing extra spaces.
42
  #### 2. Descriptive Statistics & Data Structure
43
+ Checked the data types and the final shape of the table after cleaning.
44
+ Calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
45
+ Checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.
46
  #### 3. Outlier Detection
47
+ Used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
48
+ Calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
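
The IQR step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `iqr_outlier_share`, the conventional 1.5 multiplier, and the toy values are all assumptions.

```python
import pandas as pd

def iqr_outlier_share(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.Series:
    """Return the percentage of rows flagged as IQR outliers for each column."""
    shares = {}
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # Values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers.
        lower, upper = q1 - k * iqr, q3 + k * iqr
        is_outlier = (df[col] < lower) | (df[col] > upper)
        shares[col] = 100 * is_outlier.mean()
    return pd.Series(shares, name="outlier_pct")

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Age": [25, 31, 40, 38, 29, 35, 33, 27, 45, 30],
    "Flight Distance": [400, 450, 500, 520, 610, 480, 550, 470, 530, 4900],
})
outlier_pct = iqr_outlier_share(toy, ["Age", "Flight Distance"])
print(outlier_pct)
```

On the real data, the same helper would be called with the cleaned DataFrame and the numeric columns of interest.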
49
  #### Data Exploration: Answering Key Research Questions through Visualization
50
  #### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
51
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oIXebKJ0KqMjckAlGjYn-.png)
 
74
  # Part 3: Baseline Regression Modeling
75
  In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
76
  ### 1. Data Preparation & Feature Selection
77
+ Defined Arrival Delay in Minutes as the target variable.
78
+ Selected 8 key numerical features (like Age and Flight Distance) as predictors.
79
+ Dropped missing values again just to be 100% sure the data is completely clean for the model.
80
  #### 2. Model Training
81
+ Split the data into 80% training and 20% testing sets (using random_state=42 for reproducibility).
82
+ Trained a basic Linear Regression model to learn the relationship between the features and delays.
83
  #### 3. Performance Evaluation
84
+ Evaluated the model's performance using MAE, MSE, RMSE, and R-squared.
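
As a rough sketch of the three steps above (split, train, evaluate), the snippet below runs the same workflow with scikit-learn on synthetic data. The random features and target are stand-ins, so the printed metrics say nothing about the real airline dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 numeric predictors and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ np.arange(1, 9) + rng.normal(scale=0.5, size=500)

# 80/20 split with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```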
85
 
86
  #### Baseline Model: Actual vs Predicted Arrival Delays
87
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/_swJewXEptmPng7GTQgY_.png)
 
95
  # Part 4: Advanced Feature Engineering & Preprocessing
96
  In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
97
  #### 1. Creating the Classification Target
98
+ Created a binary target for the classification task.
99
+ A flight is marked as 'delayed' (1) if the arrival delay exceeds 15 minutes, and as 0 otherwise.
100
  #### 2. Categorical Encoding
101
+ Converted categorical text features into numerical values.
102
+ This process enables the machine learning models to process non-numeric data columns.
103
  #### 3. Feature Scaling
104
+ Scaled the numerical features using the StandardScaler tool from sklearn.
105
+ This ensures all features have a mean of 0 and a standard deviation of 1.
106
  #### 4. Feature Engineering with Unsupervised Learning (K-Means)
107
+ Used K-Means clustering to group passengers based on their service ratings.
108
+ The resulting cluster ID is added as a new feature to represent a 'passenger profile'.
109
  #### 5. Cluster Visualization (PCA)
110
  To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
111
  This visual confirmation ensures that the groups created by the K-Means algorithm are distinct and meaningful.
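
The five preprocessing steps in this part could look roughly like the sketch below. It is only an illustration on a made-up six-row table: the column subset, the choice of 3 clusters with `n_init=10`, and the name `passenger_profile` are assumptions here rather than a copy of the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Gender": ["male", "female", "female", "male", "female", "male"],
    "Age": [25, 41, 33, 52, 29, 47],
    "Flight Distance": [300, 1200, 800, 2500, 450, 1900],
    "Inflight wifi service": [1, 5, 3, 5, 1, 3],
    "Seat comfort": [2, 5, 3, 4, 1, 3],
    "Cleanliness": [2, 5, 3, 4, 1, 3],
    "Arrival Delay in Minutes": [0, 30, 5, 16, 15, 120],
})

# 1. Binary target: 1 when the arrival delay exceeds 15 minutes, else 0.
df["is_delayed"] = (df["Arrival Delay in Minutes"] > 15).astype(int)

# 2. One-hot encode text columns; drop_first removes one redundant dummy.
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)

# 3. Standardize numeric columns to mean 0 and standard deviation 1.
num_cols = ["Age", "Flight Distance"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols].astype(float))

# 4. Cluster passengers on their service ratings; keep the cluster ID as a feature.
rating_cols = ["Inflight wifi service", "Seat comfort", "Cleanliness"]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["passenger_profile"] = kmeans.fit_predict(df[rating_cols])

# 5. Project the ratings onto two principal components for a scatter plot.
coords = PCA(n_components=2).fit_transform(df[rating_cols])
print(df[["is_delayed", "passenger_profile"]])
```

The `coords` array can then be plotted with the cluster ID as the color to check visually whether the groups separate.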
 
116
  # Part 5: Model Training & Evaluation
117
  In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
118
  #### 1. Updated Linear Regression (Refined Baseline)
119
+ Re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables.
120
  This allowed me to see how much the additional feature engineering improved the initial baseline performance.
121
  ### 2. Decision Tree Regressor
122
+ Implemented a Decision Tree model to capture non-linear relationships between the features.
123
  To prevent overfitting, I set a max_depth=5, ensuring the model remains generalized and performs well on unseen data.
124
  ### 3. Random Forest Regressor (Ensemble Method)
125
+ Trained a Random Forest model consisting of 100 individual trees.
126
  By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
127
+ score compared to a single decision tree.
128
  ### 4. Performance Comparison & Visualization
129
+ Created a Comparison Bar Chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
130
  ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/rPEwtAabowCwz5O5_FH7O.png)
131
  We can see that the Random Forest model performs best: it aggregates multiple decision trees, which reduces variance and limits overfitting, leading to more stable and accurate predictions of arrival delays.
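
A comparison like the one above can be set up along these lines. The sketch uses synthetic data with a deliberately non-linear signal, so the ranking it prints is not evidence about the airline dataset; only `max_depth=5`, 100 trees, and the 80/20 split with `random_state=42` come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one squared term, one linear term, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```

The `scores` dictionary is the kind of input a bar chart of R² per model would be drawn from.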
132
+ # Part 7: Regression-to-Classification
133
+ In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
 
 
134
  ### 7.1 Creating Classes from Numeric Target
135
  Conversion Strategy: applied a Business Rule Threshold to convert the continuous target into discrete categories.
136
  Threshold Selection: defined the cutoff point at 0 minutes.
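
A minimal sketch of this conversion is shown below. The text only fixes the cutoff at 0 minutes; treating any delay strictly greater than 0 as class 1 is an assumption, as are the helper name `delay_to_class` and the toy values.

```python
import pandas as pd

def delay_to_class(delay_minutes: pd.Series, threshold: float = 0) -> pd.Series:
    """Label a flight 1 when its arrival delay exceeds the threshold, else 0."""
    return (delay_minutes > threshold).astype(int)

# Toy delays in minutes (made-up values).
delays = pd.Series([0, 3, 0, 45, 12, 0], name="Arrival Delay in Minutes")
labels = delay_to_class(delays)

# Class balance is worth checking before training a classifier.
print(labels.value_counts(normalize=True))
```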