Update README.md
README.md
CHANGED
|
@@ -35,17 +35,17 @@ https://huggingface.co/datasets/drukeroni/airline-satisfaction-analysis
|
|
| 35 |
In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
|
| 36 |
|
| 37 |
#### 1. Data Cleaning
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
#### 2. Descriptive Statistics & Data Structure
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
#### 3. Outlier Detection
|
| 47 |
-
|
| 48 |
-
|
| 49 |
#### Data Exploration: Answering Key Research Questions through Visualization
|
| 50 |
#### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
|
| 51 |

|
|
@@ -74,14 +74,14 @@ Removing the extreme anomalies allows us to visualize the core data patterns tha
|
|
| 74 |
# Part 3: Baseline Regression Modeling
|
| 75 |
In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
|
| 76 |
#### 1. Data Preparation & Feature Selection
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
#### 2. Model Training
|
| 81 |
-
|
| 82 |
-
|
| 83 |
#### 3. Performance Evaluation
|
| 84 |
-
|
| 85 |
|
| 86 |
#### Baseline Model: Actual vs Predicted Arrival Delays
|
| 87 |

|
|
@@ -95,17 +95,17 @@ Other features, such as passenger age or flight distance, show very low importan
|
|
| 95 |
# Part 4: Advanced Feature Engineering & Preprocessing
|
| 96 |
In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
|
| 97 |
#### 1. Creating the Classification Target
|
| 98 |
-
|
| 99 |
-
A flight is
|
| 100 |
#### 2. Categorical Encoding
|
| 101 |
-
|
| 102 |
-
|
| 103 |
#### 3. Feature Scaling
|
| 104 |
-
|
| 105 |
-
This
|
| 106 |
#### 4. Feature Engineering with Unsupervised Learning (K-Means)
|
| 107 |
-
|
| 108 |
-
|
| 109 |
#### 5. Cluster Visualization (PCA)
|
| 110 |
To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
|
| 111 |
This visual check helps confirm that the groups created by the K-Means algorithm are distinct and meaningful.
|
|
@@ -116,23 +116,21 @@ This shows that passengers are grouped into three distinct "service profiles" ba
|
|
| 116 |
# Part 5: Model Training & Evaluation
|
| 117 |
In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
|
| 118 |
#### 1. Updated Linear Regression (Refined Baseline)
|
| 119 |
-
|
| 120 |
This allowed me to see how much the additional feature engineering improved the initial baseline performance.
|
| 121 |
#### 2. Decision Tree Regressor
|
| 122 |
-
|
| 123 |
To limit overfitting, I set max_depth=5, which caps the depth of the tree and helps the model generalize to unseen data.
|
| 124 |
#### 3. Random Forest Regressor (Ensemble Method)
|
| 125 |
-
|
| 126 |
By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
|
| 127 |
-
|
| 128 |
#### 4. Performance Comparison & Visualization
|
| 129 |
-
|
| 130 |

|
| 131 |
We can see that the Random Forest model performs best. It aggregates multiple decision trees, which reduces variance and limits overfitting, leading to more stable and accurate predictions for arrival delays.
|
| 132 |
-
Part 7: Regression-to-Classification
|
| 133 |
-
In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
|
| 134 |
-
|
| 135 |
-
# Part 7
|
| 136 |
### 7.1 Creating Classes from Numeric Target
|
| 137 |
Conversion Strategy: Applied a business-rule threshold to convert the continuous target into discrete categories.
|
| 138 |
Threshold Selection: Defined the cutoff point at 0 minutes.
|
|
|
|
| 35 |
In this section, I performed data cleaning and initial analysis to understand the dataset better before building the models.
|
| 36 |
|
| 37 |
#### 1. Data Cleaning
|
| 38 |
+
Selected the 15 features that are most relevant to my research question.
|
| 39 |
+
Checked for missing values and decided to drop rows where the target variable (Arrival Delay in Minutes) was missing to keep the data accurate.
|
| 40 |
+
Verified that there are no duplicate rows in the dataset.
|
| 41 |
+
Standardized all text columns by converting them to lowercase and removing extra spaces.
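
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the actual notebook code; the sample values and the variable names (`clean`, `text_cols`, `n_duplicates`) are assumptions.

```python
import pandas as pd

# Tiny synthetic frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Gender": [" Male", "Female ", "Female ", "male"],
    "Customer Type": ["Loyal Customer", "disloyal Customer",
                      "disloyal Customer", "Loyal Customer"],
    "Arrival Delay in Minutes": [5.0, None, None, 20.0],
})

TARGET = "Arrival Delay in Minutes"

# Drop rows where the target variable is missing.
clean = df.dropna(subset=[TARGET]).copy()

# Count exact duplicate rows among the remaining data.
n_duplicates = int(clean.duplicated().sum())

# Lowercase the text columns and strip surrounding whitespace.
text_cols = ["Gender", "Customer Type"]
for col in text_cols:
    clean[col] = clean[col].str.lower().str.strip()
```

Listing the text columns explicitly keeps the sketch independent of how a given pandas version labels string dtypes.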
|
| 42 |
#### 2. Descriptive Statistics & Data Structure
|
| 43 |
+
Checked the data types and the final shape of the table after cleaning.
|
| 44 |
+
Calculated the percentage and count for each category, like Gender, Customer Type, and satisfaction, to see how the data is distributed.
|
| 45 |
+
Checked the service rating scales (like Inflight wifi service) to make sure all values are between 1 and 5.
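
As a rough sketch of these checks, the counts, percentages, and rating-range test can be done with a few pandas calls. The sample data below is invented for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Made-up sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Gender": ["male", "female", "female", "male", "female"],
    "Inflight wifi service": [3, 5, 1, 4, 2],
})

# Count and percentage of each category.
counts = df["Gender"].value_counts()
percentages = df["Gender"].value_counts(normalize=True) * 100

# Check that every rating falls on the expected 1 to 5 scale.
ratings_in_range = bool(df["Inflight wifi service"].between(1, 5).all())
```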
|
| 46 |
#### 3. Outlier Detection
|
| 47 |
+
Used the IQR (Interquartile Range) method to find outliers in columns like Age, Flight Distance, and delay times.
|
| 48 |
+
Calculated the percentage of outliers for each feature to understand how many extreme values exist in the data.
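
A minimal sketch of the IQR rule, assuming the common 1.5 x IQR fences (the README does not state the multiplier). The helper name `iqr_outlier_share` and the sample distances are hypothetical.

```python
import pandas as pd


def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Return the percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return 100.0 * float(is_outlier.mean())


# Made-up flight distances with one extreme value.
flight_distance = pd.Series([200, 300, 400, 500, 600, 700, 800, 900, 1000, 10000])
share = iqr_outlier_share(flight_distance)  # percentage of flagged values
```

Here Q1 = 425 and Q3 = 875, so the upper fence is 1550 and only the 10000 value is flagged.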
|
| 49 |
#### Data Exploration: Answering Key Research Questions through Visualization
|
| 50 |
#### Following the detection of outliers in flight distance, how extreme is their distribution and what impact might they have on the model's scaling?
|
| 51 |

|
|
|
|
| 74 |
# Part 3: Baseline Regression Modeling
|
| 75 |
In this section, we built a baseline Linear Regression model to establish a performance benchmark for predicting flight delays.
|
| 76 |
#### 1. Data Preparation & Feature Selection
|
| 77 |
+
Defined Arrival Delay in Minutes as the target variable.
|
| 78 |
+
Selected 8 key numerical features (like Age and Flight Distance) as predictors.
|
| 79 |
+
Dropped any remaining missing values so that the model is trained only on complete rows.
|
| 80 |
#### 2. Model Training
|
| 81 |
+
Split the data into 80% training and 20% testing sets (using random_state=42 for reproducibility).
|
| 82 |
+
Trained a basic Linear Regression model to learn the relationship between the features and delays.
|
| 83 |
#### 3. Performance Evaluation
|
| 84 |
+
Evaluated the model's performance using MAE, MSE, RMSE, and R-squared.
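
The baseline workflow (split, fit, evaluate) might look roughly like this. The data is synthetic, with 8 numeric predictors to mirror the setup described above; the coefficients and noise level are assumptions made only so the example runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
true_coef = np.arange(1, 9, dtype=float)  # made-up coefficients
y = X @ true_coef + rng.normal(scale=0.5, size=500)

# 80/20 split with a fixed seed, as described in the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)
```

Taking the square root of MSE by hand avoids depending on version-specific RMSE helpers in sklearn.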
|
| 85 |
|
| 86 |
#### Baseline Model: Actual vs Predicted Arrival Delays
|
| 87 |

|
|
|
|
| 95 |
# Part 4: Advanced Feature Engineering & Preprocessing
|
| 96 |
In this stage, we prepared the dataset for more complex models and created new features to improve prediction power.
|
| 97 |
#### 1. Creating the Classification Target
|
| 98 |
+
Created a binary target for the classification task.
|
| 99 |
+
A flight is marked as 'delayed' (1) if the arrival delay exceeds 15 minutes.
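
A short sketch of the 15-minute rule. The column name `is_delayed` is hypothetical, and the sample delays are invented.

```python
import pandas as pd

df = pd.DataFrame({"Arrival Delay in Minutes": [0, 10, 15, 16, 120]})

DELAY_THRESHOLD = 15  # minutes, as stated above

# 1 = delayed (strictly more than 15 minutes), 0 = on time.
df["is_delayed"] = (df["Arrival Delay in Minutes"] > DELAY_THRESHOLD).astype(int)
```

Because the rule says "exceeds", a delay of exactly 15 minutes is labeled 0 here.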
|
| 100 |
#### 2. Categorical Encoding
|
| 101 |
+
Converted categorical text features into numerical values.
|
| 102 |
+
This process enables the machine learning models to process non-numeric data columns.
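
The README does not say which encoder was used. One common option is one-hot encoding with `pd.get_dummies`, sketched below on made-up rows; treat the choice of encoder and `drop_first=True` as assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "female", "female"],
    "Customer Type": ["loyal customer", "disloyal customer", "loyal customer"],
    "Age": [34, 52, 27],
})

# One-hot encode the text columns; drop_first avoids a redundant column
# per feature (the dropped category becomes the implicit baseline).
encoded = pd.get_dummies(
    df, columns=["Gender", "Customer Type"], drop_first=True, dtype=int
)
```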
|
| 103 |
#### 3. Feature Scaling
|
| 104 |
+
Scaled the numerical features using the StandardScaler tool from sklearn.
|
| 105 |
+
This gives each scaled feature a mean of 0 and a standard deviation of 1.
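
A minimal scaling sketch with made-up values. Fitting the scaler on the training split only and reusing it on the test split is a common practice assumed here, not something the README states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up Age and Flight Distance values.
X_train = np.array([[20.0, 300.0], [40.0, 1500.0], [60.0, 2700.0]])
X_test = np.array([[30.0, 900.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std
```

Only the data used for fitting ends up with mean 0 and standard deviation 1 exactly; the test split is scaled with the same parameters.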
|
| 106 |
#### 4. Feature Engineering with Unsupervised Learning (K-Means)
|
| 107 |
+
Used K-Means clustering to group passengers based on their service ratings.
|
| 108 |
+
The resulting cluster ID is added as a new feature to represent a 'passenger profile'.
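
Roughly, the clustering step could look like the sketch below. The ratings are random, the rating column names other than Inflight wifi service are placeholders, and `passenger_profile`, `n_init=10`, and `random_state=42` are assumptions; three clusters matches the three service profiles reported in this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random 1-5 ratings standing in for the real service-rating columns.
rng = np.random.default_rng(0)
rating_cols = ["Inflight wifi service", "rating_b", "rating_c", "rating_d"]
ratings = pd.DataFrame(rng.integers(1, 6, size=(300, 4)), columns=rating_cols)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# The cluster ID becomes a new categorical feature.
ratings["passenger_profile"] = kmeans.fit_predict(ratings[rating_cols])
```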
|
| 109 |
#### 5. Cluster Visualization (PCA)
|
| 110 |
To validate the clusters, I used PCA (Principal Component Analysis) to reduce the service ratings into two dimensions.
|
| 111 |
This visual check helps confirm that the groups created by the K-Means algorithm are distinct and meaningful.
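
A sketch of the PCA projection on synthetic data with three well-separated groups. The data, seeds, and variable names are assumptions, and plotting is left as a comment so the block has no display dependency.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Three synthetic groups of "ratings" around different average levels.
rng = np.random.default_rng(0)
centers = np.array([[1.5] * 5, [3.0] * 5, [4.5] * 5])
X = np.vstack([c + rng.normal(scale=0.3, size=(100, 5)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project the 5 rating dimensions onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
explained = float(pca.explained_variance_ratio_.sum())

# A scatter plot of coords[:, 0] vs coords[:, 1], colored by labels,
# gives the 2D cluster view (for example with matplotlib's plt.scatter).
```

Checking `explained` is a useful companion to the plot: if the two components capture little variance, visual separation in 2D says less about the clusters in the full rating space.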
|
|
|
|
| 116 |
# Part 5: Model Training & Evaluation
|
| 117 |
In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
|
| 118 |
#### 1. Updated Linear Regression (Refined Baseline)
|
| 119 |
+
Re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and encoded categorical variables.
|
| 120 |
This allowed me to see how much the additional feature engineering improved the initial baseline performance.
|
| 121 |
#### 2. Decision Tree Regressor
|
| 122 |
+
Implemented a Decision Tree model to capture non-linear relationships between the features.
|
| 123 |
To limit overfitting, I set max_depth=5, which caps the depth of the tree and helps the model generalize to unseen data.
|
| 124 |
#### 3. Random Forest Regressor (Ensemble Method)
|
| 125 |
+
Trained a Random Forest model consisting of 100 individual trees.
|
| 126 |
By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2
|
| 127 |
+
Score compared to a single decision tree.
|
| 128 |
#### 4. Performance Comparison & Visualization
|
| 129 |
+
Created a comparison bar chart of R^2 scores to show which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
|
| 130 |

|
| 131 |
We can see that the Random Forest model performs best. It aggregates multiple decision trees, which reduces variance and limits overfitting, leading to more stable and accurate predictions for arrival delays.
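
The three-model comparison can be sketched as below. The data is synthetic and non-linear by construction, so the ranking it produces says nothing about the real dataset; `max_depth=5` and 100 trees follow the text above, while the `random_state` values on the models are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 4))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and record its R^2 on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)  # model name with the highest R^2
```

The `scores` dictionary maps directly onto a bar chart, with one bar per model.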
|
| 132 |
+
# Part 7: Regression-to-Classification
|
| 133 |
+
In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
|
|
|
|
|
|
|
| 134 |
### 7.1 Creating Classes from Numeric Target
|
| 135 |
Conversion Strategy: Applied a business-rule threshold to convert the continuous target into discrete categories.
|
| 136 |
Threshold Selection: Defined the cutoff point at 0 minutes.
|