Update README file.

README.md
# Flight Delay Prediction — Full Project (Parts 1-8)

## Overview
This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays. The work is structured into eight stages, following the assignment’s required workflow.
## 1. Dataset Overview

The dataset contains ~96K domestic U.S. flights. It includes:

- Scheduling information (`YEAR`, `MONTH`, `DAY`, `SCHEDULED_DEPARTURE`, etc.)
- Operational details (`DISTANCE`, `AIRLINE`, `ORIGIN_AIRPORT`, `DESTINATION_AIRPORT`)
- Delay-related causes (`WEATHER_DELAY`, `NAS_DELAY`, `LATE_AIRCRAFT_DELAY`, etc.)
- Target variable: `ARRIVAL_DELAY` (minutes)

**Goal:** Build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays (a “snowball effect”).
- Airlines differ strongly in punctuality.
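The monthly, hourly, and per-airline comparisons above boil down to simple pandas groupbys. The DataFrame below is a tiny illustrative stand-in for the ~96K-flight dataset, not the project's notebook code:

```python
import pandas as pd

# Tiny sample standing in for the real dataset; columns follow Part 1's naming.
df = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [800, 2100, 900, 2200, 700, 2300],  # HHMM format
    "AIRLINE": ["AA", "AA", "DL", "DL", "AA", "DL"],
    "ARRIVAL_DELAY": [5, 30, 2, 45, 10, 60],
})

# Derive the departure hour from the HHMM-style scheduled time.
df["DEP_HOUR"] = df["SCHEDULED_DEPARTURE"] // 100

monthly = df.groupby("MONTH")["ARRIVAL_DELAY"].mean()       # seasonality
hourly = df.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()     # "snowball effect"
by_airline = df.groupby("AIRLINE")["ARRIVAL_DELAY"].mean()  # punctuality gap
```

Even on this toy sample, the evening hours (21-23) show larger mean delays than the morning ones, which is the pattern the EDA surfaced at scale.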
## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., `DEPARTURE_DELAY`, `WHEELS_ON`).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, indicating no overfitting.
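The baseline workflow can be sketched as follows. The features here are synthetic stand-ins for the real pre-takeoff columns, and the resulting metrics are not the project's numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-in for leakage-free, pre-takeoff features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics used in the project.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, pred)
```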
## 4. Feature Engineering

Performed multiple transformations to enhance model performance.

### 4.1 Encoding

- One-Hot Encoding → `AIRLINE`
- Frequency Encoding → `ORIGIN_AIRPORT`, `DESTINATION_AIRPORT`

### 4.2 New Features

- `IS_WEEKEND` — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on `DISTANCE` and `SCHEDULED_TIME`
- Added new feature: `FLIGHT_CLUSTER`

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups

### 4.5 Scaling

- Removed leakage / irrelevant fields
- Scaled the final 33 features using `StandardScaler`
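The encoding, clustering, and scaling steps can be sketched on a toy frame. Assumptions here: `DAY_OF_WEEK` uses 6/7 for Saturday/Sunday, only `ORIGIN_AIRPORT` is frequency-encoded (the destination column is handled the same way), and k=2 replaces the project's k=4 because the sample is tiny:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA", "DL", "AA", "UA", "DL"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "JFK", "ORD", "ATL", "LAX", "ORD", "JFK"],
    "DAY_OF_WEEK": [1, 6, 7, 3, 5, 6, 2, 7],
    "DISTANCE": [250, 760, 2475, 740, 590, 2300, 800, 950],
    "SCHEDULED_TIME": [80, 130, 340, 125, 110, 320, 135, 160],
})

# 4.1 One-hot encode AIRLINE; frequency-encode the airport column.
df = pd.get_dummies(df, columns=["AIRLINE"])
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(
    df["ORIGIN_AIRPORT"].value_counts(normalize=True)
)
df = df.drop(columns=["ORIGIN_AIRPORT"])

# 4.2 IS_WEEKEND flag.
df["IS_WEEKEND"] = df["DAY_OF_WEEK"].isin([6, 7]).astype(int)

# 4.3 K-Means on DISTANCE and SCHEDULED_TIME -> FLIGHT_CLUSTER.
df["FLIGHT_CLUSTER"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    df[["DISTANCE", "SCHEDULED_TIME"]]
)

# 4.5 Scale the final feature matrix (33 features in the real project).
X_scaled = StandardScaler().fit_transform(df)
```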
## 5. Improved Regression Models

Trained three models on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance:
  - RMSE ≈ 9.04
  - R² ≈ 0.89
- This improves over the baseline because tree-based models capture non-linear relationships.
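Why tree-based models help can be seen on synthetic data with a non-linear target (illustrative only; nothing here comes from the project's dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
# Non-linear interaction that a linear model cannot represent.
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rmse = {}
for name, model in {
    "linear": LinearRegression(),
    "gbr": GradientBoostingRegressor(random_state=0),
}.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse[name] = mean_squared_error(y_te, pred) ** 0.5
```

On this data the boosted trees cut the test RMSE well below the linear model's, mirroring (in exaggerated form) the baseline → Gradient Boosting improvement above.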
## 6. Winning Regression Model + Deployment

- Selected the Gradient Boosting Regressor as the winner.
- Exported it using pickle:

```python
import pickle

with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

- Uploaded to a dedicated HuggingFace model repository.
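A quick round-trip check of the pickle export, using a small fitted stand-in since `best_model` (the tuned Gradient Boosting Regressor) lives in the project notebook:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1

# Stand-in for the project's winning model.
best_model = LinearRegression().fit(X, y)

with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Anyone downloading the .pkl from the model repo restores it the same way.
with open("winning_model.pkl", "rb") as f:
    restored = pickle.load(f)
```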
## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest 33% of delays
- Class 1: middle 33%
- Class 2: highest 33%

Why? This ensures balanced classes and avoids distortions caused by skewed delay distributions.

### 7.2 Class Balance Check

Class distribution remained balanced (≈ 32–36% per class). Therefore:

- Accuracy is meaningful.
- Macro-F1 was also tracked to ensure fair performance across classes.
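Quantile binning of the kind described in 7.1 is one call to `pd.qcut`, which yields equal-sized classes even when the delay distribution is heavily right-skewed (the data below is a synthetic skewed stand-in for `ARRIVAL_DELAY`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed stand-in for ARRIVAL_DELAY (minutes): mostly near on-time, long right tail.
delays = pd.Series(rng.exponential(scale=15, size=900) - 5)

# Quantile binning into 3 equal-sized classes regardless of the skew.
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2]).astype(int)

shares = delay_class.value_counts(normalize=True)
```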
## 8. Classification Models

### 8.1 Precision vs. Recall

We evaluated both but emphasized recall, because misclassifying high-delay flights as low-delay is more costly than the opposite.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (macro F1): Logistic Regression. Even though it is simple, it produced the most balanced performance across all classes.

### 8.4 Exporting the Winning Classifier

```python
import pickle

with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

Uploaded the classifier to the same HuggingFace repository as required.
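The per-model evaluation described in 8.3 can be sketched as follows; the 3-class data is synthetic, standing in for the quantile-binned delay classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
# Synthetic 3-class target (0, 1, 2) loosely tied to the features.
y = (X[:, 0] + 0.5 * X[:, 1] > 0.4).astype(int) + (X[:, 0] - X[:, 2] > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

report = classification_report(y_te, pred)  # per-class precision/recall/F1
cm = confusion_matrix(y_te, pred)           # where errors concentrate
macro_f1 = f1_score(y_te, pred, average="macro")
```

The off-diagonal cells of `cm` are what the error-pattern analysis looks at; macro-F1 averages the per-class F1 scores so that no class dominates the score.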