https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link

# Flight Delay Prediction — Full Project (Parts 1–8)

## Overview

- This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays.
- The work is structured into eight stages, following the assignment's required workflow.

## 1. Dataset Overview

- The dataset contains ~96K domestic U.S. flights. It includes:
  - Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
  - Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
  - Delay-cause columns (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)
- Goal: build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified the relevant delay-cause columns and analyzed their contributions (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly and hourly delays).
- Compared delays across airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays (a "snowball effect" of delays accumulating through the day).
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.
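The monthly and hourly aggregations above can be sketched with pandas. This is a minimal illustration on a toy stand-in DataFrame (the real data has ~96K rows), using the column names listed in Part 1:

```python
import pandas as pd

# Toy stand-in for the flights DataFrame (illustrative values only).
flights = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [800, 2000, 900, 2100, 700, 2200],
    "ARRIVAL_DELAY": [5, 30, -2, 45, 0, 60],
})

# Hour of day from the HHMM-style scheduled departure time.
flights["DEP_HOUR"] = flights["SCHEDULED_DEPARTURE"] // 100

# Average arrival delay per month and per departure hour.
monthly = flights.groupby("MONTH")["ARRIVAL_DELAY"].mean()
hourly = flights.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()

print(monthly)
print(hourly)
```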
## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON) and used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, so there is no sign of overfitting.
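The baseline fit and its evaluation can be sketched as follows. The features here are a synthetic stand-in (not the project's actual pre-takeoff columns); the metric functions are the standard scikit-learn ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-takeoff features and the ARRIVAL_DELAY target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, pred)
print(mae, rmse, r2)
```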
## 4. Feature Engineering

Multiple transformations were applied to enhance model performance:

### 4.1 Encoding

- One-hot encoding: AIRLINE
- Frequency encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT
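A minimal sketch of the two encodings on a toy DataFrame (column names as in the dataset; in the real pipeline the frequency map would be computed on the training split only):

```python
import pandas as pd

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "JFK", "ORD"],
})

# One-hot encode the low-cardinality AIRLINE column.
df = pd.get_dummies(df, columns=["AIRLINE"], prefix="AIRLINE")

# Frequency-encode the high-cardinality airport column:
# each airport code is replaced by its share of all flights.
freq = df["ORIGIN_AIRPORT"].value_counts(normalize=True)
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(freq)
print(df)
```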
### 4.2 New Features

- IS_WEEKEND — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added a new feature: FLIGHT_CLUSTER

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups
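The clustering and PCA steps can be sketched like this, on synthetic DISTANCE/SCHEDULED_TIME pairs (for illustration the PCA runs on the same two scaled features; the project may project a wider feature set):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy DISTANCE (miles) and SCHEDULED_TIME (minutes) pairs.
X = np.column_stack([
    rng.uniform(200, 3000, size=200),
    rng.uniform(40, 400, size=200),
])

X_scaled = StandardScaler().fit_transform(X)

# K-Means with k=4 assigns each flight a FLIGHT_CLUSTER label.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
flight_cluster = km.fit_predict(X_scaled)

# 2-D PCA projection to eyeball whether the clusters separate visually.
coords = PCA(n_components=2).fit_transform(X_scaled)
print(flight_cluster[:10], coords.shape)
```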
### 4.5 Scaling

- Removed leakage-prone and irrelevant fields
- Scaled the final 33 features with StandardScaler
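The scaling step itself is a one-liner; a toy sketch (in the real pipeline the scaler is fit on the training split only and then applied to the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different column scales.
X = np.array([[100.0, 1.0], [300.0, 3.0], [500.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column -> mean 0, std 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```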
## 5. Improved Regression Models

Three models were trained on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance: RMSE ≈ 9.04, R² ≈ 0.89.
- This improves on the baseline because tree-based models capture non-linear relationships that linear regression misses.
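The three-model comparison can be illustrated on synthetic data with a deliberately non-linear target, where the tree ensembles should beat plain linear regression (stand-in data, not the project's features or scores):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Non-linear target: linear regression cannot fit the quadratic term.
y = np.sin(X[:, 0]) * 3 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
rmse = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, m.predict(X_te)) ** 0.5
print(rmse)
```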
## 6. Winning Regression Model + Deployment

- Selected the Gradient Boosting Regressor as the winning model.
- Exported it with pickle:

```python
import pickle

# best_model is the fitted Gradient Boosting Regressor from Part 5.
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

- Uploaded the file to a dedicated Hugging Face model repository.
## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest ~33% of delays
- Class 1: middle ~33%
- Class 2: highest ~33%

Why quantiles? This ensures balanced classes and avoids distortions caused by the skewed delay distribution.
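Quantile binning can be done with pandas `qcut`, which produces (near-)equal-sized classes even when the underlying delays are heavily skewed (toy values below):

```python
import pandas as pd

# Toy, right-skewed delay values in minutes.
delays = pd.Series([-10, -5, 0, 3, 8, 15, 25, 40, 90])

# Three quantile bins -> labels 0/1/2, each covering ~a third of flights.
classes = pd.qcut(delays, q=3, labels=[0, 1, 2])
print(classes.value_counts().sort_index())
```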
### 7.2 Class Balance Check

- Class distribution remained balanced (≈ 32–36% per class), so plain accuracy is a meaningful metric.
- Macro-F1 was also tracked to ensure fair performance across all three classes.
## 8. Classification Models

### 8.1 Precision vs. Recall

Both were evaluated, but recall was emphasized: misclassifying a high-delay flight as low-delay is more costly than the opposite error.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (by macro-F1): Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.
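The evaluation loop for one model can be sketched as follows, here Logistic Regression on a synthetic 3-class problem standing in for the binned delay classes (not the project's actual features or scores):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class dataset as a stand-in for the delay classes.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-class precision/recall/F1, the confusion matrix, and macro-F1.
macro_f1 = f1_score(y_te, pred, average="macro")
print(classification_report(y_te, pred))
print(confusion_matrix(y_te, pred))
print(macro_f1)
```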
### 8.4 Exporting the Winning Classifier

```python
import pickle

# best_cls_model is the fitted Logistic Regression from Part 8.3.
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

- Uploaded the classifier to the same Hugging Face repository, as required.