[Project notebook (Google Drive)](https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link)

# Flight Delay Prediction — Full Project (Parts 1-8)

## Overview

This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays. The work is structured into eight clear stages, following the assignment's required workflow.

## 1. Dataset Overview

The dataset contains ~96K domestic U.S. flights. It includes:

- Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
- Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
- Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)

**Goal:** build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.

## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays ("snowball effect").
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/rCsDltEV7JMkxObEGve69.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/CFyFBQyYKfMihX7Qyfq6Q.png)

## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, so there was no overfitting.

## 4. Feature Engineering

Performed multiple transformations to enhance model performance.

### 4.1 Encoding

- One-Hot Encoding: AIRLINE
- Frequency Encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT

### 4.2 New Features

- IS_WEEKEND — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added a new feature: FLIGHT_CLUSTER

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups

### 4.5 Scaling

- Removed leakage / irrelevant fields
- Scaled the final 33 features using StandardScaler

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/Vv6ExAplhy8ogI_wummhI.png)

## 5. Improved Regression Models

Trained three models on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance
- RMSE ≈ 9.04
- R² ≈ 0.89

This improves on the baseline because tree-based models capture non-linear relationships.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/caa8MPXWSZjFLKELJVjtR.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/5lxtX4NaMmHEPOhj-WF4X.png)

## 6. Winning Regression Model + Deployment

Selected the Gradient Boosting Regressor as the winner and exported it using pickle:

```python
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

The model was uploaded to a dedicated Hugging Face model repository.

## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest 33% of delays
- Class 1: middle 33%
- Class 2: highest 33%

Why? This ensures balanced classes and avoids distortions caused by the skewed delay distribution.
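The quantile binning above can be sketched with pandas; the toy delay values below are illustrative, and the real project applies this to the ARRIVAL_DELAY column instead.

```python
import pandas as pd

# Toy arrival delays in minutes (illustrative values, not project data).
delays = pd.Series([-12, -5, 0, 3, 7, 11, 18, 25, 40, 65, 90, 120])

# pd.qcut splits on the empirical 33rd/67th percentiles, so each class
# gets roughly a third of the flights regardless of how skewed the
# delay distribution is (class 0 = lowest delays, class 2 = highest).
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2])

print(delay_class.value_counts().sort_index())
```

Fixed-width bins (`pd.cut`) would instead lump most flights into the lowest class, which is exactly the distortion quantile binning avoids.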
### 7.2 Class Balance Check

The class distribution remained balanced (≈ 32–36% per class). Therefore:

- Accuracy is meaningful
- Macro-F1 was also tracked to ensure fair performance across classes

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/fgqVECFXrTjsvJpxoWFgd.png)

## 8. Classification Models

### 8.1 Precision vs. Recall

We evaluated both but emphasized recall, because misclassifying high-delay flights as low-delay is more costly than the opposite.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (macro F1): Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/l8BoeRnFkX28fDwZrArNx.png)

### 8.4 Exporting the Winning Classifier

```python
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

The classifier was uploaded to the same Hugging Face repository, as required.
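A minimal end-to-end sketch of the export step: a stand-in Logistic Regression (the project's winning classifier) is fitted on synthetic data, pickled, and reloaded to verify the artifact round-trips before upload. The synthetic dataset and variable names are illustrative, not the project's.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the winning classifier; the real model is trained on
# the engineered flight features with 3 delay classes.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
best_cls_model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize exactly as in the project.
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)

# Reload and check predictions match, to confirm the pickle is valid
# before pushing it to the Hugging Face repository (e.g. with
# huggingface_hub's HfApi.upload_file).
with open("winning_classifier.pkl", "rb") as f:
    restored = pickle.load(f)

assert (restored.predict(X) == best_cls_model.predict(X)).all()
```

Round-tripping the pickle locally is a cheap sanity check that catches unpicklable attributes or version mismatches before deployment.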