## ๐ŸŽฅ Project Video Walkthrough # โœˆ๏ธ Flight Delay Predictor ## ๐Ÿ“Œ Dataset Overview For this project, I worked with the **2018 US Flight Delays & Cancellations** dataset. This dataset contains detailed information about **over 7 million domestic flights in the United States**, including: * Flight dates and times * Departure and arrival delays * Airline carrier codes * Origin and destination airports * Distance and air time * Cancellation and diversion information * Various time-related features (month, day, day of week, scheduled times, etc.) To keep the project computationally manageable, I selected a **random sample of 20,000 rows** from the full dataset. This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation. **Main target variable:** `ArrDelay` โ€“ the arrival delay in minutes. This continuous variable was used first for a regression problem, and later converted into classes for a classification task. **Goal of the project:** 1. Predict arrival delay using regression models. 2. Reframe the problem into classification (high delay vs. low delay). 3. Compare models and deploy the best-performing classifier/regressor to HuggingFace. The project walks through the full ML process: * Data loading & cleaning * EDA * Feature engineering * Model training * Evaluation * Selecting a winner * Exporting the model # ๐Ÿ“Š 2. Exploratory Data Analysis (EDA) In this section we explored: * Total rows, columns * Data types * Missing values * Basic statistical patterns * Target variable behavior before classification **Main actions performed:** * Loaded 20,000 rows from the 2018 dataset * Removed irrelevant fields (like tail IDs) * Verified missing values and cleaned them * Verified numerical ranges to detect odd values * Converted original delay (`ArrDelay`) into the classification target `y_class` * Split into 80% train, 20% test โฌ‡๏ธ ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.37.33](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/H5TkmTdvamGzCX3tnkWbK.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.38.47](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ATuj1DhNFu4IOKADBVvfT.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.05](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/uVzJysvmUNKrI6dGyDxJU.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/dFULLmyeCowD54qkHMV3J.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ecNxcebQio2SOgl63r93a.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.57](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-G9t6hG5_-q9pBHxqN7rT.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cL1WcEpM2edSFbPHKSiUo.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.19](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/oY-wiihgzmlMtMvqzIZFK.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.29](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cYOZt4qv4fOxWr8RxQkfu.png) ### **Insert dataset head or summary as an image** ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.41.55](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/OzNiiXLyr8accJYlArisL.png) # ๐Ÿ” 3. Baseline Model In this phase we studied the patterns behind delay behavior. ### What we analyzed: * **Distribution of arrival delays** Helps understand skew, outliers, and how reasonable our classification threshold is. * **Correlation between numerical features** Found that distance and scheduled times impact delays but not extremely strongly. * **Delay behavior by airline** Some airlines have significantly more variability in delays. * **Time of day vs delay** Late-day flights tend to accumulate more delays. * **Outlier detection using Z-score** Removed unrealistic delays > ยฑ3 standard deviations. ### Why it matters: EDA allowed us to understand which features influence delays and how noisy the data is. This guided feature engineering and reduced overfitting risk. โฌ‡๏ธ ### **Place graphs here** ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/M3ETCvLN0Rf_AItthFbw3.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.28](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-m8eDsCZWlH-AxGvrtZgS.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/rXCdxRxcnJap6b9U28-d9.png) # ๐Ÿ› ๏ธ 4. Feature Engineering Feature engineering was critical for improving model quality. ### Done in this step: #### **1. One-Hot Encoding for categorical features** * Airline * Origin airport * Destination airport * Day of Week * Cancellation field This expanded the dataset into thousands of columns but preserved categorical meaning. #### **2. Scaling important numerical fields** * Distance * CRSDepTime * CRSArrTime * AirTime Scaling prevents models like Logistic Regression and Gradient Boosting from being biased by large numeric ranges. #### **3. PCA (optional)** Used only for visualization; helped validate that the classes are somewhat separable. #### **4. K-Means clustering (optional exploratory step)** Cluster labels added as an experimental feature to see if they help models (they had mild impact). โฌ‡๏ธ ### **Place FE graphs here** ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/T7pjOhFJL1Zn54OroFK9T.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.26](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/Tq6yVLGH-w8tLty1rbNQG.png) # ๐Ÿค– 5. Models Trained We compared **three supervised classification models**: ### โœ” Logistic Regression * Simple baseline * Fast, linear, interpretable * Surprisingly produced perfect predictions (overfitting to clean, thresholded labels) ### โœ” Random Forest Classifier * Non-linear * Handles high-dimensional data * Good but struggled with high-delay recall ### โœ” Gradient Boosting Classifier * Ensemble of weak learners * Best real-world performance * Most balanced precisionโ€“recall * Strong against noise * Best generalization to unseen data โฌ‡๏ธ ### **Insert models summary image** ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.46](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/LWGxTGHU-gYW2QRFOjhm_.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.46.01](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/WelQ-fbqravyTW1nYv4pW.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.46.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/kA863njS1KJj4ZvvIFvAq.png) # ๐Ÿ† 6. Winning Model The selected model is: # **๐ŸŒŸ Gradient Boosting Classifier** ### Why this one? * Best tradeoff between false positives and false negatives * Highest real F1-score * Handles imbalanced patterns better * Robust to feature noise and outliers * Most realistic generalization ## 7. Regression-to-Classification ### 7.1 Creating Classes from the Numeric Target (Median Split) In this part we reframed the original regression target **ArrDelay** into a binary classification target. We computed the **median arrival delay on the training set** (โ‰ˆ โˆ’5 minutes) and used it as a threshold: - **Class 0 โ€“ Low delay:** `ArrDelay < median` (flight is on time or earlier than a typical flight in the dataset). - **Class 1 โ€“ High delay:** `ArrDelay โ‰ฅ median` (flight is more delayed than a typical flight). The same rule was applied to both **train and test** targets, using the **same engineered features** as in the regression part. This keeps the classification task aligned with the original question: > *โ€œHow large will the arrival delay be?โ€* now phrased as > *โ€œWill this flight have a higher-than-typical delay or not?โ€* ### 7.2 Checking Class Balance After creating the classes, we examined their distribution: - **Training set:** about **50.6% High delay (Class 1)** and **49.4% Low delay (Class 0)**. - **Test set:** about **51.3% Low delay (Class 0)** and **48.7% High delay (Class 1)**. The classes are therefore **well balanced**, and no class is clearly under-represented. Because of this balance, **accuracy** is already informative, but to avoid being misled in edge cases and to keep the focus on the โ€œHigh delayโ€ class, we mainly compared models using the **F1-score** (which combines precision and recall for the positive class). ๐Ÿ‘‰ *Here I will insert a bar plot (or table screenshot) of the class distribution in train/test.* ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.55.24](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/fSo9lYbeNOK6_8qrBtFRc.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.55.39](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-kh6L76mQaE9tJv4nymxA.png) ## 8. Train & Evaluate Classification Models ### 8.1 Precision vs. Recall โ€” What Matters More? In the context of predicting **high-delay flights**, **recall** for the positive class is more important than precision. The reason: Missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging an on-time flight as delayed (false positive). A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions, while a false alarm only causes minor adjustments like extra buffer time. --- ### 8.1 False Positives vs. False Negatives โ€” Which Is Worse? - A **false positive** means predicting โ€œhigh delayโ€ when the flight is actually low-delay. - A **false negative** means predicting โ€œlow delayโ€ when the flight is actually highly delayed. In our task, **false negatives are more critical**, because they leave planners unprepared for major delays. False positives are less harmful โ€” they may cause unnecessary caution, but do not create operational failures. --- ### 8.2 Training Three Classification Models We trained and evaluated three different models from scikit-learn, using the same engineered features and the binary target created in Part 7: 1. **Logistic Regression** 2. **Random Forest Classifier** 3. **Gradient Boosting Classifier** ๐Ÿ‘‰ *Insert model training diagram or screenshots of code here (optional).* ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.57.44](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/vHzlqE8vnf7tRBxACgY-V.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.57.59](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o4D5WklP1INIFBvvubdf3.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.14](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o9L76PgbO7hWmyEZIfQHL.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/5BaZaCtq0RDU4Sg_kAneC.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.36](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/0YggL_58zalfn50WokKf0.png) ![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.48](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/S896FbQSOUX4pQTl5ym4d.png) ### 8.3 Model Evaluation For each model we generated: - `classification_report` (precision, recall, F1-score, support) - Confusion matrix - Interpretation of the types of errors the model makes Below is a summary of the results: #### **Logistic Regression** - Achieved **perfect classification** on the test set (F1 = 1.00). - The confusion matrix shows **0 errors**. - This suggests the engineered features were highly separable. #### **Random Forest Classifier** - F1-score โ‰ˆ **0.79** - Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay). - Confusion matrix shows the model tends to **miss high-delay flights** (false negatives). #### **Gradient Boosting Classifier** - F1-score โ‰ˆ **0.85** - Better balance between precision and recall compared to Random Forest. - Fewer false negatives than Random Forest and more consistent performance overall. ### 8.3 Which Model Performs Best โ€” and Why? The **best model is the Logistic Regression**, because: - It achieves **perfect predictive performance** on this dataset. - It cleanly separates the engineered feature space into the two classes. - It avoids the false negatives that are most critical in this task. - Its confusion matrix shows **zero misclassifications**. While this may indicate a highly separable dataset rather than model superiority alone, within the scope of this assignment **it is the clear winner**. --- ### 8.4 Winner: Exporting and Uploading the Model We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository: - **File:** `winning_classifier_model.pkl` - Stored alongside the earlier regression winning model file: - `winning_model.pkl` Both files live in the same HuggingFace model repository as required. # ๐ŸŽฅ 9. Video Presentation Your recording should include: * Quick dataset overview * Key EDA takeaways * How you encoded and engineered features * Explanation of each model * Confusion matrices * Why Gradient Boosting won * Summary of lessons learned