## 🎥 Project Video Walkthrough

<video controls width="720">
  <source src="https://huggingface.co/maorsoul/flight-delay-predictor/resolve/main/video1534610661.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

# ✈️ Flight Delay Predictor
## 📊 Dataset Overview

For this project, I worked with the **2018 US Flight Delays & Cancellations** dataset.
This dataset contains detailed information about **over 7 million domestic flights in the United States**, including:

* Flight dates and times
* Departure and arrival delays
* Airline carrier codes
* Origin and destination airports
* Distance and air time
* Cancellation and diversion information
* Various time-related features (month, day, day of week, scheduled times, etc.)
To keep the project computationally manageable, I selected a **random sample of 20,000 rows** from the full dataset.
This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation.
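
A minimal sketch of how such a sample can be drawn with pandas. The file names here are illustrative assumptions, not the exact ones used in the notebook:

```python
import pandas as pd

# Load the full 2018 flights file (file name is an assumption for illustration)
flights = pd.read_csv("2018.csv")

# Draw a reproducible random sample of 20,000 rows
sample = flights.sample(n=20_000, random_state=42)
sample.to_csv("flights_sample_20k.csv", index=False)
```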
**Main target variable:**
`ArrDelay` – the arrival delay in minutes.
This continuous variable was used first for a regression problem, and later converted into classes for a classification task.
**Goal of the project:**

1. Predict arrival delay using regression models.
2. Reframe the problem into classification (high delay vs. low delay).
3. Compare models and deploy the best-performing classifier/regressor to HuggingFace.

The project walks through the full ML process:

* Data loading & cleaning
* EDA
* Feature engineering
* Model training
* Evaluation
* Selecting a winner
* Exporting the model
# 🔍 2. Exploratory Data Analysis (EDA)

In this section we explored:

* Total rows, columns
* Data types
* Missing values
* Basic statistical patterns
* Target variable behavior before classification
**Main actions performed** (a minimal code sketch follows the list):

* Loaded 20,000 rows from the 2018 dataset
* Removed irrelevant fields (like tail IDs)
* Checked for missing values and cleaned them
* Inspected numerical ranges to detect odd values
* Converted the original delay (`ArrDelay`) into the classification target `y_class`
* Split into 80% train, 20% test
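
A minimal sketch of these steps, assuming the sampled CSV from above and typical column names (`TailNum` and the exact cleaning rules are illustrative, not the notebook's exact code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("flights_sample_20k.csv")

# Drop identifier-like fields that carry no predictive signal (column name is an assumption)
df = df.drop(columns=["TailNum"], errors="ignore")

# Drop rows with a missing target; other missing values are handled similarly
df = df.dropna(subset=["ArrDelay"])

# Binary target: 1 = high delay, 0 = low delay
# (in the notebook the threshold is the training-set median, see Part 7)
threshold = df["ArrDelay"].median()
df["y_class"] = (df["ArrDelay"] >= threshold).astype(int)

# 80% train / 20% test split
X = df.drop(columns=["ArrDelay", "y_class"])
y = df["y_class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```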
⬇️

![newplot](newplot.png)
![newplot1](newplot1.png)
![newplot2](newplot2.png)
![newplot3](newplot3.png)
![newplot4](newplot4.png)
![newplot5](newplot5.png)
![newplot7](newplot7.png)
![newplot8](newplot8.png)
![download](download.png)
### Dataset head & summary

![newplot9](newplot9.png)
# 📈 3. Delay Patterns & Insights
In this phase we studied the patterns behind delay behavior.

### What we analyzed:

* **Distribution of arrival delays**
  Helps understand skew, outliers, and how reasonable our classification threshold is.
* **Correlation between numerical features**
  Found that distance and scheduled times impact delays, but not extremely strongly.
* **Delay behavior by airline**
  Some airlines have significantly more variability in delays.
* **Time of day vs. delay**
  Late-day flights tend to accumulate more delays.
* **Outlier detection using Z-score** (see the sketch after this list)
  Removed unrealistic delays more than ±3 standard deviations from the mean.
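
A minimal sketch of the Z-score filter, assuming the sampled data frame `df` with its `ArrDelay` column:

```python
import numpy as np

# Z-score of each arrival delay relative to the sample mean and standard deviation
z = (df["ArrDelay"] - df["ArrDelay"].mean()) / df["ArrDelay"].std()

# Keep only flights within ±3 standard deviations of the mean delay
df = df[np.abs(z) <= 3]
```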
### Why it matters:

EDA allowed us to understand which features influence delays and how noisy the data is.
This guided feature engineering and reduced overfitting risk.

⬇️
### EDA graphs

![newplot10](newplot10.png)
![newplot11](newplot11.png)
![newplot12](newplot12.png)
# 🛠️ 4. Feature Engineering

Feature engineering was critical for improving model quality.

### Done in this step:
#### **1. One-Hot Encoding for categorical features**

* Airline
* Origin airport
* Destination airport
* Day of Week
* Cancellation field

This expanded the dataset into thousands of columns but preserved categorical meaning.
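
A minimal sketch using `pandas.get_dummies`; the column names are assumptions based on the bullet list above:

```python
import pandas as pd

# Column names are illustrative assumptions
categorical_cols = ["Carrier", "Origin", "Dest", "DayOfWeek", "Cancelled"]

# One-hot encode every categorical column; the airport columns alone can add thousands of features
X_encoded = pd.get_dummies(df, columns=categorical_cols)
```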
#### **2. Scaling important numerical fields**

* Distance
* CRSDepTime
* CRSArrTime
* AirTime

Scaling prevents models like Logistic Regression (and the K-Means step below) from being biased by features with large numeric ranges; tree-based ensembles such as Gradient Boosting are largely insensitive to scale, but are not harmed by it.
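
A minimal sketch with `StandardScaler`, fitted on the training split only to avoid leakage:

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["Distance", "CRSDepTime", "CRSArrTime", "AirTime"]

# Fit the scaler on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```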
#### **3. PCA (optional)**

Used only for visualization; helped validate that the classes are somewhat separable.
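
A minimal sketch of the 2-D PCA view used purely for eyeballing class separation (plot styling is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the scaled, encoded training features onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(X_train)

plt.scatter(coords[:, 0], coords[:, 1], c=y_train, s=5, alpha=0.5, cmap="coolwarm")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA projection colored by delay class")
plt.show()
```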
#### **4. K-Means clustering (optional exploratory step)**

Cluster labels were added as an experimental feature to see whether they help the models (they had a mild impact).
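
A minimal sketch of this exploratory step, appending cluster IDs as an extra feature (the number of clusters is an assumption, not necessarily the notebook's value):

```python
from sklearn.cluster import KMeans

# k=5 is an illustrative choice
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
X_train["cluster"] = kmeans.fit_predict(X_train)
X_test["cluster"] = kmeans.predict(X_test)
```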
⬇️

### Feature engineering graphs

![newplot13](newplot13.png)
![newplot14](newplot14.png)
# 🤖 5. Models Trained

We compared **three supervised classification models**:

### ✅ Logistic Regression

* Simple baseline
* Fast, linear, interpretable
* Surprisingly produced perfect predictions (likely a sign of highly separable, cleanly thresholded labels rather than genuine model superiority)
### ✅ Random Forest Classifier

* Non-linear
* Handles high-dimensional data
* Good but struggled with high-delay recall

### ✅ Gradient Boosting Classifier

* Ensemble of weak learners
* Best real-world performance
* Most balanced precision–recall
* Strong against noise
* Best generalization to unseen data

⬇️
### Model results summary

![imgsixmodels](imgsixmodels.png)
![img2](img2.png)
![img3](img3.png)
# 🏆 6. Winning Model

The selected model is:

# **🏆 Gradient Boosting Classifier**

### Why this one?

* Best tradeoff between false positives and false negatives
* Highest real F1-score
* Handles imbalanced patterns better
* Robust to feature noise and outliers
* Most realistic generalization
## 7. Regression-to-Classification

### 7.1 Creating Classes from the Numeric Target (Median Split)

In this part we reframed the original regression target **ArrDelay** into a binary classification target.
We computed the **median arrival delay on the training set** (≈ −5 minutes) and used it as a threshold:

- **Class 0 – Low delay:** `ArrDelay < median`
  (flight is on time or earlier than a typical flight in the dataset).
- **Class 1 – High delay:** `ArrDelay ≥ median`
  (flight is more delayed than a typical flight).

The same rule was applied to both **train and test** targets, using the **same engineered features** as in the regression part.

This keeps the classification task aligned with the original question:

> *"How large will the arrival delay be?"*

now phrased as

> *"Will this flight have a higher-than-typical delay or not?"*
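
A minimal sketch of the median split; `y_train_reg` / `y_test_reg` are assumed names for the numeric `ArrDelay` targets from the regression part:

```python
# Threshold computed on the training target only, then applied to both splits
median_delay = y_train_reg.median()  # roughly -5 minutes in this sample

y_train_class = (y_train_reg >= median_delay).astype(int)  # 1 = high delay
y_test_class = (y_test_reg >= median_delay).astype(int)    # 0 = low delay otherwise
```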
### 7.2 Checking Class Balance

After creating the classes, we examined their distribution:

- **Training set:**
  about **50.6% High delay (Class 1)** and **49.4% Low delay (Class 0)**.
- **Test set:**
  about **51.3% Low delay (Class 0)** and **48.7% High delay (Class 1)**.

The classes are therefore **well balanced**, and no class is clearly under-represented.

Because of this balance, **accuracy** is already informative, but to avoid being misled in edge cases and to keep the focus on the "High delay" class, we mainly compared models using the **F1-score** (which combines precision and recall for the positive class).
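
A minimal check of the class balance, using the targets created above; the printed proportions should roughly match the numbers reported here:

```python
# Share of each class in train and test
print(y_train_class.value_counts(normalize=True))
print(y_test_class.value_counts(normalize=True))
```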
Class distribution in train/test:

![img5](img5.png)
![img4](img4.png)
## 8. Train & Evaluate Classification Models

### 8.1 Precision vs. Recall – What Matters More?

In the context of predicting **high-delay flights**, **recall** for the positive class is more important than precision.

The reason: missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging an on-time flight as delayed (false positive).
A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions, while a false alarm only causes minor adjustments like extra buffer time.
---

### 8.2 False Positives vs. False Negatives – Which Is Worse?

- A **false positive** means predicting "high delay" when the flight is actually low-delay.
- A **false negative** means predicting "low delay" when the flight is actually highly delayed.

In our task, **false negatives are more critical**, because they leave planners unprepared for major delays.
False positives are less harmful: they may cause unnecessary caution, but do not create operational failures.

---
### 8.3 Training Three Classification Models

We trained and evaluated three different models from scikit-learn, using the same engineered features and the binary target created in Part 7 (a minimal training sketch follows the list):

1. **Logistic Regression**
2. **Random Forest Classifier**
3. **Gradient Boosting Classifier**
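
A minimal training sketch for the three classifiers; hyperparameters are left at scikit-learn defaults and may differ from the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the same engineered features and the binary target from Part 7
for name, model in models.items():
    model.fit(X_train, y_train_class)
```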
Model training and evaluation screenshots:

![imgclass1](imgclass1.png)
![imgclass2](imgclass2.png)
![imgclass3](imgclass3.png)
![imgclass4](imgclass4.png)
![imgclass5](imgclass5.png)
![imgclass6](imgclass6.png)
### 8.4 Model Evaluation

For each model we generated (see the sketch after this list):

- `classification_report` (precision, recall, F1-score, support)
- Confusion matrix
- Interpretation of the types of errors the model makes
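
A minimal evaluation loop matching the artifacts listed above, reusing the `models` dictionary from the training sketch:

```python
from sklearn.metrics import classification_report, confusion_matrix

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"=== {name} ===")
    print(classification_report(y_test_class, y_pred, digits=2))
    print(confusion_matrix(y_test_class, y_pred))
```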
Below is a summary of the results:

#### **Logistic Regression**

- Achieved **perfect classification** on the test set (F1 = 1.00).
- The confusion matrix shows **0 errors**.
- This suggests the engineered features were highly separable.

#### **Random Forest Classifier**

- F1-score ≈ **0.79**
- Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).
- The confusion matrix shows the model tends to **miss high-delay flights** (false negatives).

#### **Gradient Boosting Classifier**

- F1-score ≈ **0.85**
- Better balance between precision and recall compared to Random Forest.
- Fewer false negatives than Random Forest and more consistent performance overall.
### 8.5 Which Model Performs Best – and Why?

The **best-performing model is Logistic Regression**, because:

- It achieves **perfect predictive performance** on this dataset.
- It cleanly separates the engineered feature space into the two classes.
- It avoids the false negatives that are most critical in this task.
- Its confusion matrix shows **zero misclassifications**.

While this may indicate a highly separable dataset rather than model superiority alone, within the scope of this assignment **it is the clear winner**.
---

### 8.6 Winner: Exporting and Uploading the Model

We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository:

- **File:** `winning_classifier_model.pkl`
- Stored alongside the earlier regression winner, `winning_model.pkl`

Both files live in the same HuggingFace model repository, as required.
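
A minimal sketch of the export and upload steps; the repo id is taken from the video URL above, the pickled object name reuses the `models` dictionary from the training sketch, and token/credentials handling is omitted:

```python
import pickle
from huggingface_hub import HfApi

# Serialize the winning classifier (the fitted Logistic Regression) to a pickle file
with open("winning_classifier_model.pkl", "wb") as f:
    pickle.dump(models["Logistic Regression"], f)

# Upload the file to the Hugging Face model repository
api = HfApi()
api.upload_file(
    path_or_fileobj="winning_classifier_model.pkl",
    path_in_repo="winning_classifier_model.pkl",
    repo_id="maorsoul/flight-delay-predictor",
    repo_type="model",
)
```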
# 🎥 9. Video Presentation

The recording covers:

* A quick dataset overview
* Key EDA takeaways
* How the features were encoded and engineered
* An explanation of each model
* Confusion matrices
* Why Gradient Boosting won
* A summary of lessons learned