# ✈️ Flight Delay Predictor

A complete end-to-end machine learning project that predicts whether a flight will experience **low delay** or **high delay**, based on real U.S. flight data.

---
# 🎯 1. Project Goal

The original dataset included a **regression target** (`ArrDelay`).
To make the project more practical and interpretable, we transformed the target into a **binary classification problem**, using the median split detailed in Section 7:

* **Low delay (0)** → ArrDelay < median
* **High delay (1)** → ArrDelay ≥ median

The project walks through the full ML process:

* Data loading & cleaning
* EDA
* Feature engineering
* Model training
* Evaluation
* Selecting a winner
* Exporting the model
# 📊 2. Dataset Overview

In this section we explored:

* Total rows and columns
* Data types
* Missing values
* Basic statistical patterns
* Target variable behavior before classification

**Main actions performed** (a sketch follows the list):

* Loaded 20,000 rows from the 2018 dataset
* Removed irrelevant fields (like tail IDs)
* Checked for missing values and cleaned them
* Inspected numerical ranges to detect anomalous values
* Converted the original delay (`ArrDelay`) into the classification target `y_class`
* Split into 80% train, 20% test
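A minimal sketch of this step, assuming the sample lives in a local CSV (the file name and the `TailNum` column are illustrative; the target `y_class` is derived in Section 7):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("2018.csv", nrows=20_000)          # load 20,000 rows
df = df.drop(columns=["TailNum"], errors="ignore")  # drop irrelevant identifiers
df = df.dropna(subset=["ArrDelay"])                 # rows without a target are unusable

train, test = train_test_split(df, test_size=0.2, random_state=42)  # 80/20 split
```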
⬇️

![image](https://github.com/user-attachments/assets/placeholder1)

![image](https://github.com/user-attachments/assets/placeholder2)

![image](https://github.com/user-attachments/assets/placeholder3)

![image](https://github.com/user-attachments/assets/placeholder4)

![image](https://github.com/user-attachments/assets/placeholder5)

![image](https://github.com/user-attachments/assets/placeholder6)

![image](https://github.com/user-attachments/assets/placeholder7)

![image](https://github.com/user-attachments/assets/placeholder8)

![image](https://github.com/user-attachments/assets/placeholder9)

### **Dataset head and summary**

![image](https://github.com/user-attachments/assets/placeholder10)
# 🔍 3. Exploratory Data Analysis (EDA)

In this phase we studied the patterns behind delay behavior.

### What we analyzed:

* **Distribution of arrival delays**
  Helps understand skew, outliers, and how reasonable our classification threshold is.

* **Correlation between numerical features**
  Found that distance and scheduled times influence delays, though not strongly.

* **Delay behavior by airline**
  Some airlines show significantly more variability in delays.

* **Time of day vs. delay**
  Late-day flights tend to accumulate more delays.

* **Outlier detection using Z-score**
  Removed unrealistic delays beyond ±3 standard deviations (a sketch follows the list).
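A minimal sketch of the Z-score filter, assuming the delays are screened in the `ArrDelay` column:

```python
import numpy as np
from scipy import stats

# Keep only rows whose arrival delay is within 3 standard deviations of the mean.
z = np.abs(stats.zscore(df["ArrDelay"]))
df = df[z <= 3]
```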
### Why it matters:

EDA allowed us to understand which features influence delays and how noisy the data is.
This guided feature engineering and reduced the risk of overfitting.

⬇️

### **EDA graphs**

![image](https://github.com/user-attachments/assets/placeholder11)

![image](https://github.com/user-attachments/assets/placeholder12)

![image](https://github.com/user-attachments/assets/placeholder13)
# 🛠️ 4. Feature Engineering

Feature engineering was critical for improving model quality.

### Done in this step:

#### **1. One-Hot Encoding for categorical features**

* Airline
* Origin airport
* Destination airport
* Day of week
* Cancellation field

This expanded the dataset into thousands of columns but preserved categorical meaning.

#### **2. Scaling important numerical fields**

* Distance
* CRSDepTime
* CRSArrTime
* AirTime

Scaling keeps models such as Logistic Regression from being dominated by features with large numeric ranges; tree-based models like Gradient Boosting are less sensitive to scale, but a uniformly scaled input keeps the pipeline consistent. A sketch of both steps follows.
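A minimal sketch of both steps with a `ColumnTransformer`, assuming the `train`/`test` splits from Section 2; the column names are illustrative and may differ from the notebook:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["Airline", "Origin", "Dest", "DayOfWeek", "Cancelled"]
numerical = ["Distance", "CRSDepTime", "CRSArrTime", "AirTime"]

preprocess = ColumnTransformer([
    # one indicator column per category; ignore categories unseen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    # zero mean / unit variance for the numeric fields
    ("scale", StandardScaler(), numerical),
])

X_train_fe = preprocess.fit_transform(train)  # fit on the training split only
X_test_fe = preprocess.transform(test)        # reuse the fitted transformers
```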
#### **3. PCA (optional)**

Used only for visualization; it helped validate that the classes are somewhat separable.

#### **4. K-Means clustering (optional exploratory step)**

Cluster labels were added as an experimental feature to see whether they help the models (the impact was mild). A sketch of both optional steps follows.
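A minimal sketch of the two optional steps, reusing `X_train_fe` from above (the number of clusters is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# 2-D projection for visualization only (densified, since the one-hot output is sparse)
coords = PCA(n_components=2).fit_transform(X_train_fe.toarray())

# experimental cluster-label feature
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_train_fe)
```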
⬇️

### **Feature engineering graphs**

![image](https://github.com/user-attachments/assets/placeholder14)

![image](https://github.com/user-attachments/assets/placeholder15)
# 🤖 5. Models Trained

We compared **three supervised classification models**:

### ✅ Logistic Regression

* Simple baseline
* Fast, linear, interpretable
* Surprisingly produced perfect predictions (likely reflecting cleanly separable, thresholded labels rather than genuine model superiority)

### ✅ Random Forest Classifier

* Non-linear
* Handles high-dimensional data
* Good overall, but struggled with high-delay recall

### ✅ Gradient Boosting Classifier

* Ensemble of weak learners
* Best real-world performance
* Most balanced precision–recall
* Robust to noise
* Best generalization to unseen data

⬇️

### **Model summaries**

![image](https://github.com/user-attachments/assets/placeholder16)

![image](https://github.com/user-attachments/assets/placeholder17)

![image](https://github.com/user-attachments/assets/placeholder18)
# 🏆 6. Winning Model

Judged on realistic generalization, the selected model is:

# **🏆 Gradient Boosting Classifier**

### Why this one?

* Best tradeoff between false positives and false negatives
* Highest realistic F1-score
* Handles imbalanced patterns better
* Robust to feature noise and outliers
* Most realistic generalization

(Section 8 revisits this comparison on the classification task, where Logistic Regression scored perfectly on the test set and was ultimately exported as the winning classifier.)
## 7. Regression-to-Classification

### 7.1 Creating Classes from the Numeric Target (Median Split)

In this part we reframed the original regression target **ArrDelay** into a binary classification target.

We computed the **median arrival delay on the training set** (≈ −5 minutes) and used it as the threshold (a code sketch follows at the end of this subsection):

- **Class 0 – Low delay:** `ArrDelay < median`
  (the flight is on time, or earlier than a typical flight in the dataset).
- **Class 1 – High delay:** `ArrDelay ≥ median`
  (the flight is more delayed than a typical flight).

The same rule was applied to both **train and test** targets, using the **same engineered features** as in the regression part.
This keeps the classification task aligned with the original question:

> *"How large will the arrival delay be?"*

now phrased as

> *"Will this flight have a higher-than-typical delay or not?"*
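A minimal sketch of the median split, assuming the `train`/`test` DataFrames from the earlier 80/20 split:

```python
# The threshold comes from the training set only and is reused on the test set.
median_delay = train["ArrDelay"].median()  # about -5 minutes on this sample

y_train = (train["ArrDelay"] >= median_delay).astype(int)  # 1 = high delay
y_test = (test["ArrDelay"] >= median_delay).astype(int)
```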
### 7.2 Checking Class Balance

After creating the classes, we examined their distribution:

- **Training set:**
  about **50.6% High delay (Class 1)** and **49.4% Low delay (Class 0)**.
- **Test set:**
  about **51.3% Low delay (Class 0)** and **48.7% High delay (Class 1)**.

The classes are therefore **well balanced**, and neither class is clearly under-represented.

Because of this balance, **accuracy** is already informative, but to avoid being misled in edge cases and to keep the focus on the "High delay" class, we mainly compared models using the **F1-score** (which combines precision and recall for the positive class).
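The check itself is one line per split; a sketch, assuming the `y_train`/`y_test` Series from Section 7.1:

```python
print(y_train.value_counts(normalize=True))  # ~0.494 / 0.506
print(y_test.value_counts(normalize=True))   # ~0.513 / 0.487
```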
📌 *Class distribution in train/test:*

![image](https://github.com/user-attachments/assets/placeholder19)

![image](https://github.com/user-attachments/assets/placeholder20)
## 8. Train & Evaluate Classification Models

### 8.1 Precision vs. Recall – What Matters More?

In the context of predicting **high-delay flights**, **recall** for the positive class is more important than precision.

The reason: missing a truly delayed flight (a false negative) is operationally worse than mistakenly flagging an on-time flight as delayed (a false positive).
A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions, while a false alarm only causes minor adjustments such as extra buffer time.
---

### 8.2 False Positives vs. False Negatives – Which Is Worse?

- A **false positive** means predicting "high delay" when the flight is actually low-delay.
- A **false negative** means predicting "low delay" when the flight is actually highly delayed.

In our task, **false negatives are more critical**, because they leave planners unprepared for major delays.
False positives are less harmful: they may cause unnecessary caution, but they do not create operational failures.

---
### 8.3 Training Three Classification Models

We trained and evaluated three different models from scikit-learn, using the same engineered features and the binary target created in Part 7 (a sketch follows the list):

1. **Logistic Regression**
2. **Random Forest Classifier**
3. **Gradient Boosting Classifier**
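A minimal sketch of the training loop, reusing the preprocessed matrices from Section 4 (hyperparameters are scikit-learn defaults, not necessarily those used in the notebook):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train_fe, y_train)
```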
📌 *Screenshots of the training code and outputs:*

|
| 288 |
+
|
| 289 |
+

|
| 290 |
+
|
| 291 |
+

|
| 292 |
+
|
| 293 |
+

|
| 294 |
+
|
| 295 |
+

|
| 296 |
+
|
| 297 |
+

|
| 298 |
+
|
| 299 |
+
|
| 300 |
+
### 8.4 Model Evaluation

For each model we generated (a sketch follows the list):

- `classification_report` (precision, recall, F1-score, support)
- Confusion matrix
- An interpretation of the types of errors the model makes
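A minimal sketch of the evaluation loop, reusing the fitted `models` dictionary from Section 8.3:

```python
from sklearn.metrics import classification_report, confusion_matrix

for name, model in models.items():
    y_pred = model.predict(X_test_fe)
    print(f"=== {name} ===")
    print(classification_report(y_test, y_pred, target_names=["Low delay", "High delay"]))
    print(confusion_matrix(y_test, y_pred))  # rows = actual class, columns = predicted
```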
Below is a summary of the results:
#### **Logistic Regression**
- Achieved **perfect classification** on the test set (F1 = 1.00).
- The confusion matrix shows **0 errors**.
- This suggests the engineered features were highly separable.

#### **Random Forest Classifier**
- F1-score ≈ **0.79**
- Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).
- The confusion matrix shows the model tends to **miss high-delay flights** (false negatives).

#### **Gradient Boosting Classifier**
- F1-score ≈ **0.85**
- Better balance between precision and recall compared to Random Forest.
- Fewer false negatives than Random Forest and more consistent performance overall.
### 8.5 Which Model Performs Best – and Why?

The **best model is Logistic Regression**, because:

- It achieves **perfect predictive performance** on this dataset.
- It cleanly separates the engineered feature space into the two classes.
- It avoids the false negatives that are most critical in this task.
- Its confusion matrix shows **zero misclassifications**.

While this may indicate a highly separable dataset rather than model superiority alone, within the scope of this assignment **it is the clear winner**.

---
### 8.6 Winner: Exporting and Uploading the Model

We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository (a sketch follows below):

- **File:** `winning_classifier_model.pkl`
- Stored alongside the earlier regression winning-model file:
  - `winning_model.pkl`

Both files live in the same HuggingFace model repository, as required.
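A minimal sketch of the export and upload, assuming a prior `huggingface-cli login`; the repo id is a placeholder:

```python
import pickle

from huggingface_hub import HfApi

# Serialize the fitted winner.
with open("winning_classifier_model.pkl", "wb") as f:
    pickle.dump(models["Logistic Regression"], f)

# Upload next to the earlier regression winner (repo id is illustrative).
HfApi().upload_file(
    path_or_fileobj="winning_classifier_model.pkl",
    path_in_repo="winning_classifier_model.pkl",
    repo_id="your-username/flight-delay-predictor",
    repo_type="model",
)
```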
# 🎥 9. Video Presentation

The presentation covers:

* A quick dataset overview
* Key EDA takeaways
* How the features were encoded and engineered
* An explanation of each model
* Confusion matrices
* Which model won, and why
* A summary of lessons learned

⬇️

[Watch Presentation](https://your-video-link.com)