File size: 13,881 Bytes
8901fa7 59eb913 8901fa7 3d1d27b 8901fa7 2bf2ddf 3d1d27b 8901fa7 3d1d27b 8901fa7 3d1d27b 8901fa7 3d1d27b 8901fa7 ec6bb00 8901fa7 ec6bb00 8901fa7 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 |
## ๐ฅ Project Video Walkthrough
<video controls width="720">
<source src="https://huggingface.co/maorsoul/flight-delay-predictor/resolve/main/video1534610661.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
# โ๏ธ Flight Delay Predictor
## ๐ Dataset Overview
For this project, I worked with the **2018 US Flight Delays & Cancellations** dataset.
This dataset contains detailed information about **over 7 million domestic flights in the United States**, including:
* Flight dates and times
* Departure and arrival delays
* Airline carrier codes
* Origin and destination airports
* Distance and air time
* Cancellation and diversion information
* Various time-related features (month, day, day of week, scheduled times, etc.)
To keep the project computationally manageable, I selected a **random sample of 20,000 rows** from the full dataset.
This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation.
**Main target variable:**
`ArrDelay` โ the arrival delay in minutes.
This continuous variable was used first for a regression problem, and later converted into classes for a classification task.
**Goal of the project:**
1. Predict arrival delay using regression models.
2. Reframe the problem into classification (high delay vs. low delay).
3. Compare models and deploy the best-performing classifier/regressor to HuggingFace.
The project walks through the full ML process:
* Data loading & cleaning
* EDA
* Feature engineering
* Model training
* Evaluation
* Selecting a winner
* Exporting the model
# ๐ 2. Exploratory Data Analysis (EDA)
In this section we explored:
* Total rows, columns
* Data types
* Missing values
* Basic statistical patterns
* Target variable behavior before classification
**Main actions performed:**
* Loaded 20,000 rows from the 2018 dataset
* Removed irrelevant fields (like tail IDs)
* Verified missing values and cleaned them
* Verified numerical ranges to detect odd values
* Converted original delay (`ArrDelay`) into the classification target `y_class`
* Split into 80% train, 20% test
โฌ๏ธ









### **Insert dataset head or summary as an image**

# ๐ 3. Baseline Model
In this phase we studied the patterns behind delay behavior.
### What we analyzed:
* **Distribution of arrival delays**
Helps understand skew, outliers, and how reasonable our classification threshold is.
* **Correlation between numerical features**
Found that distance and scheduled times impact delays but not extremely strongly.
* **Delay behavior by airline**
Some airlines have significantly more variability in delays.
* **Time of day vs delay**
Late-day flights tend to accumulate more delays.
* **Outlier detection using Z-score**
Removed unrealistic delays > ยฑ3 standard deviations.
### Why it matters:
EDA allowed us to understand which features influence delays and how noisy the data is.
This guided feature engineering and reduced overfitting risk.
โฌ๏ธ
### **Place graphs here**



# ๐ ๏ธ 4. Feature Engineering
Feature engineering was critical for improving model quality.
### Done in this step:
#### **1. One-Hot Encoding for categorical features**
* Airline
* Origin airport
* Destination airport
* Day of Week
* Cancellation field
This expanded the dataset into thousands of columns but preserved categorical meaning.
#### **2. Scaling important numerical fields**
* Distance
* CRSDepTime
* CRSArrTime
* AirTime
Scaling prevents models like Logistic Regression and Gradient Boosting from being biased by large numeric ranges.
#### **3. PCA (optional)**
Used only for visualization; helped validate that the classes are somewhat separable.
#### **4. K-Means clustering (optional exploratory step)**
Cluster labels added as an experimental feature to see if they help models (they had mild impact).
โฌ๏ธ
### **Place FE graphs here**


# ๐ค 5. Models Trained
We compared **three supervised classification models**:
### โ Logistic Regression
* Simple baseline
* Fast, linear, interpretable
* Surprisingly produced perfect predictions (overfitting to clean, thresholded labels)
### โ Random Forest Classifier
* Non-linear
* Handles high-dimensional data
* Good but struggled with high-delay recall
### โ Gradient Boosting Classifier
* Ensemble of weak learners
* Best real-world performance
* Most balanced precisionโrecall
* Strong against noise
* Best generalization to unseen data
โฌ๏ธ
### **Insert models summary image**



# ๐ 6. Winning Model
The selected model is:
# **๐ Gradient Boosting Classifier**
### Why this one?
* Best tradeoff between false positives and false negatives
* Highest real F1-score
* Handles imbalanced patterns better
* Robust to feature noise and outliers
* Most realistic generalization
## 7. Regression-to-Classification
### 7.1 Creating Classes from the Numeric Target (Median Split)
In this part we reframed the original regression target **ArrDelay** into a
binary classification target.
We computed the **median arrival delay on the training set** (โ โ5 minutes) and
used it as a threshold:
- **Class 0 โ Low delay:** `ArrDelay < median`
(flight is on time or earlier than a typical flight in the dataset).
- **Class 1 โ High delay:** `ArrDelay โฅ median`
(flight is more delayed than a typical flight).
The same rule was applied to both **train and test** targets, using the **same
engineered features** as in the regression part.
This keeps the classification task aligned with the original question:
> *โHow large will the arrival delay be?โ*
now phrased as
> *โWill this flight have a higher-than-typical delay or not?โ*
### 7.2 Checking Class Balance
After creating the classes, we examined their distribution:
- **Training set:**
about **50.6% High delay (Class 1)** and **49.4% Low delay (Class 0)**.
- **Test set:**
about **51.3% Low delay (Class 0)** and **48.7% High delay (Class 1)**.
The classes are therefore **well balanced**, and no class is clearly
under-represented.
Because of this balance, **accuracy** is already informative, but to avoid
being misled in edge cases and to keep the focus on the โHigh delayโ class,
we mainly compared models using the **F1-score** (which combines precision and
recall for the positive class).
๐ *Here I will insert a bar plot (or table screenshot) of the class
distribution in train/test.*


## 8. Train & Evaluate Classification Models
### 8.1 Precision vs. Recall โ What Matters More?
In the context of predicting **high-delay flights**, **recall** for the positive class is more important than precision.
The reason:
Missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging
an on-time flight as delayed (false positive).
A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions,
while a false alarm only causes minor adjustments like extra buffer time.
---
### 8.1 False Positives vs. False Negatives โ Which Is Worse?
- A **false positive** means predicting โhigh delayโ when the flight is actually low-delay.
- A **false negative** means predicting โlow delayโ when the flight is actually highly delayed.
In our task, **false negatives are more critical**, because they leave planners unprepared for major delays.
False positives are less harmful โ they may cause unnecessary caution, but do not create operational failures.
---
### 8.2 Training Three Classification Models
We trained and evaluated three different models from scikit-learn, using the same engineered features
and the binary target created in Part 7:
1. **Logistic Regression**
2. **Random Forest Classifier**
3. **Gradient Boosting Classifier**
๐ *Insert model training diagram or screenshots of code here (optional).*






### 8.3 Model Evaluation
For each model we generated:
- `classification_report` (precision, recall, F1-score, support)
- Confusion matrix
- Interpretation of the types of errors the model makes
Below is a summary of the results:
#### **Logistic Regression**
- Achieved **perfect classification** on the test set (F1 = 1.00).
- The confusion matrix shows **0 errors**.
- This suggests the engineered features were highly separable.
#### **Random Forest Classifier**
- F1-score โ **0.79**
- Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).
- Confusion matrix shows the model tends to **miss high-delay flights** (false negatives).
#### **Gradient Boosting Classifier**
- F1-score โ **0.85**
- Better balance between precision and recall compared to Random Forest.
- Fewer false negatives than Random Forest and more consistent performance overall.
### 8.3 Which Model Performs Best โ and Why?
The **best model is the Logistic Regression**, because:
- It achieves **perfect predictive performance** on this dataset.
- It cleanly separates the engineered feature space into the two classes.
- It avoids the false negatives that are most critical in this task.
- Its confusion matrix shows **zero misclassifications**.
While this may indicate a highly separable dataset rather than model superiority alone,
within the scope of this assignment **it is the clear winner**.
---
### 8.4 Winner: Exporting and Uploading the Model
We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository:
- **File:** `winning_classifier_model.pkl`
- Stored alongside the earlier regression winning model file:
- `winning_model.pkl`
Both files live in the same HuggingFace model repository as required.
# ๐ฅ 9. Video Presentation
Your recording should include:
* Quick dataset overview
* Key EDA takeaways
* How you encoded and engineered features
* Explanation of each model
* Confusion matrices
* Why Gradient Boosting won
* Summary of lessons learned
|