
## ๐ŸŽฅ Project Video Walkthrough

<video controls width="720">
  <source src="https://huggingface.co/maorsoul/flight-delay-predictor/resolve/main/video1534610661.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

# โœˆ๏ธ Flight Delay Predictor

## ๐Ÿ“Œ Dataset Overview

For this project, I worked with the **2018 US Flight Delays & Cancellations** dataset.
This dataset contains detailed information about **over 7 million domestic flights in the United States**, including:

* Flight dates and times
* Departure and arrival delays
* Airline carrier codes
* Origin and destination airports
* Distance and air time
* Cancellation and diversion information
* Various time-related features (month, day, day of week, scheduled times, etc.)

To keep the project computationally manageable, I selected a **random sample of 20,000 rows** from the full dataset.
This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation.

**Main target variable:**
`ArrDelay` โ€“ the arrival delay in minutes.
This continuous variable was used first for a regression problem, and later converted into classes for a classification task.

**Goal of the project:**

1. Predict arrival delay using regression models.
2. Reframe the problem into classification (high delay vs. low delay).
3. Compare models and deploy the best-performing classifier/regressor to HuggingFace.

The project walks through the full ML process:

* Data loading & cleaning
* EDA
* Feature engineering
* Model training
* Evaluation
* Selecting a winner
* Exporting the model


# ๐Ÿ“Š 2. Exploratory Data Analysis (EDA)

In this section we explored:

* Total rows, columns
* Data types
* Missing values
* Basic statistical patterns
* Target variable behavior before classification

**Main actions performed:**

* Loaded 20,000 rows from the 2018 dataset
* Removed irrelevant fields (like tail IDs)
* Verified missing values and cleaned them
* Verified numerical ranges to detect odd values
* Converted original delay (`ArrDelay`) into the classification target `y_class`
* Split into 80% train, 20% test
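
The cleaning and splitting steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: a small synthetic frame (using a few of the column names from the dataset description) stands in for `pd.read_csv` on the real 2018 CSV so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in for pd.read_csv("2018.csv"): a small synthetic frame with a few of
# the columns described above (ArrDelay target, numeric fields, an ID field).
n = 1_000
df = pd.DataFrame({
    "ArrDelay": rng.normal(5, 30, n),
    "DepDelay": rng.normal(8, 25, n),
    "Distance": rng.integers(100, 2500, n),
    "TailNum":  [f"N{i:05d}" for i in range(n)],  # identifier, not predictive
})
df.loc[rng.choice(n, 30, replace=False), "ArrDelay"] = np.nan  # simulated gaps

# Random sample (20,000 rows in the real project), capped at the frame size.
sample = df.sample(n=min(20_000, len(df)), random_state=42)

# Drop irrelevant identifier fields and rows missing the target.
sample = sample.drop(columns=["TailNum"]).dropna(subset=["ArrDelay"])

# 80% train / 20% test split.
train_df, test_df = train_test_split(sample, test_size=0.2, random_state=42)
```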


![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.37.33](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/H5TkmTdvamGzCX3tnkWbK.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.38.47](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ATuj1DhNFu4IOKADBVvfT.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.05](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/uVzJysvmUNKrI6dGyDxJU.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/dFULLmyeCowD54qkHMV3J.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ecNxcebQio2SOgl63r93a.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.39.57](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-G9t6hG5_-q9pBHxqN7rT.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cL1WcEpM2edSFbPHKSiUo.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.19](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/oY-wiihgzmlMtMvqzIZFK.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.40.29](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cYOZt4qv4fOxWr8RxQkfu.png)


### Dataset head and summary


![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.41.55](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/OzNiiXLyr8accJYlArisL.png)


# ๐Ÿ” 3. Baseline Model

In this phase we studied the patterns behind delay behavior.

### What we analyzed:

* **Distribution of arrival delays**
  Helps understand skew, outliers, and how reasonable our classification threshold is.

* **Correlation between numerical features**
  Distance and scheduled departure/arrival times correlate with delays, but only moderately.

* **Delay behavior by airline**
  Some airlines have significantly more variability in delays.

* **Time of day vs delay**
  Late-day flights tend to accumulate more delays.

* **Outlier detection using Z-score**
  Removed unrealistic delays > ยฑ3 standard deviations.
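
The Z-score filter from the last bullet can be sketched like this, with synthetic delays standing in for the real `ArrDelay` column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic arrival delays plus two clearly unrealistic values.
delays = pd.Series(np.concatenate([rng.normal(5, 30, 500), [2000.0, -1500.0]]))

# Z-score per observation; keep only rows within 3 standard deviations.
z = (delays - delays.mean()) / delays.std()
kept = delays[z.abs() <= 3]
```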

### Why it matters:

EDA allowed us to understand which features influence delays and how noisy the data is.
This guided feature engineering and reduced overfitting risk.

### Delay pattern graphs

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/M3ETCvLN0Rf_AItthFbw3.png)


![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.28](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-m8eDsCZWlH-AxGvrtZgS.png)


![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.44.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/rXCdxRxcnJap6b9U28-d9.png)


# ๐Ÿ› ๏ธ 4. Feature Engineering

Feature engineering was critical for improving model quality.

### Done in this step:

#### **1. One-Hot Encoding for categorical features**

* Airline
* Origin airport
* Destination airport
* Day of Week
* Cancellation field

This expanded the dataset into thousands of columns but preserved categorical meaning.

#### **2. Scaling important numerical fields**

* Distance
* CRSDepTime
* CRSArrTime
* AirTime

Scaling prevents scale-sensitive models such as Logistic Regression from being dominated by features with large numeric ranges. (Tree ensembles like Random Forest and Gradient Boosting are largely scale-invariant, but scaling does them no harm.)
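
A minimal sketch of the encoding + scaling step, assuming the column names from the dataset description. `ColumnTransformer` is one common way to combine both transforms; the notebook may equally have used `pd.get_dummies` plus a separate `StandardScaler`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in frame; column names follow the dataset description.
df = pd.DataFrame({
    "Carrier":    ["AA", "DL", "UA", "AA"],
    "Origin":     ["JFK", "ATL", "ORD", "JFK"],
    "Dest":       ["LAX", "SFO", "DEN", "SFO"],
    "DayOfWeek":  [1, 5, 3, 1],
    "Distance":   [2475, 2139, 888, 2586],
    "CRSDepTime": [900, 1330, 615, 1745],
})

pre = ColumnTransformer([
    # One-hot encoding for the categorical fields.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["Carrier", "Origin", "Dest", "DayOfWeek"]),
    # Standard scaling for the numeric fields.
    ("num", StandardScaler(), ["Distance", "CRSDepTime"]),
])

X = pre.fit_transform(df)  # one column per category level + 2 scaled numerics
```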

#### **3. PCA (optional)**

Used only for visualization; helped validate that the classes are somewhat separable.

#### **4. K-Means clustering (optional exploratory step)**

Cluster labels added as an experimental feature to see if they help models (they had mild impact).
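
Both optional steps can be sketched together; random data stands in for the engineered feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the engineered feature matrix

# PCA down to 2 components, used in the project only for visualization.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

# K-Means cluster labels, appended as an experimental extra feature.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
```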

### Feature engineering graphs

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/T7pjOhFJL1Zn54OroFK9T.png)


![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.26](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/Tq6yVLGH-w8tLty1rbNQG.png)


# ๐Ÿค– 5. Models Trained

We compared **three supervised classification models**:

### โœ” Logistic Regression

* Simple baseline
* Fast, linear, interpretable
* Surprisingly produced perfect predictions, more likely a sign of a highly separable (possibly leaky) thresholded target than of genuine model skill

### โœ” Random Forest Classifier

* Non-linear
* Handles high-dimensional data
* Good but struggled with high-delay recall

### โœ” Gradient Boosting Classifier

* Ensemble of weak learners
* Best real-world performance
* Most balanced precisionโ€“recall
* Strong against noise
* Best generalization to unseen data

### Model comparison screenshots

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.45.46](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/LWGxTGHU-gYW2QRFOjhm_.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.46.01](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/WelQ-fbqravyTW1nYv4pW.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.46.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/kA863njS1KJj4ZvvIFvAq.png)


# ๐Ÿ† 6. Winning Model

The selected model is:

# **๐ŸŒŸ Gradient Boosting Classifier**

### Why this one?

* Best tradeoff between false positives and false negatives
* Highest real F1-score
* Handles imbalanced patterns better
* Robust to feature noise and outliers
* Most realistic generalization

Note that Part 8 revisits this comparison on the binary target: there, Logistic Regression reached perfect test scores and was the classifier ultimately exported, while Gradient Boosting remained the most robust real-world performer.

## 7. Regression-to-Classification

### 7.1 Creating Classes from the Numeric Target (Median Split)

In this part we reframed the original regression target **ArrDelay** into a
binary classification target.

We computed the **median arrival delay on the training set** (โ‰ˆ โˆ’5 minutes) and
used it as a threshold:

- **Class 0 โ€“ Low delay:** `ArrDelay < median`  
  (flight is on time or earlier than a typical flight in the dataset).
- **Class 1 โ€“ High delay:** `ArrDelay โ‰ฅ median`  
  (flight is more delayed than a typical flight).

The same rule was applied to both **train and test** targets, using the **same
engineered features** as in the regression part.  
This keeps the classification task aligned with the original question:
> *โ€œHow large will the arrival delay be?โ€*  
now phrased as  
> *โ€œWill this flight have a higher-than-typical delay or not?โ€*
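
A minimal sketch of the median split, with toy delay values in place of the real `ArrDelay` series (note the threshold is computed on the training set only):

```python
import pandas as pd

# Stand-in arrival delays; the project uses the real ArrDelay column.
y_train = pd.Series([-20, -8, -5, -3, 0, 4, 15, 42], dtype=float)
y_test  = pd.Series([-12, -5, 7, 60], dtype=float)

# Threshold computed on the TRAINING set only, then applied to both splits.
threshold = y_train.median()

y_train_class = (y_train >= threshold).astype(int)  # 1 = high delay
y_test_class  = (y_test  >= threshold).astype(int)
```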


### 7.2 Checking Class Balance

After creating the classes, we examined their distribution:

- **Training set:**  
  about **50.6% High delay (Class 1)** and **49.4% Low delay (Class 0)**.
- **Test set:**  
  about **51.3% Low delay (Class 0)** and **48.7% High delay (Class 1)**.

The classes are therefore **well balanced**, and no class is clearly
under-represented.

Because of this balance, **accuracy** is already informative, but to avoid
being misled in edge cases and to keep the focus on the โ€œHigh delayโ€ class,  
we mainly compared models using the **F1-score** (which combines precision and
recall for the positive class).

Class distribution in train/test:

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.55.24](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/fSo9lYbeNOK6_8qrBtFRc.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.55.39](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-kh6L76mQaE9tJv4nymxA.png)

## 8. Train & Evaluate Classification Models

### 8.1 Precision vs. Recall โ€” What Matters More?

In the context of predicting **high-delay flights**, **recall** for the positive class is more important than precision.

The reason:  
Missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging
an on-time flight as delayed (false positive).  
A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions,
while a false alarm only causes minor adjustments like extra buffer time.

---

### 8.2 False Positives vs. False Negatives โ€” Which Is Worse?

- A **false positive** means predicting โ€œhigh delayโ€ when the flight is actually low-delay.  
- A **false negative** means predicting โ€œlow delayโ€ when the flight is actually highly delayed.

In our task, **false negatives are more critical**, because they leave planners unprepared for major delays.
False positives are less harmful โ€” they may cause unnecessary caution, but do not create operational failures.

---

### 8.3 Training Three Classification Models

We trained and evaluated three different models from scikit-learn, using the same engineered features
and the binary target created in Part 7:

1. **Logistic Regression**  
2. **Random Forest Classifier**  
3. **Gradient Boosting Classifier**
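
The training loop can be sketched as follows; `make_classification` stands in for the engineered features and the binary delay target, and the hyperparameters shown are illustrative defaults, not necessarily the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features + binary delay target.
X, y = make_classification(n_samples=600, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # test-set accuracy
```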

Training code and outputs:

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.57.44](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/vHzlqE8vnf7tRBxACgY-V.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.57.59](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o4D5WklP1INIFBvvubdf3.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.14](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o9L76PgbO7hWmyEZIfQHL.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/5BaZaCtq0RDU4Sg_kAneC.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.36](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/0YggL_58zalfn50WokKf0.png)

![ืฆื™ืœื•ื ืžืกืš 2025-11-29 ื‘-9.58.48](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/S896FbQSOUX4pQTl5ym4d.png)


### 8.4 Model Evaluation

For each model we generated:

- `classification_report` (precision, recall, F1-score, support)  
- Confusion matrix  
- Interpretation of the types of errors the model makes
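
A sketch of the evaluation pass for one model (synthetic data again stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(classification_report(y_te, y_pred))  # precision / recall / F1 / support
cm = confusion_matrix(y_te, y_pred)         # rows: true class, cols: predicted
f1 = f1_score(y_te, y_pred)                 # F1 for the positive ("high delay") class
```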

Below is a summary of the results:

#### **Logistic Regression**
- Achieved **perfect classification** on the test set (F1 = 1.00).  
- The confusion matrix shows **0 errors**.  
- This suggests the engineered features were highly separable.


#### **Random Forest Classifier**
- F1-score โ‰ˆ **0.79**  
- Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).  
- Confusion matrix shows the model tends to **miss high-delay flights** (false negatives).


#### **Gradient Boosting Classifier**
- F1-score โ‰ˆ **0.85**  
- Better balance between precision and recall compared to Random Forest.  
- Fewer false negatives than Random Forest and more consistent performance overall.


### 8.5 Which Model Performs Best โ€” and Why?

The **best model is the Logistic Regression**, because:

- It achieves **perfect predictive performance** on this dataset.
- It cleanly separates the engineered feature space into the two classes.
- It avoids the false negatives that are most critical in this task.
- Its confusion matrix shows **zero misclassifications**.

While this may indicate a highly separable dataset rather than model superiority alone,
within the scope of this assignment **it is the clear winner**.

---

### 8.6 Winner: Exporting and Uploading the Model

We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository:

- **File:** `winning_classifier_model.pkl`  
- Stored alongside the earlier regression winning model file:
  - `winning_model.pkl`

Both files live in the same HuggingFace model repository as required.
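
The export step can be sketched as follows. The pickle part runs as-is on a stand-in model; the `huggingface_hub` upload is shown commented out because it needs an authenticated token (the `repo_id` is taken from the video URL above):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in "winning" classifier on synthetic data.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
winner = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the winning classifier to a pickle file.
path = os.path.join(tempfile.gettempdir(), "winning_classifier_model.pkl")
with open(path, "wb") as f:
    pickle.dump(winner, f)

# Upload to the HuggingFace model repo (needs huggingface_hub and a token):
# from huggingface_hub import upload_file
# upload_file(path_or_fileobj=path,
#             path_in_repo="winning_classifier_model.pkl",
#             repo_id="maorsoul/flight-delay-predictor")

# Round-trip check: the reloaded model predicts identically.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
```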



# ๐ŸŽฅ 9. Video Presentation

The recording covers:

* Quick dataset overview
* Key EDA takeaways
* How you encoded and engineered features
* Explanation of each model
* Confusion matrices
* Which model won, and why
* Summary of lessons learned