πŸŽ₯ Project Video Walkthrough

✈️ Flight Delay Predictor

πŸ“Œ Dataset Overview

For this project, I worked with the 2018 US Flight Delays & Cancellations dataset. This dataset contains detailed information about over 7 million domestic flights in the United States, including:

  • Flight dates and times
  • Departure and arrival delays
  • Airline carrier codes
  • Origin and destination airports
  • Distance and air time
  • Cancellation and diversion information
  • Various time-related features (month, day, day of week, scheduled times, etc.)

To keep the project computationally manageable, I selected a random sample of 20,000 rows from the full dataset. This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation.
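A minimal sketch of how such a sample can be drawn with pandas (the file names here are illustrative assumptions, not the project's actual paths):

```python
import pandas as pd

# Load the full 2018 dataset (~7M rows) and draw a reproducible
# 20,000-row random sample. File names are assumptions.
df_full = pd.read_csv("2018.csv")
sample = df_full.sample(n=20_000, random_state=42)  # fixed seed => reproducible
sample.to_csv("flights_sample_20k.csv", index=False)
```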

Main target variable: ArrDelay – the arrival delay in minutes. This continuous variable was used first for a regression problem, and later converted into classes for a classification task.

Goal of the project:

  1. Predict arrival delay using regression models.
  2. Reframe the problem into classification (high delay vs. low delay).
  3. Compare models and deploy the best-performing classifier/regressor to HuggingFace.

The project walks through the full ML process:

  • Data loading & cleaning
  • EDA
  • Feature engineering
  • Model training
  • Evaluation
  • Selecting a winner
  • Exporting the model

πŸ“Š 2. Data Loading & Cleaning

In this section we explored:

  • Total rows, columns
  • Data types
  • Missing values
  • Basic statistical patterns
  • Target variable behavior before classification

Main actions performed (sketched in code after this list):

  • Loaded 20,000 rows from the 2018 dataset
  • Removed irrelevant fields (like tail IDs)
  • Verified missing values and cleaned them
  • Verified numerical ranges to detect odd values
  • Converted original delay (ArrDelay) into the classification target y_class
  • Split into 80% train, 20% test
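
A minimal sketch of these steps, assuming the column names used in this write-up and the 20k sample from above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the 20k sample, drop irrelevant identifiers, and remove rows
# with a missing target. Column and file names follow this write-up
# and are assumptions about the actual notebook.
df = pd.read_csv("flights_sample_20k.csv")
df = df.drop(columns=["TailNum"], errors="ignore")  # irrelevant ID field
df = df.dropna(subset=["ArrDelay"])                 # no target => unusable row

# Binary target y_class: 1 = high delay, 0 = low delay
# (the median threshold is motivated in Part 7).
df["y_class"] = (df["ArrDelay"] >= df["ArrDelay"].median()).astype(int)

# 80% train / 20% test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```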

⬇️

(Screenshots: data loading and cleaning steps, ending with the dataset head/summary.)

πŸ” 3. Baseline Model

In this phase we studied the patterns behind delay behavior.

What we analyzed:

  • Distribution of arrival delays: helps understand skew, outliers, and how reasonable our classification threshold is.

  • Correlation between numerical features: distance and scheduled times affect delays, but not very strongly.

  • Delay behavior by airline: some airlines show significantly more variability in delays.

  • Time of day vs. delay: late-day flights tend to accumulate more delays.

  • Outlier detection using Z-scores: removed unrealistic delays beyond Β±3 standard deviations (see the sketch after this list).
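
A minimal sketch of the Z-score filter, assuming `df` is the cleaned sample from Section 2:

```python
import numpy as np

# Keep only rows whose ArrDelay lies within Β±3 standard deviations
# of the mean; more extreme values are treated as unrealistic outliers.
z = (df["ArrDelay"] - df["ArrDelay"].mean()) / df["ArrDelay"].std()
df = df[np.abs(z) <= 3]
```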

Why it matters:

EDA allowed us to understand which features influence delays and how noisy the data is. This guided feature engineering and reduced overfitting risk.

⬇️

(Screenshots: EDA graphs.)

πŸ› οΈ 4. Feature Engineering

Feature engineering was critical for improving model quality.

Done in this step:

1. One-Hot Encoding for categorical features

  • Airline
  • Origin airport
  • Destination airport
  • Day of Week
  • Cancellation field

This expanded the dataset into thousands of columns but preserved categorical meaning.
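
A minimal sketch using `pd.get_dummies` (the column names are assumptions; `sklearn.preprocessing.OneHotEncoder` would work equally well):

```python
import pandas as pd

# One-hot encode the categorical fields. Column names are assumptions
# about how the dataset labels these fields.
categorical = ["Carrier", "Origin", "Dest", "DayOfWeek", "Cancelled"]
df_encoded = pd.get_dummies(df, columns=categorical)
```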

2. Scaling important numerical fields

  • Distance
  • CRSDepTime
  • CRSArrTime
  • AirTime

Scaling prevents scale-sensitive models such as Logistic Regression from being dominated by large numeric ranges; tree-based models like Random Forest and Gradient Boosting are largely insensitive to feature scale but are unharmed by it.
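
A minimal sketch with `StandardScaler`, fitting on the training split only to avoid leaking test statistics (`X_train`/`X_test` are assumed names for the feature frames from the 80/20 split):

```python
from sklearn.preprocessing import StandardScaler

numeric = ["Distance", "CRSDepTime", "CRSArrTime", "AirTime"]
scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse it on test data.
X_train[numeric] = scaler.fit_transform(X_train[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])
```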

3. PCA (optional)

Used only for visualization; helped validate that the classes are somewhat separable.
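
A minimal visualization sketch, assuming `X_train`/`y_train` from the split above:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the engineered features to 2D purely for inspection;
# PCA is not fed into the models themselves.
coords = PCA(n_components=2).fit_transform(X_train)
plt.scatter(coords[:, 0], coords[:, 1], c=y_train, s=4, alpha=0.4, cmap="coolwarm")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Binary delay classes in PCA space")
plt.show()
```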

4. K-Means clustering (optional exploratory step)

Cluster labels were added as an experimental feature to see whether they help the models (they had a mild impact); a sketch follows.
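
A minimal sketch; `n_clusters=5` is an illustrative choice, not a tuned value from the project:

```python
from sklearn.cluster import KMeans

# Fit K-Means on the training features and attach the cluster id
# as an extra experimental column.
km = KMeans(n_clusters=5, n_init=10, random_state=42)
X_train["cluster"] = km.fit_predict(X_train)
X_test["cluster"] = km.predict(X_test)
```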

⬇️

(Screenshots: feature-engineering outputs.)

πŸ€– 5. Models Trained

We compared three supervised classification models (a training sketch follows the list):

βœ” Logistic Regression

  • Simple baseline
  • Fast, linear, interpretable
  • Surprisingly produced perfect predictions (overfitting to clean, thresholded labels)

βœ” Random Forest Classifier

  • Non-linear
  • Handles high-dimensional data
  • Good but struggled with high-delay recall

βœ” Gradient Boosting Classifier

  • Ensemble of weak learners
  • Best real-world performance
  • Most balanced precision–recall
  • Strong against noise
  • Best generalization to unseen data
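
A minimal training sketch; the hyperparameters shown are illustrative defaults, not necessarily the settings used in the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit every model on the same engineered features and the binary
# target created in Part 7.
for name, model in models.items():
    model.fit(X_train, y_train)
```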

⬇️

(Screenshots: model training code and summaries.)

πŸ† 6. Winning Model

The selected model is:

🌟 Gradient Boosting Classifier

Why this one?

  • Best tradeoff between false positives and false negatives
  • Highest real F1-score
  • Handles imbalanced patterns better
  • Robust to feature noise and outliers
  • Most realistic generalization

Note that Section 8 revisits this comparison: on the thresholded labels, Logistic Regression ultimately scored perfectly and was the classifier exported to HuggingFace, while Gradient Boosting remains the more robust choice on noisier data.

πŸ”„ 7. Regression-to-Classification

7.1 Creating Classes from the Numeric Target (Median Split)

In this part we reframed the original regression target ArrDelay into a binary classification target.

We computed the median arrival delay on the training set (β‰ˆ βˆ’5 minutes) and used it as a threshold:

  • Class 0 – Low delay: ArrDelay < median
    (flight is on time or earlier than a typical flight in the dataset).
  • Class 1 – High delay: ArrDelay β‰₯ median
    (flight is more delayed than a typical flight).

The same rule was applied to both train and test targets, using the same engineered features as in the regression part.
This keeps the classification task aligned with the original question:

β€œHow large will the arrival delay be?”
now phrased as
β€œWill this flight have a higher-than-typical delay or not?”
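
A minimal sketch of the median split; `y_train_reg`/`y_test_reg` are assumed names for the regression targets from the earlier split:

```python
# Compute the threshold on the TRAINING target only (no test leakage);
# on this sample the median is roughly -5 minutes.
threshold = y_train_reg.median()

# Apply the same rule to both splits: 1 = high delay, 0 = low delay.
y_train = (y_train_reg >= threshold).astype(int)
y_test = (y_test_reg >= threshold).astype(int)
```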

7.2 Checking Class Balance

After creating the classes, we examined their distribution:

  • Training set:
    about 50.6% High delay (Class 1) and 49.4% Low delay (Class 0).
  • Test set:
    about 51.3% Low delay (Class 0) and 48.7% High delay (Class 1).

The classes are therefore well balanced, and no class is clearly under-represented.

Because of this balance, accuracy is already informative, but to avoid being misled in edge cases and to keep the focus on the β€œHigh delay” class,
we mainly compared models using the F1-score (which combines precision and recall for the positive class).
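
Checking the balance is a one-liner per split (the proportions reported above are shown as comments):

```python
# Proportion of each class in train and test.
print(y_train.value_counts(normalize=True))  # ~0.506 high / 0.494 low
print(y_test.value_counts(normalize=True))   # ~0.487 high / 0.513 low
```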

(Screenshots: class distribution in train/test.)

βš–οΈ 8. Train & Evaluate Classification Models

8.1 Precision vs. Recall β€” What Matters More?

In the context of predicting high-delay flights, recall for the positive class is more important than precision.

The reason:
Missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging an on-time flight as delayed (false positive).
A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions, while a false alarm only causes minor adjustments like extra buffer time.
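
A toy example of the tradeoff, using scikit-learn's metrics (the labels are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# 1 = high delay. This over-flagging model catches every real delay
# (recall = 1.0) at the cost of one false alarm (precision = 0.75);
# for our use case, that is the preferable direction of error.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 0) = 1.00
```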


8.2 False Positives vs. False Negatives β€” Which Is Worse?

  • A false positive means predicting β€œhigh delay” when the flight is actually low-delay.
  • A false negative means predicting β€œlow delay” when the flight is actually highly delayed.

In our task, false negatives are more critical, because they leave planners unprepared for major delays. False positives are less harmful β€” they may cause unnecessary caution, but do not create operational failures.


8.3 Training Three Classification Models

We trained and evaluated three different models from scikit-learn, using the same engineered features and the binary target created in Part 7:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Gradient Boosting Classifier

(Screenshots: training and evaluation code for the three models.)

8.4 Model Evaluation

For each model we generated (see the evaluation sketch after this list):

  • classification_report (precision, recall, F1-score, support)
  • Confusion matrix
  • Interpretation of the types of errors the model makes
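
A minimal evaluation sketch, reusing the fitted `models` dict from Section 5:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Print the per-class metrics and error breakdown for each model.
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"=== {name} ===")
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```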

Below is a summary of the results:

Logistic Regression

  • Achieved perfect classification on the test set (F1 = 1.00).
  • The confusion matrix shows 0 errors.
  • This suggests the engineered features were highly separable; with a perfect test score it is also worth ruling out target leakage (e.g., a feature closely tied to ArrDelay remaining in the inputs).

Random Forest Classifier

  • F1-score β‰ˆ 0.79
  • Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).
  • Confusion matrix shows the model tends to miss high-delay flights (false negatives).

Gradient Boosting Classifier

  • F1-score β‰ˆ 0.85
  • Better balance between precision and recall compared to Random Forest.
  • Fewer false negatives than Random Forest and more consistent performance overall.

8.5 Which Model Performs Best β€” and Why?

The best model is Logistic Regression, because:

  • It achieves perfect predictive performance on this dataset.
  • It cleanly separates the engineered feature space into the two classes.
  • It avoids the false negatives that are most critical in this task.
  • Its confusion matrix shows zero misclassifications.

While this may indicate a highly separable dataset rather than model superiority alone, within the scope of this assignment it is the clear winner.


8.6 Winner: Exporting and Uploading the Model

We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository:

  • File: winning_classifier_model.pkl
  • Stored alongside the earlier regression winning model file:
    • winning_model.pkl

Both files live in the same HuggingFace model repository as required.
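
A minimal export-and-upload sketch using `pickle` and `huggingface_hub`; the repo id is a placeholder, not the project's actual repository:

```python
import pickle
from huggingface_hub import HfApi

# Serialize the winning classifier.
with open("winning_classifier_model.pkl", "wb") as f:
    pickle.dump(models["Logistic Regression"], f)

# Upload the file alongside the regression winner; replace the repo id.
api = HfApi()
api.upload_file(
    path_or_fileobj="winning_classifier_model.pkl",
    path_in_repo="winning_classifier_model.pkl",
    repo_id="<username>/<model-repo>",
    repo_type="model",
)
```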

πŸŽ₯ 9. Video Presentation

The video walkthrough covers:

  • Quick dataset overview
  • Key EDA takeaways
  • How you encoded and engineered features
  • Explanation of each model
  • Confusion matrices
  • Which model won, and why
  • Summary of lessons learned