https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link
# Flight Delay Prediction: Full Project (Parts 1-8)

## Overview

- This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays.
- The work is structured into eight stages, following the assignment's required workflow.
## Dataset Overview

The dataset contains ~96K domestic U.S. flights. It includes:

- Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
- Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
- Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)

Goal: build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
## Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values: only a few columns contained missing values, and all were handled explicitly.
- Identified the relevant delay columns and analyzed their contributions (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly and hourly delays).
- Compared delays across airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays (a "snowball effect" of delays accumulating through the day).
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.
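The seasonality and time-of-day checks above reduce to simple groupby aggregations. A minimal sketch, assuming a pandas DataFrame with the dataset's MONTH, SCHEDULED_DEPARTURE (HHMM integer), and ARRIVAL_DELAY columns; the rows below are invented for illustration:

```python
import pandas as pd

# Tiny made-up sample standing in for the ~96K-flight dataset
df = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [900, 2100, 800, 2000, 1000, 2200],
    "ARRIVAL_DELAY": [5, 30, 2, 25, 10, 40],
})

# Seasonality: mean arrival delay per month
monthly = df.groupby("MONTH")["ARRIVAL_DELAY"].mean()

# Time of day: bucket HHMM departures into hours, then average
df["DEP_HOUR"] = df["SCHEDULED_DEPARTURE"] // 100
hourly = df.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()
```

Plotting `monthly` and `hourly` as bar charts is what surfaces the congestion months and the evening snowball pattern.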

## Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON), keeping only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated it using MAE, MSE, RMSE, and R².

Baseline results:

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train and test scores were close, indicating no overfitting.
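The baseline train/evaluate loop can be sketched end to end. Everything below is synthetic stand-in data, not the flight dataset; it only illustrates the mechanics and the four metrics listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for pre-takeoff features (month, hour, distance, ...)
X = rng.normal(size=(500, 3))
# Noisy linear target playing the role of ARRIVAL_DELAY
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5          # RMSE is just the square root of MSE
r2 = r2_score(y_test, pred)
```

Comparing the same metrics on the train split against the test split is what the overfitting check above refers to.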
## Feature Engineering

Multiple transformations were applied to enhance model performance.

### 4.1 Encoding

- One-Hot Encoding: AIRLINE
- Frequency Encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT
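A compact sketch of both encodings in pandas. Column names match the dataset, but the four rows are invented, and `ORIGIN_FREQ` is a hypothetical name for the encoded column:

```python
import pandas as pd

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "JFK", "ORD"],
})

# One-hot encode the low-cardinality AIRLINE column
df = pd.get_dummies(df, columns=["AIRLINE"], prefix="AIRLINE")

# Frequency-encode the high-cardinality airport column:
# each airport is replaced by its share of all flights
freq = df["ORIGIN_AIRPORT"].value_counts(normalize=True)
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(freq)
```

Frequency encoding avoids the column explosion one-hot encoding would cause for hundreds of airports.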
### 4.2 New Features

- IS_WEEKEND: captures weekend travel differences
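One way to derive IS_WEEKEND from the dataset's YEAR/MONTH/DAY columns; this is a plausible reconstruction, not necessarily how the notebook computes it:

```python
import pandas as pd

# 2015-01-03 is a Saturday, 2015-01-05 a Monday
df = pd.DataFrame({"YEAR": [2015, 2015], "MONTH": [1, 1], "DAY": [3, 5]})

# Assemble a datetime, then flag Saturday (5) and Sunday (6)
dates = pd.to_datetime(df[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower))
df["IS_WEEKEND"] = (dates.dt.dayofweek >= 5).astype(int)
```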
### 4.3 Clustering

- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added the resulting cluster labels as a new feature: FLIGHT_CLUSTER
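A sketch of the clustering step on synthetic DISTANCE/SCHEDULED_TIME pairs. Scaling before K-Means is our assumption, but without it DISTANCE (hundreds of miles) would dominate SCHEDULED_TIME (minutes) in the Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic (DISTANCE in miles, SCHEDULED_TIME in minutes) pairs
X = np.column_stack([
    rng.uniform(100, 3000, size=200),
    rng.uniform(40, 400, size=200),
])

# Scale first so both features contribute comparably
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
flight_cluster = km.labels_  # becomes the FLIGHT_CLUSTER column
```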
### 4.4 Dimensionality Reduction

- Used a PCA projection to verify visually that the clusters form meaningful groups
### 4.5 Scaling

- Removed leakage and irrelevant fields
- Scaled the final 33 features using StandardScaler
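The scaling step itself is a one-liner; the sketch below uses two synthetic columns with very different ranges in place of the real 33 features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two columns with very different means and spreads
X = rng.normal(loc=[100.0, 5.0], scale=[50.0, 2.0], size=(300, 2))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has ~zero mean and unit variance
```

In a real pipeline the scaler is fit on the training split only and reused (via `transform`) on the test split, to avoid leaking test statistics.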

## Improved Regression Models

Three models were trained on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance: RMSE ≈ 9.04, R² ≈ 0.89
- This improves on the baseline because tree-based models capture non-linear relationships.
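A sketch of the three-model comparison on a deliberately non-linear synthetic target, illustrating why the tree ensembles pull ahead of the linear model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(600, 2))
# Non-linear target: a linear model cannot represent sin() or x**2 terms
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
scores = {
    name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
    for name, m in models.items()
}
```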

## Winning Regression Model + Deployment

The Gradient Boosting Regressor was selected as the winner and exported with pickle:

```python
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

The file was then uploaded to a dedicated Hugging Face model repository.
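The export can be sanity-checked with a pickle round trip before uploading. A tiny LinearRegression stands in for the actual GradientBoostingRegressor here to keep the sketch fast; the serialization mechanics are identical:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fit on an exact linear relation y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
best_model = LinearRegression().fit(X, y)

# Serialize, then reload to verify the round trip
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
with open("winning_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
```

A matching prediction from `reloaded` confirms the artifact is safe to publish.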
## From Regression to Classification

### 7.1 Creating Classes

Arrival delay was converted into 3 classes using quantile binning:

- Class 0: lowest 33% of delays
- Class 1: middle 33%
- Class 2: highest 33%

Why quantiles? Equal-frequency bins ensure balanced classes and avoid the distortions caused by the skewed delay distribution.
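Quantile binning maps directly to pandas' `pd.qcut`; a sketch on a synthetic skewed delay distribution:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Right-skewed delays, shifted so some flights arrive early
delays = pd.Series(rng.exponential(scale=20, size=900) - 5)

# Three equal-frequency bins: 0 = lowest third, 2 = highest third
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2])
counts = delay_class.value_counts(normalize=True)
```

An equal-width `pd.cut` on the same data would lump most flights into one bin, which is exactly the distortion quantile binning avoids.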
### 7.2 Class Balance Check

The class distribution remained balanced (≈ 32-36% per class), so:

- Accuracy is a meaningful metric
- Macro-F1 was also tracked to ensure fair performance across classes

## Classification Models

### 8.1 Precision vs. Recall

Both were evaluated, but recall was emphasized: misclassifying a high-delay flight as low-delay is more costly than the opposite error.
### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.
### 8.3 Evaluation

For each model we produced:

- A classification report (precision, recall, F1-score)
- A confusion matrix
- An analysis of error patterns

Best model by macro F1: Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.
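The per-model evaluation loop might look like the following sketch; `make_classification` supplies synthetic 3-class data in place of the engineered flight features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

report = classification_report(y_te, pred)   # per-class precision/recall/F1
cm = confusion_matrix(y_te, pred)            # rows = true class, cols = predicted
macro_f1 = f1_score(y_te, pred, average="macro")
```

Off-diagonal cells of `cm`, especially true class 2 predicted as class 0, are where the error-pattern analysis focuses.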

### 8.4 Exporting the Winning Classifier

```python
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

The classifier was uploaded to the same Hugging Face repository, as required.