Update README file.

README.md
# Flight Delay Prediction — Full Project (Parts 1-8)

## Overview
This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays. The work is structured into eight stages, following the assignment’s required workflow.
## 1. Dataset Overview

The dataset contains ~96K domestic U.S. flights. It includes:

- Scheduling information (`YEAR`, `MONTH`, `DAY`, `SCHEDULED_DEPARTURE`, etc.)
- Operational details (`DISTANCE`, `AIRLINE`, `ORIGIN_AIRPORT`, `DESTINATION_AIRPORT`)
- Delay-related causes (`WEATHER_DELAY`, `NAS_DELAY`, `LATE_AIRCRAFT_DELAY`, etc.)
- Target variable: `ARRIVAL_DELAY` (minutes)

**Goal:** Build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays (a “snowball effect”).
- Airlines differ strongly in punctuality.
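The monthly, hourly, and per-airline comparisons above boil down to simple pandas groupbys. The DataFrame below is a tiny illustrative stand-in for the ~96K-flight dataset, not the project's notebook code:

```python
import pandas as pd

# Tiny sample standing in for the real dataset; columns follow Part 1's naming.
df = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [800, 2100, 900, 2200, 700, 2300],  # HHMM format
    "AIRLINE": ["AA", "AA", "DL", "DL", "AA", "DL"],
    "ARRIVAL_DELAY": [5, 30, 2, 45, 10, 60],
})

# Derive the departure hour from the HHMM-style scheduled time.
df["DEP_HOUR"] = df["SCHEDULED_DEPARTURE"] // 100

monthly = df.groupby("MONTH")["ARRIVAL_DELAY"].mean()       # seasonality
hourly = df.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()     # "snowball effect"
by_airline = df.groupby("AIRLINE")["ARRIVAL_DELAY"].mean()  # punctuality gap
```

Even on this toy sample, the evening hours (21-23) show larger mean delays than the morning ones, which is the pattern the EDA surfaced at scale.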
## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., `DEPARTURE_DELAY`, `WHEELS_ON`).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, indicating no overfitting.
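The baseline workflow can be sketched as follows. The features here are synthetic stand-ins for the real pre-takeoff columns, and the resulting metrics are not the project's numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-in for leakage-free, pre-takeoff features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics used in the project.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, pred)
```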
## 4. Feature Engineering

Performed multiple transformations to enhance model performance.

### 4.1 Encoding

- One-Hot Encoding → `AIRLINE`
- Frequency Encoding → `ORIGIN_AIRPORT`, `DESTINATION_AIRPORT`

### 4.2 New Features

- `IS_WEEKEND` — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on `DISTANCE` and `SCHEDULED_TIME`
- Added new feature: `FLIGHT_CLUSTER`

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups

### 4.5 Scaling

- Removed leakage / irrelevant fields
- Scaled the final 33 features using `StandardScaler`
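The encoding, clustering, and scaling steps can be sketched on a toy frame. Assumptions here: `DAY_OF_WEEK` uses 6/7 for Saturday/Sunday, only `ORIGIN_AIRPORT` is frequency-encoded (the destination column is handled the same way), and k=2 replaces the project's k=4 because the sample is tiny:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA", "DL", "AA", "UA", "DL"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "JFK", "ORD", "ATL", "LAX", "ORD", "JFK"],
    "DAY_OF_WEEK": [1, 6, 7, 3, 5, 6, 2, 7],
    "DISTANCE": [250, 760, 2475, 740, 590, 2300, 800, 950],
    "SCHEDULED_TIME": [80, 130, 340, 125, 110, 320, 135, 160],
})

# 4.1 One-hot encode AIRLINE; frequency-encode the airport column.
df = pd.get_dummies(df, columns=["AIRLINE"])
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(
    df["ORIGIN_AIRPORT"].value_counts(normalize=True)
)
df = df.drop(columns=["ORIGIN_AIRPORT"])

# 4.2 IS_WEEKEND flag.
df["IS_WEEKEND"] = df["DAY_OF_WEEK"].isin([6, 7]).astype(int)

# 4.3 K-Means on DISTANCE and SCHEDULED_TIME -> FLIGHT_CLUSTER.
df["FLIGHT_CLUSTER"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    df[["DISTANCE", "SCHEDULED_TIME"]]
)

# 4.5 Scale the final feature matrix (33 features in the real project).
X_scaled = StandardScaler().fit_transform(df)
```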
## 5. Improved Regression Models

Trained three models on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance:
  - RMSE ≈ 9.04
  - R² ≈ 0.89
- This improves over the baseline because tree-based models capture non-linear relationships.
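Why tree-based models help can be seen on synthetic data with a non-linear target (illustrative only; nothing here comes from the project's dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
# Non-linear interaction that a linear model cannot represent.
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rmse = {}
for name, model in {
    "linear": LinearRegression(),
    "gbr": GradientBoostingRegressor(random_state=0),
}.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse[name] = mean_squared_error(y_te, pred) ** 0.5
```

On this data the boosted trees cut the test RMSE well below the linear model's, mirroring (in exaggerated form) the baseline → Gradient Boosting improvement above.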
## 6. Winning Regression Model + Deployment

- Selected the Gradient Boosting Regressor as the winner.
- Exported it using pickle:

```python
import pickle

with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

- Uploaded to a dedicated HuggingFace model repository.
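A quick round-trip check of the pickle export, using a small fitted stand-in since `best_model` (the tuned Gradient Boosting Regressor) lives in the project notebook:

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1

# Stand-in for the project's winning model.
best_model = LinearRegression().fit(X, y)

with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Anyone downloading the .pkl from the model repo restores it the same way.
with open("winning_model.pkl", "rb") as f:
    restored = pickle.load(f)
```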
## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest 33% of delays
- Class 1: middle 33%
- Class 2: highest 33%

Why? This ensures balanced classes and avoids distortions caused by skewed delay distributions.

### 7.2 Class Balance Check

Class distribution remained balanced (≈ 32–36% per class). Therefore:

- Accuracy is meaningful.
- Macro-F1 was also tracked to ensure fair performance across classes.
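Quantile binning of the kind described in 7.1 is one call to `pd.qcut`, which yields equal-sized classes even when the delay distribution is heavily right-skewed (the data below is a synthetic skewed stand-in for `ARRIVAL_DELAY`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed stand-in for ARRIVAL_DELAY (minutes): mostly near on-time, long right tail.
delays = pd.Series(rng.exponential(scale=15, size=900) - 5)

# Quantile binning into 3 equal-sized classes regardless of the skew.
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2]).astype(int)

shares = delay_class.value_counts(normalize=True)
```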
## 8. Classification Models

### 8.1 Precision vs. Recall

We evaluated both but emphasized recall, because misclassifying high-delay flights as low-delay is more costly than the opposite.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (macro F1): Logistic Regression. Even though it is simple, it produced the most balanced performance across all classes.

### 8.4 Exporting the Winning Classifier

```python
import pickle

with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

Uploaded the classifier to the same HuggingFace repository as required.
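The per-model evaluation described in 8.3 can be sketched as follows; the 3-class data is synthetic, standing in for the quantile-binned delay classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
# Synthetic 3-class target (0, 1, 2) loosely tied to the features.
y = (X[:, 0] + 0.5 * X[:, 1] > 0.4).astype(int) + (X[:, 0] - X[:, 2] > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

report = classification_report(y_te, pred)  # per-class precision/recall/F1
cm = confusion_matrix(y_te, pred)           # where errors concentrate
macro_f1 = f1_score(y_te, pred, average="macro")
```

The off-diagonal cells of `cm` are what the error-pattern analysis looks at; macro-F1 averages the per-class F1 scores so that no class dominates the score.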