
https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link

Flight Delay Prediction — Full Project (Parts 1-8) -

Overview -

  • This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays.
  • The work is structured into eight clear stages, following the assignment’s required workflow.
  1. Dataset Overview -
  • The dataset contains ~96K domestic U.S. flights. It includes:
  • Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
  • Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
  • Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
  • Target variable: ARRIVAL_DELAY (minutes)
  • Goal: Build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
  2. Exploratory Data Analysis (EDA) -
  • Main steps performed:
  • Checked missing values → Only a few columns contained missing values, and all were handled explicitly.
  • Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
  • Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
  • Compared delays between airlines.
  • Visualized relationship between distance and delay.
  • Key Findings:
  • Certain months show heavier congestion.
  • Evening flights have systematically higher delays (“snowball effect”).
  • Airlines differ strongly in punctuality.
  • Distance has almost no explanatory power.
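The EDA aggregations above (monthly congestion, time-of-day patterns, airline comparison) can be sketched as simple pandas group-bys. The `flights` DataFrame below is an illustrative stand-in; column names follow Section 1, and `SCHEDULED_DEPARTURE` is assumed to be in HHMM format.

```python
import pandas as pd

# Illustrative sample; the real dataset has ~96K rows with these columns.
flights = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [705, 1830, 900, 2015, 1200, 2300],  # HHMM
    "AIRLINE": ["AA", "DL", "AA", "DL", "AA", "DL"],
    "ARRIVAL_DELAY": [5, 40, -3, 55, 10, 70],
})

# Monthly congestion: mean arrival delay per month
monthly = flights.groupby("MONTH")["ARRIVAL_DELAY"].mean()

# Time-of-day pattern: derive the scheduled departure hour, then aggregate
flights["DEP_HOUR"] = flights["SCHEDULED_DEPARTURE"] // 100
hourly = flights.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()

# Airline punctuality comparison
by_airline = flights.groupby("AIRLINE")["ARRIVAL_DELAY"].mean()
```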
  3. Baseline Regression Model -
  • Steps completed:
  • Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON).
  • Used only information available before takeoff.
  • Trained a simple Linear Regression model.
  • Evaluated the model using MAE, MSE, RMSE, and R².
  • Results (Baseline):
  • RMSE ≈ 9.23 minutes
  • R² ≈ 0.88
  • Train and test scores were close, indicating no overfitting.
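A minimal sketch of the baseline workflow, on synthetic data rather than the actual flight features: fit a plain `LinearRegression` and compute MAE, MSE, RMSE, and R² with scikit-learn metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-takeoff feature matrix; the real project
# uses only features known before departure (no DEPARTURE_DELAY, WHEELS_ON).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)          # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)
```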
  4. Feature Engineering -
  • Performed multiple transformations to enhance model performance:
  • 4.1 Encoding -
  • One-Hot Encoding - AIRLINE
  • Frequency Encoding - ORIGIN_AIRPORT, DESTINATION_AIRPORT
  • 4.2 New Features -
  • IS_WEEKEND — captures weekend travel differences
  • 4.3 Clustering -
  • Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
  • Added new feature: FLIGHT_CLUSTER
  • 4.4 Dimensionality Reduction -
  • PCA visualization to validate that clusters form meaningful groups
  • 4.5 Scaling -
  • Removed leakage / irrelevant fields
  • Scaled the final 33 features using StandardScaler
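Steps 4.1–4.5 can be sketched as follows. The DataFrame and its values are hypothetical (column names follow Section 1), and k=2 is used for the tiny sample where the project uses k=4.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows; DAY_OF_WEEK uses 1 = Monday ... 7 = Sunday.
df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "JFK", "LAX", "ORD"],
    "DAY_OF_WEEK": [5, 6, 7, 1],
    "DISTANCE": [2475, 760, 2475, 1745],
    "SCHEDULED_TIME": [360, 140, 355, 265],
})

# 4.1 One-hot encode AIRLINE; frequency-encode the airports
df = pd.get_dummies(df, columns=["AIRLINE"])
freq = df["ORIGIN_AIRPORT"].value_counts(normalize=True)
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(freq)

# 4.2 IS_WEEKEND flag
df["IS_WEEKEND"] = df["DAY_OF_WEEK"].isin([6, 7]).astype(int)

# 4.3 K-Means on DISTANCE and SCHEDULED_TIME -> FLIGHT_CLUSTER
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["FLIGHT_CLUSTER"] = km.fit_predict(df[["DISTANCE", "SCHEDULED_TIME"]])

# 4.5 Scale the numeric features
scaled = StandardScaler().fit_transform(df[["DISTANCE", "SCHEDULED_TIME"]])
```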
  5. Improved Regression Models -
  • Trained three models on the engineered dataset:
  • Linear Regression (Improved)
  • Random Forest Regressor
  • Gradient Boosting Regressor
  • Results:
  • Gradient Boosting achieved best performance
  • RMSE ≈ 9.04
  • R² ≈ 0.89
  • This improves over the baseline because tree-based models capture non-linear relationships.
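The three-model comparison can be sketched on a synthetic non-linear target, where the tree ensembles should outperform the linear model for the reason stated above (the data here is illustrative, not the flight dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Non-linear target: a linear model cannot capture sin() or the squared term.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(600, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, mdl in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gboost", GradientBoostingRegressor(random_state=0)),
]:
    mdl.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, mdl.predict(X_te))
```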
  6. Winning Regression Model + Deployment -
  • Selected Gradient Boosting Regressor as winner.

  • Exported it using pickle:

```python
import pickle

with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```
  • Uploaded to a dedicated HuggingFace model repository.

  7. From Regression to Classification -
  • 7.1 Creating Classes -
  • Converted arrival delay into 3 classes using quantile binning:
  • Class 0: lowest 33% delays
  • Class 1: middle 33%
  • Class 2: highest 33%
  • Why?
  • This ensures balanced classes and avoids distortions caused by skewed delay distributions.
  • 7.2 Class Balance Check -
  • Class distribution remained balanced (≈ 32–36% per class).
  • Therefore:
  • Accuracy is meaningful
  • Also tracked macro-F1 to ensure fair performance across classes.
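The quantile binning in 7.1 and the balance check in 7.2 can be sketched with `pd.qcut`, which produces equal-frequency bins by construction (the delay values below are illustrative):

```python
import pandas as pd

# Illustrative arrival delays in minutes
delays = pd.Series([-10, -5, 0, 3, 8, 15, 25, 40, 90])

# Three equal-frequency bins: 0 = lowest third, 1 = middle, 2 = highest
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2])

# Balance check: each class holds about a third of the flights
counts = delay_class.value_counts(normalize=True)
```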
  8. Classification Models -
  • 8.1 Precision vs Recall -
  • We evaluated both but emphasized recall because misclassifying high-delay flights as low-delay is more costly than the opposite.
  • 8.2 Models Trained
  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosting Classifier
  • All trained on the engineered features.
  • 8.3 Evaluation
  • For each model:
  • Classification report (precision, recall, F1-score)
  • Confusion matrix
  • Analysis of error patterns
  • Best model (macro F1): Logistic Regression
  • Though simple, it produced the most balanced performance across all classes.
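The per-model evaluation in 8.3 can be sketched with scikit-learn's standard tools; the true/predicted labels below are illustrative stand-ins for one model's test-set output.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Illustrative true vs. predicted delay-tier labels (classes 0/1/2)
y_true = [0, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 1, 2, 2, 2, 0, 1, 1]

# Per-class precision, recall, F1 in one text report
report = classification_report(y_true, y_pred)

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)

# Macro F1 averages F1 equally across classes, as used to pick the winner
macro_f1 = f1_score(y_true, y_pred, average="macro")
```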
  • 8.4 Exporting the Winning Classifier
```python
import pickle

with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```
  • Uploaded the classifier to the same HF repository as required.