https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link

Flight Delay Prediction — Full Project (Parts 1-8) -

Overview -
- This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays.
- The work is structured into eight clear stages, following the assignment’s required workflow.

1. Dataset Overview -
- The dataset contains ~96K domestic U.S. flights.
It includes:
- Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
- Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
- Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)
- Goal: Build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.

2. Exploratory Data Analysis (EDA) -
- Main steps performed:
- Checked missing values → Only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized relationship between distance and delay.
- Key Findings:
- Certain months show heavier congestion.
- Evening flights have systematically higher delays (“snowball effect”).
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/rCsDltEV7JMkxObEGve69.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/CFyFBQyYKfMihX7Qyfq6Q.png)
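The missing-value check and the hourly delay pattern described above can be sketched in plain Python. This is only an illustration: the sample rows are hypothetical, and it assumes SCHEDULED_DEPARTURE is stored as an HHMM integer (for example 1805 for 18:05), which the README does not state. The project itself presumably used a dataframe library for these steps.

```python
from collections import defaultdict

# Hypothetical sample rows; the real dataset has ~96K flights.
flights = [
    {"SCHEDULED_DEPARTURE": 615, "ARRIVAL_DELAY": -4.0},
    {"SCHEDULED_DEPARTURE": 640, "ARRIVAL_DELAY": 2.0},
    {"SCHEDULED_DEPARTURE": 1805, "ARRIVAL_DELAY": 25.0},
    {"SCHEDULED_DEPARTURE": 1850, "ARRIVAL_DELAY": 35.0},
    {"SCHEDULED_DEPARTURE": 1910, "ARRIVAL_DELAY": None},  # missing target
]


def count_missing(rows):
    """Count None values per column; columns without gaps are omitted."""
    missing = defaultdict(int)
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
    return dict(missing)


def mean_delay_by_hour(rows):
    """Average ARRIVAL_DELAY per scheduled departure hour, skipping missing delays."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of delays, count]
    for row in rows:
        delay = row["ARRIVAL_DELAY"]
        if delay is None:
            continue
        hour = row["SCHEDULED_DEPARTURE"] // 100  # assumes HHMM integers
        totals[hour][0] += delay
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


missing_counts = count_missing(flights)
hourly_delay = mean_delay_by_hour(flights)
```

On real data, a rising mean from morning to evening hours would be the pattern behind the "snowball effect" finding.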

3. Baseline Regression Model -
- Steps completed:
- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².
- Results (Baseline):
- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train and test scores were close, which suggests the model is not overfitting.
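For reference, the four metrics listed above can be computed by hand as follows. This is a minimal sketch on hypothetical values, written in plain Python so the arithmetic is visible. It is not the project's evaluation code, which likely relied on a metrics library.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical arrival delays in minutes, not taken from the project.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(y_true, y_pred)
```

RMSE is in the same unit as the target (minutes), which is why it is the headline number here. R² is the share of variance in ARRIVAL_DELAY that the model explains.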

4. Feature Engineering -
- Performed multiple transformations to enhance model performance:
- 4.1 Encoding -
- One-Hot Encoding - AIRLINE
- Frequency Encoding - ORIGIN_AIRPORT, DESTINATION_AIRPORT
- 4.2 New Features -
- IS_WEEKEND — captures weekend travel differences
- 4.3 Clustering -
- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added new feature: FLIGHT_CLUSTER
- 4.4 Dimensionality Reduction -
- PCA visualization to validate that clusters form meaningful groups
- 4.5 Scaling -
- Removed leakage / irrelevant fields
- Scaled the final 33 features using StandardScaler
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/Vv6ExAplhy8ogI_wummhI.png)
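The encoding, IS_WEEKEND, and scaling steps above can be illustrated with a small plain-Python sketch. The sample rows, the year 2015, and the helper names are hypothetical. The standardization uses the population standard deviation, which is also what StandardScaler does.

```python
import statistics
from collections import Counter
from datetime import date

# Hypothetical rows; column names follow the README, values are made up.
rows = [
    {"YEAR": 2015, "MONTH": 1, "DAY": 3, "AIRLINE": "AA", "ORIGIN_AIRPORT": "JFK", "DISTANCE": 2475},
    {"YEAR": 2015, "MONTH": 1, "DAY": 4, "AIRLINE": "DL", "ORIGIN_AIRPORT": "ATL", "DISTANCE": 760},
    {"YEAR": 2015, "MONTH": 1, "DAY": 5, "AIRLINE": "AA", "ORIGIN_AIRPORT": "JFK", "DISTANCE": 2475},
    {"YEAR": 2015, "MONTH": 1, "DAY": 6, "AIRLINE": "UA", "ORIGIN_AIRPORT": "ORD", "DISTANCE": 1000},
]


def is_weekend(year, month, day):
    """1 for Saturday or Sunday, else 0 (weekday() is 0 for Monday)."""
    return int(date(year, month, day).weekday() >= 5)


def frequency_encode(values):
    """Replace each category with its relative frequency in the data."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def one_hot(values, prefix):
    """One 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


def standardize(values):
    """Zero mean, unit variance, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


weekend_flags = [is_weekend(r["YEAR"], r["MONTH"], r["DAY"]) for r in rows]
origin_freq = frequency_encode([r["ORIGIN_AIRPORT"] for r in rows])
airline_dummies = one_hot([r["AIRLINE"] for r in rows], prefix="AIRLINE")
distance_scaled = standardize([r["DISTANCE"] for r in rows])
```

In a real pipeline, the frequency table and the scaling statistics should be computed on the training split only and then applied to the test split, so that no test information leaks into the features.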

5. Improved Regression Models -
- Trained three models on the engineered dataset:
- Linear Regression (Improved)
- Random Forest Regressor
- Gradient Boosting Regressor
- Results:
- Gradient Boosting achieved best performance
- RMSE ≈ 9.04
- R² ≈ 0.89
- This is a modest improvement over the baseline (RMSE 9.23 → 9.04, R² 0.88 → 0.89), most likely because tree-based models can capture non-linear relationships.
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/caa8MPXWSZjFLKELJVjtR.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/5lxtX4NaMmHEPOhj-WF4X.png)

6. Winning Regression Model + Deployment -
- Selected Gradient Boosting Regressor as winner.
- Exported it using pickle:
- `with open("winning_model.pkl", "wb") as f: pickle.dump(best_model, f)`
  
- Uploaded to a dedicated HuggingFace model repository.
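A save-and-reload round trip is a cheap way to confirm the exported file is usable before uploading it. The sketch below uses a plain dictionary as a hypothetical stand-in for the fitted estimator, and the helper names `save_model` and `load_model` are illustrative. A real scikit-learn model is pickled the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the fitted model; any picklable object behaves the same.
best_model = {"name": "GradientBoostingRegressor", "params": {"n_estimators": 100}}


def save_model(model, path):
    """Serialize the model to disk in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled model; only do this with files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "winning_model.pkl"
    save_model(best_model, model_path)
    restored = load_model(model_path)
```

Two caveats apply to any pickle-based export: unpickling can execute arbitrary code, so files should only be loaded from trusted sources, and a pickled estimator generally needs the same library versions at load time as were used at save time.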

7. Regression to Classification -
- 7.1 Creating Classes -
- Converted arrival delay into 3 classes using quantile binning:
- Class 0: lowest 33% delays
- Class 1: middle 33%
- Class 2: highest 33%
- Why?
- This ensures balanced classes and avoids distortions caused by skewed delay distributions.
- 7.2 Class Balance Check -
- Class distribution remained balanced (≈ 32–36% per class).
- Therefore:
- Accuracy is meaningful
- Also tracked macro-F1 to ensure fair performance across classes
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/fgqVECFXrTjsvJpxoWFgd.png)
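Quantile binning into three classes, followed by the balance check, can be sketched as below. The delay values are hypothetical, and the cut points come from the standard library's `statistics.quantiles`. A dataframe-based workflow would typically use a quantile-cut helper instead, and may place the edges slightly differently.

```python
import statistics
from bisect import bisect_left
from collections import Counter

# Hypothetical arrival delays in minutes (negative means early arrival).
delays = [-12, -8, -5, -3, 0, 2, 5, 9, 15, 30, 60, 120]


def tertile_edges(values):
    """Two cut points that split the values into three equally sized groups."""
    return statistics.quantiles(values, n=3)


def to_class(value, edges):
    """0 = lowest third, 1 = middle third, 2 = highest third (right-closed bins)."""
    return bisect_left(edges, value)


edges = tertile_edges(delays)
classes = [to_class(d, edges) for d in delays]

# Class balance check: share of flights per class.
counts = Counter(classes)
shares = {c: counts[c] / len(classes) for c in sorted(counts)}
```

Because the edges are quantiles of the data itself, each class receives roughly a third of the flights even though the delay distribution is heavily right-skewed. To avoid leakage, the edges should be derived from the training split and reused on the test split.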

8. Classification Models -
- 8.1 Precision vs Recall -
- We evaluated both but emphasized recall, particularly for the high-delay class, because misclassifying high-delay flights as low-delay is more costly than the opposite.
- 8.2 Models Trained -
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
- All trained on the engineered features.
- 8.3 Evaluation -
- For each model:
- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns
- Best model (macro F1): Logistic Regression
- Even though simple, it produced the most balanced performance across all classes.
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/l8BoeRnFkX28fDwZrArNx.png)
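The per-class precision, recall, and F1 values behind a classification report, and the macro-F1 used to pick the best model, can be computed as in this sketch. The labels and predictions are hypothetical and the function names are illustrative.

```python
def per_class_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Precision, recall and F1 for each class, treating it as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Hypothetical labels: 0 = low delay, 1 = medium delay, 2 = high delay.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

scores = per_class_scores(y_true, y_pred)
overall = macro_f1(scores)
high_delay_recall = scores[2]["recall"]
```

In this toy example one high-delay flight is predicted as low-delay, which lowers recall for class 2 to 2/3. That is the costly error type named in 8.1, so per-class recall is worth reading alongside the macro average.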
- 8.4 Exporting the Winning Classifier -
- `with open("winning_classifier.pkl", "wb") as f: pickle.dump(best_cls_model, f)`
- Uploaded the classifier to the same HF repository as required.