[Project notebook (Google Drive)](https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link)

# Flight Delay Prediction — Full Project (Parts 1-8)

## Overview

This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays. The work is structured into eight clear stages, following the assignment's required workflow.

## 1. Dataset Overview

The dataset contains ~96K domestic U.S. flights. It includes:

- Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
- Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
- Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)

**Goal:** build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.

## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays ("snowball effect").
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/rCsDltEV7JMkxObEGve69.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/CFyFBQyYKfMihX7Qyfq6Q.png)

## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, so there was no overfitting.

## 4. Feature Engineering

Performed multiple transformations to enhance model performance.

### 4.1 Encoding

- One-Hot Encoding: AIRLINE
- Frequency Encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT

### 4.2 New Features

- IS_WEEKEND — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added a new feature: FLIGHT_CLUSTER

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups

### 4.5 Scaling

- Removed leakage / irrelevant fields
- Scaled the final 33 features using StandardScaler

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/Vv6ExAplhy8ogI_wummhI.png)

## 5. Improved Regression Models

Trained three models on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance
- RMSE ≈ 9.04
- R² ≈ 0.89

This improves on the baseline because tree-based models capture non-linear relationships.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/caa8MPXWSZjFLKELJVjtR.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/5lxtX4NaMmHEPOhj-WF4X.png)

## 6. Winning Regression Model + Deployment

Selected the Gradient Boosting Regressor as the winner and exported it using pickle:

```python
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

The model was uploaded to a dedicated Hugging Face model repository.

## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest 33% of delays
- Class 1: middle 33%
- Class 2: highest 33%

Why? This ensures balanced classes and avoids distortions caused by the skewed delay distribution.
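The quantile binning above can be sketched with pandas; the toy delay values below are illustrative, and the real project applies this to the ARRIVAL_DELAY column instead.

```python
import pandas as pd

# Toy arrival delays in minutes (illustrative values, not project data).
delays = pd.Series([-12, -5, 0, 3, 7, 11, 18, 25, 40, 65, 90, 120])

# pd.qcut splits on the empirical 33rd/67th percentiles, so each class
# gets roughly a third of the flights regardless of how skewed the
# delay distribution is (class 0 = lowest delays, class 2 = highest).
delay_class = pd.qcut(delays, q=3, labels=[0, 1, 2])

print(delay_class.value_counts().sort_index())
```

Fixed-width bins (`pd.cut`) would instead lump most flights into the lowest class, which is exactly the distortion quantile binning avoids.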
### 7.2 Class Balance Check

The class distribution remained balanced (≈ 32–36% per class). Therefore:

- Accuracy is meaningful
- Macro-F1 was also tracked to ensure fair performance across classes

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/fgqVECFXrTjsvJpxoWFgd.png)

## 8. Classification Models

### 8.1 Precision vs. Recall

We evaluated both but emphasized recall, because misclassifying high-delay flights as low-delay is more costly than the opposite.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (macro F1): Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/l8BoeRnFkX28fDwZrArNx.png)

### 8.4 Exporting the Winning Classifier

```python
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

The classifier was uploaded to the same Hugging Face repository, as required.
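A minimal end-to-end sketch of the export step: a stand-in Logistic Regression (the project's winning classifier) is fitted on synthetic data, pickled, and reloaded to verify the artifact round-trips before upload. The synthetic dataset and variable names are illustrative, not the project's.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the winning classifier; the real model is trained on
# the engineered flight features with 3 delay classes.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
best_cls_model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize exactly as in the project.
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)

# Reload and check predictions match, to confirm the pickle is valid
# before pushing it to the Hugging Face repository (e.g. with
# huggingface_hub's HfApi.upload_file).
with open("winning_classifier.pkl", "rb") as f:
    restored = pickle.load(f)

assert (restored.predict(X) == best_cls_model.predict(X)).all()
```

Round-tripping the pickle locally is a cheap sanity check that catches unpicklable attributes or version mismatches before deployment.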