https://drive.google.com/file/d/16rbiBsJlo9gm5-Mq28ctsXCEuzfLZA5E/view?usp=drive_link

# Flight Delay Prediction — Full Project (Parts 1–8)

## Overview

- This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays.
- The work is structured into eight stages, following the assignment's required workflow.

## 1. Dataset Overview

- The dataset contains ~96K domestic U.S. flights. It includes:
  - Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
  - Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
  - Delay-cause columns (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)
- Goal: build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.
## 2. Exploratory Data Analysis (EDA)

Main steps performed:

- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified the relevant delay-cause columns and analyzed their contributions (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly and hourly delays).
- Compared delays across airlines.
- Visualized the relationship between distance and delay.

Key findings:

- Certain months show heavier congestion.
- Evening flights have systematically higher delays (a "snowball effect" of delays accumulating through the day).
- Airlines differ strongly in punctuality.
- Distance has almost no explanatory power.
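The monthly and hourly aggregations above can be sketched with pandas. This is a minimal illustration on a toy stand-in DataFrame (the real data has ~96K rows), using the column names listed in Part 1:

```python
import pandas as pd

# Toy stand-in for the flights DataFrame (illustrative values only).
flights = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [800, 2000, 900, 2100, 700, 2200],
    "ARRIVAL_DELAY": [5, 30, -2, 45, 0, 60],
})

# Hour of day from the HHMM-style scheduled departure time.
flights["DEP_HOUR"] = flights["SCHEDULED_DEPARTURE"] // 100

# Average arrival delay per month and per departure hour.
monthly = flights.groupby("MONTH")["ARRIVAL_DELAY"].mean()
hourly = flights.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()

print(monthly)
print(hourly)
```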
## 3. Baseline Regression Model

Steps completed:

- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON) and used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):

- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, so there is no sign of overfitting.
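The baseline fit and its evaluation can be sketched as follows. The features here are a synthetic stand-in (not the project's actual pre-takeoff columns); the metric functions are the standard scikit-learn ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-takeoff features and the ARRIVAL_DELAY target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, pred)
print(mae, rmse, r2)
```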
## 4. Feature Engineering

Multiple transformations were applied to enhance model performance:

### 4.1 Encoding

- One-hot encoding: AIRLINE
- Frequency encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT
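A minimal sketch of the two encodings on a toy DataFrame (column names as in the dataset; in the real pipeline the frequency map would be computed on the training split only):

```python
import pandas as pd

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "JFK", "ORD"],
})

# One-hot encode the low-cardinality AIRLINE column.
df = pd.get_dummies(df, columns=["AIRLINE"], prefix="AIRLINE")

# Frequency-encode the high-cardinality airport column:
# each airport code is replaced by its share of all flights.
freq = df["ORIGIN_AIRPORT"].value_counts(normalize=True)
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(freq)
print(df)
```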
### 4.2 New Features

- IS_WEEKEND — captures weekend travel differences

### 4.3 Clustering

- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added a new feature: FLIGHT_CLUSTER

### 4.4 Dimensionality Reduction

- PCA visualization to validate that the clusters form meaningful groups
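The clustering and PCA steps can be sketched like this, on synthetic DISTANCE/SCHEDULED_TIME pairs (for illustration the PCA runs on the same two scaled features; the project may project a wider feature set):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy DISTANCE (miles) and SCHEDULED_TIME (minutes) pairs.
X = np.column_stack([
    rng.uniform(200, 3000, size=200),
    rng.uniform(40, 400, size=200),
])

X_scaled = StandardScaler().fit_transform(X)

# K-Means with k=4 assigns each flight a FLIGHT_CLUSTER label.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
flight_cluster = km.fit_predict(X_scaled)

# 2-D PCA projection to eyeball whether the clusters separate visually.
coords = PCA(n_components=2).fit_transform(X_scaled)
print(flight_cluster[:10], coords.shape)
```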
### 4.5 Scaling

- Removed leakage-prone and irrelevant fields
- Scaled the final 33 features with StandardScaler
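The scaling step itself is a one-liner; a toy sketch (in the real pipeline the scaler is fit on the training split only and then applied to the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different column scales.
X = np.array([[100.0, 1.0], [300.0, 3.0], [500.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column -> mean 0, std 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```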
## 5. Improved Regression Models

Three models were trained on the engineered dataset:

- Linear Regression (improved)
- Random Forest Regressor
- Gradient Boosting Regressor

Results:

- Gradient Boosting achieved the best performance: RMSE ≈ 9.04, R² ≈ 0.89.
- This improves on the baseline because tree-based models capture non-linear relationships that linear regression misses.
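The three-model comparison can be illustrated on synthetic data with a deliberately non-linear target, where the tree ensembles should beat plain linear regression (stand-in data, not the project's features or scores):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Non-linear target: linear regression cannot fit the quadratic term.
y = np.sin(X[:, 0]) * 3 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
rmse = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, m.predict(X_te)) ** 0.5
print(rmse)
```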
## 6. Winning Regression Model + Deployment

- Selected the Gradient Boosting Regressor as the winning model.
- Exported it with pickle:

```python
import pickle

# best_model is the fitted Gradient Boosting Regressor from Part 5.
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

- Uploaded the file to a dedicated Hugging Face model repository.
## 7. Regression → Classification

### 7.1 Creating Classes

Converted arrival delay into 3 classes using quantile binning:

- Class 0: lowest ~33% of delays
- Class 1: middle ~33%
- Class 2: highest ~33%

Why quantiles? This ensures balanced classes and avoids distortions caused by the skewed delay distribution.
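Quantile binning can be done with pandas `qcut`, which produces (near-)equal-sized classes even when the underlying delays are heavily skewed (toy values below):

```python
import pandas as pd

# Toy, right-skewed delay values in minutes.
delays = pd.Series([-10, -5, 0, 3, 8, 15, 25, 40, 90])

# Three quantile bins -> labels 0/1/2, each covering ~a third of flights.
classes = pd.qcut(delays, q=3, labels=[0, 1, 2])
print(classes.value_counts().sort_index())
```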
### 7.2 Class Balance Check

- Class distribution remained balanced (≈ 32–36% per class), so plain accuracy is a meaningful metric.
- Macro-F1 was also tracked to ensure fair performance across all three classes.
## 8. Classification Models

### 8.1 Precision vs. Recall

Both were evaluated, but recall was emphasized: misclassifying a high-delay flight as low-delay is more costly than the opposite error.

### 8.2 Models Trained

- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation

For each model:

- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (by macro-F1): Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.
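The evaluation loop for one model can be sketched as follows, here Logistic Regression on a synthetic 3-class problem standing in for the binned delay classes (not the project's actual features or scores):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class dataset as a stand-in for the delay classes.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-class precision/recall/F1, the confusion matrix, and macro-F1.
macro_f1 = f1_score(y_te, pred, average="macro")
print(classification_report(y_te, pred))
print(confusion_matrix(y_te, pred))
print(macro_f1)
```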
### 8.4 Exporting the Winning Classifier

```python
import pickle

# best_cls_model is the fitted Logistic Regression from Part 8.3.
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

- Uploaded the classifier to the same Hugging Face repository, as required.