Update README.md

59eb913 verified about 1 month ago

13.9 kB


	## 🎥 Project Video Walkthrough

	<video controls width="720">
	<source src="https://huggingface.co/maorsoul/flight-delay-predictor/resolve/main/video1534610661.mp4" type="video/mp4">
	Your browser does not support the video tag.
	</video>

	# ✈️ Flight Delay Predictor

	## 📌 Dataset Overview

	For this project, I worked with the 2018 US Flight Delays & Cancellations dataset.
	This dataset contains detailed information about over 7 million domestic flights in the United States, including:

	* Flight dates and times
	* Departure and arrival delays
	* Airline carrier codes
	* Origin and destination airports
	* Distance and air time
	* Cancellation and diversion information
	* Various time-related features (month, day, day of week, scheduled times, etc.)

	To keep the project computationally manageable, I selected a random sample of 20,000 rows from the full dataset.
	This sample size still preserves meaningful variation in delays, airlines, and airports, allowing for effective modeling without heavy computation.

	Main target variable:
	`ArrDelay` – the arrival delay in minutes.
	This continuous variable was used first for a regression problem, and later converted into classes for a classification task.

	Goal of the project:

	1. Predict arrival delay using regression models.
	2. Reframe the problem into classification (high delay vs. low delay).
	3. Compare models and deploy the best-performing classifier/regressor to HuggingFace.

	The project walks through the full ML process:

	* Data loading & cleaning
	* EDA
	* Feature engineering
	* Model training
	* Evaluation
	* Selecting a winner
	* Exporting the model


	# 📊 2. Exploratory Data Analysis (EDA)

	In this section we explored:

	* Total rows, columns
	* Data types
	* Missing values
	* Basic statistical patterns
	* Target variable behavior before classification

	Main actions performed:

	* Loaded 20,000 rows from the 2018 dataset
	* Removed irrelevant fields (like tail IDs)
	* Verified missing values and cleaned them
	* Verified numerical ranges to detect odd values
	* Converted original delay (`ArrDelay`) into the classification target `y_class`
	* Split into 80% train, 20% test

	⬇️

	![צילום מסך 2025-11-29 ב-9.37.33](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/H5TkmTdvamGzCX3tnkWbK.png)

	![צילום מסך 2025-11-29 ב-9.38.47](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ATuj1DhNFu4IOKADBVvfT.png)

	![צילום מסך 2025-11-29 ב-9.39.05](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/uVzJysvmUNKrI6dGyDxJU.png)

	![צילום מסך 2025-11-29 ב-9.39.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/dFULLmyeCowD54qkHMV3J.png)

	![צילום מסך 2025-11-29 ב-9.39.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/ecNxcebQio2SOgl63r93a.png)

	![צילום מסך 2025-11-29 ב-9.39.57](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-G9t6hG5_-q9pBHxqN7rT.png)

	![צילום מסך 2025-11-29 ב-9.40.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cL1WcEpM2edSFbPHKSiUo.png)

	![צילום מסך 2025-11-29 ב-9.40.19](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/oY-wiihgzmlMtMvqzIZFK.png)

	![צילום מסך 2025-11-29 ב-9.40.29](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/cYOZt4qv4fOxWr8RxQkfu.png)


	### Insert dataset head or summary as an image


	![צילום מסך 2025-11-29 ב-9.41.55](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/OzNiiXLyr8accJYlArisL.png)


	# 🔍 3. Baseline Model

	In this phase we studied the patterns behind delay behavior.

	### What we analyzed:

	* Distribution of arrival delays
	Helps understand skew, outliers, and how reasonable our classification threshold is.

	* Correlation between numerical features
	Found that distance and scheduled times impact delays but not extremely strongly.

	* Delay behavior by airline
	Some airlines have significantly more variability in delays.

	* Time of day vs delay
	Late-day flights tend to accumulate more delays.

	* Outlier detection using Z-score
	Removed unrealistic delays > ±3 standard deviations.

	### Why it matters:

	EDA allowed us to understand which features influence delays and how noisy the data is.
	This guided feature engineering and reduced overfitting risk.

	⬇️

	### Place graphs here

	![צילום מסך 2025-11-29 ב-9.44.08](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/M3ETCvLN0Rf_AItthFbw3.png)


	![צילום מסך 2025-11-29 ב-9.44.28](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-m8eDsCZWlH-AxGvrtZgS.png)


	![צילום מסך 2025-11-29 ב-9.44.41](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/rXCdxRxcnJap6b9U28-d9.png)


	# 🛠️ 4. Feature Engineering

	Feature engineering was critical for improving model quality.

	### Done in this step:

	#### 1. One-Hot Encoding for categorical features

	* Airline
	* Origin airport
	* Destination airport
	* Day of Week
	* Cancellation field

	This expanded the dataset into thousands of columns but preserved categorical meaning.

	#### 2. Scaling important numerical fields

	* Distance
	* CRSDepTime
	* CRSArrTime
	* AirTime

	Scaling prevents models like Logistic Regression and Gradient Boosting from being biased by large numeric ranges.

	#### 3. PCA (optional)

	Used only for visualization; helped validate that the classes are somewhat separable.

	#### 4. K-Means clustering (optional exploratory step)

	Cluster labels added as an experimental feature to see if they help models (they had mild impact).

	⬇️

	### Place FE graphs here

	![צילום מסך 2025-11-29 ב-9.45.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/T7pjOhFJL1Zn54OroFK9T.png)


	![צילום מסך 2025-11-29 ב-9.45.26](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/Tq6yVLGH-w8tLty1rbNQG.png)


	# 🤖 5. Models Trained

	We compared three supervised classification models:

	### ✔ Logistic Regression

	* Simple baseline
	* Fast, linear, interpretable
	* Surprisingly produced perfect predictions (overfitting to clean, thresholded labels)

	### ✔ Random Forest Classifier

	* Non-linear
	* Handles high-dimensional data
	* Good but struggled with high-delay recall

	### ✔ Gradient Boosting Classifier

	* Ensemble of weak learners
	* Best real-world performance
	* Most balanced precision–recall
	* Strong against noise
	* Best generalization to unseen data

	⬇️

	### Insert models summary image

	![צילום מסך 2025-11-29 ב-9.45.46](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/LWGxTGHU-gYW2QRFOjhm_.png)

	![צילום מסך 2025-11-29 ב-9.46.01](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/WelQ-fbqravyTW1nYv4pW.png)

	![צילום מסך 2025-11-29 ב-9.46.11](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/kA863njS1KJj4ZvvIFvAq.png)


	# 🏆 6. Winning Model

	The selected model is:

	# 🌟 Gradient Boosting Classifier

	### Why this one?

	* Best tradeoff between false positives and false negatives
	* Highest real F1-score
	* Handles imbalanced patterns better
	* Robust to feature noise and outliers
	* Most realistic generalization

	## 7. Regression-to-Classification

	### 7.1 Creating Classes from the Numeric Target (Median Split)

	In this part we reframed the original regression target ArrDelay into a
	binary classification target.

	We computed the median arrival delay on the training set (≈ −5 minutes) and
	used it as a threshold:

	- Class 0 – Low delay: `ArrDelay < median`
	(flight is on time or earlier than a typical flight in the dataset).
	- Class 1 – High delay: `ArrDelay ≥ median`
	(flight is more delayed than a typical flight).

	The same rule was applied to both train and test targets, using the **same
	engineered features** as in the regression part.
	This keeps the classification task aligned with the original question:
	> “How large will the arrival delay be?”
	now phrased as
	> “Will this flight have a higher-than-typical delay or not?”


	### 7.2 Checking Class Balance

	After creating the classes, we examined their distribution:

	- Training set:
	about 50.6% High delay (Class 1) and 49.4% Low delay (Class 0).
	- Test set:
	about 51.3% Low delay (Class 0) and 48.7% High delay (Class 1).

	The classes are therefore well balanced, and no class is clearly
	under-represented.

	Because of this balance, accuracy is already informative, but to avoid
	being misled in edge cases and to keep the focus on the “High delay” class,
	we mainly compared models using the F1-score (which combines precision and
	recall for the positive class).

	👉 *Here I will insert a bar plot (or table screenshot) of the class
	distribution in train/test.*

	![צילום מסך 2025-11-29 ב-9.55.24](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/fSo9lYbeNOK6_8qrBtFRc.png)

	![צילום מסך 2025-11-29 ב-9.55.39](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/-kh6L76mQaE9tJv4nymxA.png)

	## 8. Train & Evaluate Classification Models

	### 8.1 Precision vs. Recall — What Matters More?

	In the context of predicting high-delay flights, recall for the positive class is more important than precision.

	The reason:
	Missing a truly delayed flight (false negative) is operationally worse than mistakenly flagging
	an on-time flight as delayed (false positive).
	A missed severe delay can lead to missed connections, poor customer experience, and scheduling disruptions,
	while a false alarm only causes minor adjustments like extra buffer time.

	---

	### 8.1 False Positives vs. False Negatives — Which Is Worse?

	- A false positive means predicting “high delay” when the flight is actually low-delay.
	- A false negative means predicting “low delay” when the flight is actually highly delayed.

	In our task, false negatives are more critical, because they leave planners unprepared for major delays.
	False positives are less harmful — they may cause unnecessary caution, but do not create operational failures.

	---

	### 8.2 Training Three Classification Models

	We trained and evaluated three different models from scikit-learn, using the same engineered features
	and the binary target created in Part 7:

	1. Logistic Regression
	2. Random Forest Classifier
	3. Gradient Boosting Classifier

	👉 Insert model training diagram or screenshots of code here (optional).

	![צילום מסך 2025-11-29 ב-9.57.44](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/vHzlqE8vnf7tRBxACgY-V.png)

	![צילום מסך 2025-11-29 ב-9.57.59](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o4D5WklP1INIFBvvubdf3.png)

	![צילום מסך 2025-11-29 ב-9.58.14](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/o9L76PgbO7hWmyEZIfQHL.png)

	![צילום מסך 2025-11-29 ב-9.58.25](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/5BaZaCtq0RDU4Sg_kAneC.png)

	![צילום מסך 2025-11-29 ב-9.58.36](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/0YggL_58zalfn50WokKf0.png)

	![צילום מסך 2025-11-29 ב-9.58.48](https://cdn-uploads.huggingface.co/production/uploads/69183cd79510e3441ef86afc/S896FbQSOUX4pQTl5ym4d.png)


	### 8.3 Model Evaluation

	For each model we generated:

	- `classification_report` (precision, recall, F1-score, support)
	- Confusion matrix
	- Interpretation of the types of errors the model makes

	Below is a summary of the results:

	#### Logistic Regression
	- Achieved perfect classification on the test set (F1 = 1.00).
	- The confusion matrix shows 0 errors.
	- This suggests the engineered features were highly separable.


	#### Random Forest Classifier
	- F1-score ≈ 0.79
	- Stronger recall for Class 0 (low delay), weaker for Class 1 (high delay).
	- Confusion matrix shows the model tends to miss high-delay flights (false negatives).


	#### Gradient Boosting Classifier
	- F1-score ≈ 0.85
	- Better balance between precision and recall compared to Random Forest.
	- Fewer false negatives than Random Forest and more consistent performance overall.


	### 8.3 Which Model Performs Best — and Why?

	The best model is the Logistic Regression, because:

	- It achieves perfect predictive performance on this dataset.
	- It cleanly separates the engineered feature space into the two classes.
	- It avoids the false negatives that are most critical in this task.
	- Its confusion matrix shows zero misclassifications.

	While this may indicate a highly separable dataset rather than model superiority alone,
	within the scope of this assignment it is the clear winner.

	---

	### 8.4 Winner: Exporting and Uploading the Model

	We exported the winning model (Logistic Regression) to a pickle file and uploaded it to the HuggingFace repository:

	- File: `winning_classifier_model.pkl`
	- Stored alongside the earlier regression winning model file:
	- `winning_model.pkl`

	Both files live in the same HuggingFace model repository as required.



	# 🎥 9. Video Presentation

	Your recording should include:

	* Quick dataset overview
	* Key EDA takeaways
	* How you encoded and engineered features
	* Explanation of each model
	* Confusion matrices
	* Why Gradient Boosting won
	* Summary of lessons learned