# Flight Delay Prediction — Full Project (Parts 1-8)

## Overview
This project analyzes U.S. domestic flight data and builds both regression and classification models to predict flight arrival delays. The work is structured into eight clear stages, following the assignment's required workflow.

## 1. Dataset Overview
The dataset contains ~96K domestic U.S. flights. It includes:
- Scheduling information (YEAR, MONTH, DAY, SCHEDULED_DEPARTURE, etc.)
- Operational details (DISTANCE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT)
- Delay-related causes (WEATHER_DELAY, NAS_DELAY, LATE_AIRCRAFT_DELAY, etc.)
- Target variable: ARRIVAL_DELAY (minutes)

Goal: build predictive models to estimate arrival delay and explore the key operational factors affecting punctuality.

## 2. Exploratory Data Analysis (EDA)
Main steps performed:
- Checked missing values → only a few columns contained missing values, and all were handled explicitly.
- Identified relevant delay columns and analyzed their contribution (weather, NAS, late aircraft).
- Examined seasonality and time-of-day patterns (monthly delays, hourly delays).
- Compared delays between airlines.
- Visualized the relationship between distance and delay.

Key findings:
- Certain months show heavier congestion.
- Evening flights have systematically higher delays (“snowball effect”).
- Airlines differ strongly in punctuality.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/CFyFBQyYKfMihX7Qyfq6Q.png)
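The group-by comparisons above (monthly, hourly, per-airline delays) can be sketched with pandas. This is a minimal illustration on an invented toy DataFrame; the column names follow the dataset, but the values and airlines are made up:

```python
import pandas as pd

# Toy stand-in for the flights dataset (the real one has ~96K rows).
df = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12, 12],
    "SCHEDULED_DEPARTURE": [800, 2000, 900, 2100, 1000, 2200],  # HHMM format
    "AIRLINE": ["AA", "DL", "AA", "DL", "AA", "DL"],
    "ARRIVAL_DELAY": [5.0, 30.0, -2.0, 45.0, 10.0, 60.0],
})

# Missing-value check.
print(df.isna().sum())

# Seasonality: mean delay per month.
monthly = df.groupby("MONTH")["ARRIVAL_DELAY"].mean()

# Time of day: derive the scheduled hour, then average delay per hour.
df["DEP_HOUR"] = df["SCHEDULED_DEPARTURE"] // 100
hourly = df.groupby("DEP_HOUR")["ARRIVAL_DELAY"].mean()

# Punctuality by airline; the evening "snowball effect" shows up in `hourly`.
by_airline = df.groupby("AIRLINE")["ARRIVAL_DELAY"].mean().sort_values()
print(by_airline)
```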

## 3. Baseline Regression Model
Steps completed:
- Removed leakage features (e.g., DEPARTURE_DELAY, WHEELS_ON).
- Used only information available before takeoff.
- Trained a simple Linear Regression model.
- Evaluated the model using MAE, MSE, RMSE, and R².

Results (baseline):
- RMSE ≈ 9.23 minutes
- R² ≈ 0.88
- Train/test scores were close, indicating no overfitting.
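The baseline fit-and-evaluate loop can be sketched with scikit-learn. Synthetic data only, so the metric values will not match the report's; the point is the MAE/MSE/RMSE/R² computation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic pre-takeoff features and a linear-ish delay target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # the real data gave RMSE ≈ 9.23 minutes
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```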

## 4. Feature Engineering
Multiple transformations were performed to enhance model performance:

### 4.1 Encoding
- One-Hot Encoding: AIRLINE
- Frequency Encoding: ORIGIN_AIRPORT, DESTINATION_AIRPORT

### 4.2 New Features
- IS_WEEKEND — captures weekend travel differences

### 4.3 Clustering
- Applied K-Means (k=4) on DISTANCE and SCHEDULED_TIME
- Added a new feature: FLIGHT_CLUSTER

### 4.4 Dimensionality Reduction
- PCA visualization to validate that the clusters form meaningful groups

### 4.5 Scaling
- Removed leakage / irrelevant fields
- Scaled the final 33 features using StandardScaler

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/Vv6ExAplhy8ogI_wummhI.png)
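The encoding, clustering, and scaling steps above can be sketched as one small pipeline. The DataFrame below is a toy stand-in (real pipeline produced 33 features; here k=2 instead of the project's k=4 because the toy set is tiny):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA", "DL", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "LAX", "JFK", "ORD", "LAX", "JFK"],
    "DAY_OF_WEEK": [1, 6, 7, 3, 5, 6],  # 1 = Monday ... 7 = Sunday
    "DISTANCE": [2475, 2475, 733, 1745, 867, 740],
    "SCHEDULED_TIME": [360, 355, 150, 255, 165, 155],
})

# 4.1 One-hot encode the low-cardinality AIRLINE column;
#     frequency-encode the high-cardinality airport column.
df = pd.get_dummies(df, columns=["AIRLINE"])
freq = df["ORIGIN_AIRPORT"].value_counts(normalize=True)
df["ORIGIN_FREQ"] = df["ORIGIN_AIRPORT"].map(freq)

# 4.2 IS_WEEKEND derived from the day of week.
df["IS_WEEKEND"] = df["DAY_OF_WEEK"].isin([6, 7]).astype(int)

# 4.3 K-Means on DISTANCE and SCHEDULED_TIME -> FLIGHT_CLUSTER label.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["FLIGHT_CLUSTER"] = km.fit_predict(df[["DISTANCE", "SCHEDULED_TIME"]])

# 4.5 Scale the numeric feature matrix.
num_cols = df.columns.drop("ORIGIN_AIRPORT")
X = StandardScaler().fit_transform(df[num_cols])
print(X.shape)
```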
58
 
59
  5. Improved Regression Models -
60
- Trained three models on the engineered dataset:
61
  - Linear Regression (Improved)
62
  - Random Forest Regressor
63
  - Gradient Boosting Regressor
64
- Results:
65
- Gradient Boosting achieved best performance
66
  - RMSE ≈ 9.04
67
  - R² ≈ 0.89
68
- This improves over the baseline because tree-based models capture non-linear relationships.
69
  ![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/caa8MPXWSZjFLKELJVjtR.png)
70
  ![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/5lxtX4NaMmHEPOhj-WF4X.png)
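The three-model comparison can be sketched as below. This is not the project's training code: the data is synthetic and deliberately non-linear, so the tree ensembles should beat the linear model here for the same reason they did on the real flights:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 3))
# Non-linear target: a linear model cannot represent sin(.) or x^2 terms.
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: R2 = {r2:.3f}")
```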
71
 
72
  6. Winning Regression Model + Deployment -
73
  - Selected Gradient Boosting Regressor as winner.
74
  - Exported it using pickle:
75
- with open("winning_model.pkl", "wb") as f:
76
- pickle.dump(best_model, f)
 
77
  - Uploaded to a dedicated HuggingFace model repository.
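Consumers of the exported file load it back with `pickle.load` and predict. A minimal round-trip sketch (a `LinearRegression` stands in for the real Gradient Boosting winner so the example is self-contained):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the trained winner (the real model is a GradientBoostingRegressor).
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 2.0, 4.0])
best_model = LinearRegression().fit(X, y)

# Export exactly as in the snippet above.
with open("winning_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Reload and verify the predictions survive the round trip.
with open("winning_model.pkl", "rb") as f:
    loaded = pickle.load(f)

assert np.allclose(loaded.predict(X), best_model.predict(X))
```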
78
 
79
  7. Regression - Classification -
80
- 7.1 Creating Classes -
81
- Converted arrival delay into 3 classes using quantile binning:
82
  - Class 0: lowest 33% delays
83
  - Class 1: middle 33%
84
  - Class 2: highest 33%
85
- Why?
86
- This ensures balanced classes and avoids distortions caused by skewed delay distributions.
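Quantile binning of this kind can be done with `pandas.qcut`, which splits on empirical quantiles so each class gets roughly a third of the flights regardless of how skewed the delay distribution is. A toy sketch (the delay values are invented):

```python
import pandas as pd

delays = pd.Series([-10, -5, 0, 2, 5, 8, 15, 40, 120])  # minutes, toy values

# Tercile binning: cut points are the 33rd and 67th percentiles of the data
# itself, so heavy right-skew (a few huge delays) cannot unbalance the classes.
classes = pd.qcut(delays, q=3, labels=[0, 1, 2]).astype(int)
print(classes.value_counts().sort_index())
```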

### 7.2 Class Balance Check
- Class distribution remained balanced (≈ 32–36% per class).
- Therefore accuracy is meaningful, and macro-F1 was also tracked to ensure fair performance across classes.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/fgqVECFXrTjsvJpxoWFgd.png)

## 8. Classification Models

### 8.1 Precision vs. Recall
We evaluated both but emphasized recall, because misclassifying high-delay flights as low-delay is more costly than the opposite.

### 8.2 Models Trained
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier

All were trained on the engineered features.

### 8.3 Evaluation
For each model we produced:
- Classification report (precision, recall, F1-score)
- Confusion matrix
- Analysis of error patterns

Best model (macro F1): Logistic Regression. Despite its simplicity, it produced the most balanced performance across all classes.

![image](https://cdn-uploads.huggingface.co/production/uploads/690cf480f5e17706452c5d7c/l8BoeRnFkX28fDwZrArNx.png)
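The per-model evaluation above maps directly onto scikit-learn's metric helpers. A sketch on invented labels (0 = low, 1 = medium, 2 = high delay); note that macro-F1 averages per-class F1 scores with equal weight, so a model cannot hide a weak class behind a strong one:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Toy true/predicted class labels, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 0, 1]

# Classification report: precision, recall, and F1 per class.
print(classification_report(y_true, y_pred, digits=3))

# Confusion matrix: rows are true classes, columns are predictions.
print(confusion_matrix(y_true, y_pred))

# Macro-F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro F1 = {macro_f1:.3f}")
```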

### 8.4 Exporting the Winning Classifier

```python
with open("winning_classifier.pkl", "wb") as f:
    pickle.dump(best_cls_model, f)
```

Uploaded the classifier to the same Hugging Face repository as required.