Create README.md
README.md
ADDED
|
@@ -0,0 +1,80 @@
| 1 |
+
Detailed Project Workflow (Parts 1-8)
|
| 2 |
+
Part 1: Data Loading & Initial Inspection
|
| 3 |
+
Action: Loading the raw flight dataset and performing an initial check using .info() and .describe().
|
| 4 |
+
|
| 5 |
+
Goal: Understanding the data types, identifying the number of features, and getting a sense of the scale of the flight delays.
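
The inspection step can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the inline CSV and the column names (`Flight Distance`, `Departure Delay`, `Arrival Delay`) are assumptions standing in for the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw flight CSV; the real file and columns may differ.
raw_csv = io.StringIO(
    "Flight Distance,Departure Delay,Arrival Delay\n"
    "460,25,18\n"
    "235,1,6\n"
    "1142,0,\n"  # the trailing empty field is parsed as a missing value (NaN)
)
df = pd.read_csv(raw_csv)

df.info()                # column dtypes and non-null counts per column
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

A `count` lower than the number of rows in `describe()` is usually the first hint that a column contains missing values.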
|
| 6 |
+
|
| 7 |
+
Part 2: Exploratory Data Analysis (EDA)
|
| 8 |
+
Action: Visualizing relationships between features like Departure Delay, Distance, and our target Arrival Delay.
|
| 9 |
+
|
| 10 |
+
Goal: Identifying which factors have the strongest influence on flight delays.
|
| 11 |
+
|
| 12 |
+
📸 [PLOT REQUIRED]: Correlation Heatmap. Insert the heatmap here to show how Departure Delay is the strongest predictor of Arrival Delay.
|
| 13 |
+
|
| 14 |
+
📸 [PLOT REQUIRED]: Histograms. Insert distributions of delay minutes to show the "Long Tail" of the data.
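
As a rough sketch of the correlation check behind the heatmap, the snippet below ranks features by their Pearson correlation with the target. The data is a toy example with assumed column names, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data with hypothetical column names; values are illustrative only.
df = pd.DataFrame({
    "Departure Delay": [0, 5, 20, 60, 120],
    "Flight Distance": [300, 2500, 800, 1500, 400],
    "Arrival Delay":   [0, 3, 25, 55, 130],
})

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["Arrival Delay"].drop("Arrival Delay")
print(corr_with_target.sort_values(ascending=False))
```

The full matrix returned by `df.corr()` is what a heatmap function (for example seaborn's `heatmap`) would typically be given to draw the plot.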
|
| 15 |
+
|
| 16 |
+
Part 3: Data Cleaning
|
| 17 |
+
Action: Handling missing values (NaNs) and removing redundant columns that don't contribute to prediction.
|
| 18 |
+
|
| 19 |
+
Goal: Ensuring the dataset is "clean" so the models don't learn from noise or missing information.
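
One possible shape of the cleaning step is shown below. Which columns were actually dropped and how NaNs were treated is not stated here, so the `id` column and the choice to drop rows with a missing target are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],  # hypothetical identifier column with no predictive value
    "Departure Delay": [0, 12, 30, 5],
    "Arrival Delay": [0.0, 10.0, np.nan, 7.0],
})

# Remove the redundant column.
df = df.drop(columns=["id"])

# One option for missing target values: drop the affected rows.
df = df.dropna(subset=["Arrival Delay"]).reset_index(drop=True)
print(df)
```

Imputing the missing value (for example with the median) is an alternative when dropping rows would discard too much data.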
|
| 20 |
+
|
| 21 |
+
Part 4: Handling Outliers
|
| 22 |
+
Action: Identifying and treating extreme values in delays and distances.
|
| 23 |
+
|
| 24 |
+
Goal: Preventing "extreme" flights (e.g., 10-hour delays) from biasing the model and ruining the predictions for the majority of standard flights.
|
| 25 |
+
|
| 26 |
+
📸 [PLOT REQUIRED]: Boxplots. Insert "Before & After" boxplots here to show how we handled the outliers.
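
The exact outlier treatment is not specified above. A common choice, sketched here as an assumption, is to clip values to the 1.5 × IQR fences that boxplots also use for their whiskers.

```python
import pandas as pd

# Toy delays in minutes; 600 plays the role of a 10-hour delay.
delays = pd.Series([0, 2, 5, 8, 10, 12, 15, 600], name="Arrival Delay")

# Tukey fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = delays.quantile(0.25), delays.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip instead of dropping, so no rows are lost.
clipped = delays.clip(lower=lower, upper=upper)
print(clipped.tolist())
```

Clipping keeps every flight in the dataset while limiting how far a single extreme value can pull the model.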
|
| 27 |
+
|
| 28 |
+
Part 5: Feature Engineering - The "Travel Profile"
|
| 29 |
+
Action: This is a core part of our research. We combined several categorical variables to create a unique Travel Profile for each passenger (e.g., combining 'Purpose of Travel' with 'Flight Distance').
|
| 30 |
+
|
| 31 |
+
Goal: To capture the specific behavior of different types of travelers, which adds a layer of "human behavior" to the technical flight data.
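
A minimal sketch of how such a profile could be built is shown below. The distance cut points and band labels are illustrative guesses, not the values used in the project.

```python
import pandas as pd

df = pd.DataFrame({
    "Purpose of Travel": ["Business", "Personal", "Business"],
    "Flight Distance": [350, 1800, 4200],
})

# Bin the numeric distance into categories; the bin edges are hypothetical.
distance_band = pd.cut(
    df["Flight Distance"],
    bins=[0, 1000, 3000, float("inf")],
    labels=["Short", "Medium", "Long"],
)

# Join the two categorical values into a single profile label.
df["Travel Profile"] = df["Purpose of Travel"] + "_" + distance_band.astype(str)
print(df["Travel Profile"].tolist())
```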
|
| 32 |
+
|
| 33 |
+
Part 6: Final Preprocessing (Encoding & Scaling)
|
| 34 |
+
Action: 1. One-Hot Encoding: Converting the Travel Profile and other categories into numerical format.
|
| 35 |
+
2. Standard Scaling: Standardizing numerical features (like Distance) to zero mean and unit variance so they are on a comparable scale.
|
| 36 |
+
|
| 37 |
+
Goal: Preparing the data for the algorithms. Scale-sensitive models like Logistic Regression generally train more reliably when all numeric inputs are in a similar range.
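
The two preprocessing steps can be combined in scikit-learn roughly as follows. This is a sketch under assumptions: the column names and toy values are hypothetical, and the project may have applied the encoder and scaler separately.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "Travel Profile": ["Business_Short", "Personal_Long", "Business_Short", "Personal_Long"],
    "Flight Distance": [300.0, 4000.0, 500.0, 3600.0],
})

preprocess = ColumnTransformer([
    # One binary column per category; unseen categories are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Travel Profile"]),
    # Subtract the mean and divide by the standard deviation.
    ("scale", StandardScaler(), ["Flight Distance"]),
])

X_ready = preprocess.fit_transform(X)
# The result may be sparse or dense depending on density; normalize to a dense array.
X_ready = X_ready.toarray() if hasattr(X_ready, "toarray") else np.asarray(X_ready)
print(X_ready)
```

In a real workflow the transformer is usually fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.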
|
| 38 |
+
|
| 39 |
+
Part 7.1: Defining the Classification Task
|
| 40 |
+
Strategy: Business Rule Threshold.
|
| 41 |
+
Logic: We set a threshold of 0 minutes. Any flight with an arrival delay > 0 is labeled as 1 (Delayed), and others as 0 (On Time). This aligns with passenger expectations for punctuality.
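
The labeling rule translates directly into code. The sketch below uses a toy series and an assumed column name; note that early arrivals (negative delays) and exactly on-time flights both map to 0.

```python
import pandas as pd

# Toy arrival delays in minutes; negative values mean the flight arrived early.
arrival_delay = pd.Series([-5, 0, 1, 42], name="Arrival Delay")

THRESHOLD_MINUTES = 0
# 1 = Delayed (delay strictly greater than the threshold), 0 = On Time.
is_delayed = (arrival_delay > THRESHOLD_MINUTES).astype(int)
print(is_delayed.tolist())
```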
|
| 42 |
+
|
| 43 |
+
Part 7.2: Class Balance Check
|
| 44 |
+
We analyzed the distribution of our target classes:
|
| 45 |
+
|
| 46 |
+
Delayed (1): 55%
|
| 47 |
+
|
| 48 |
+
On Time (0): 45%
|
| 49 |
+
Conclusion: The dataset is well-balanced, allowing the models to learn effectively from both categories.
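
A class balance check of this kind usually comes down to one call. The labels below are toy data constructed to mirror the reported 55/45 split, not the real target column.

```python
import pandas as pd

# Toy labels: 11 delayed and 9 on-time flights, i.e. a 55% / 45% split.
y = pd.Series([1] * 11 + [0] * 9, name="is_delayed")

# Relative frequency of each class.
share = y.value_counts(normalize=True)
print(share)
```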
|
| 50 |
+
|
| 51 |
+
[Insert Plot here: Bar chart of class distribution]
|
| 52 |
+
|
| 53 |
+
Part 8.1: Business Logic & Metrics
|
| 54 |
+
Critical Error: False Positive (predicting a delay when the flight is on time). This is critical because a passenger who trusts the predicted delay may arrive late at the airport and miss a flight that runs on schedule.
|
| 55 |
+
Primary Metric: Precision, the share of predicted delays that were actually delayed: TP / (TP + FP). Fewer False Positives means higher Precision.
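
For reference, Precision can be computed from two confusion-matrix counts. The helper and the counts below are hypothetical and only illustrate the arithmetic.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted delays that were real delays: TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        return 0.0  # no delay was predicted, so there is nothing to be precise about
    return true_positives / predicted_positive


# Hypothetical counts: 90 correct delay predictions, 10 false alarms.
print(precision(90, 10))
```

Each additional False Positive increases the denominator without increasing the numerator, which is why minimizing False Positives and maximizing Precision point in the same direction.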
|
| 56 |
+
|
| 57 |
+
Part 8.2: Training Classification Models
|
| 58 |
+
We trained three different algorithms:
|
| 59 |
+
|
| 60 |
+
Logistic Regression: A fast, linear baseline model.
|
| 61 |
+
|
| 62 |
+
Decision Tree: An interpretable model based on logical splits.
|
| 63 |
+
|
| 64 |
+
Random Forest: An ensemble model for high accuracy and complex patterns.
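
Training the three models side by side could look like the sketch below. It uses synthetic data from `make_classification` in place of the preprocessed flight features, and the hyperparameters are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the binary delay label.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the same training split so the comparison is fair.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```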
|
| 65 |
+
|
| 66 |
+
Part 8.3: Evaluation & Error Analysis
|
| 67 |
+
Each model was evaluated using a Classification Report and a Confusion Matrix.
|
| 68 |
+
|
| 69 |
+
Logistic Regression: Showed the lowest number of False Positives (686).
|
| 70 |
+
|
| 71 |
+
Random Forest: High overall performance but more False Positives (1470).
|
| 72 |
+
|
| 73 |
+
Decision Tree: Highest number of False Positives (2974).
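
The False Positive counts above come from each model's confusion matrix. A minimal sketch of how such a count can be read off with scikit-learn, using made-up labels rather than the project's predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix, precision_score

# Made-up ground truth and predictions (1 = Delayed, 0 = On Time).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]

# With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
prec = precision_score(y_true, y_pred)

print("False Positives:", fp)
print("Precision:", round(prec, 3))
print(classification_report(y_true, y_pred, target_names=["On Time", "Delayed"]))
```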
|
| 74 |
+
|
| 75 |
+
[Insert Plots here: The Purple Heatmaps for each model]
|
| 76 |
+
|
| 77 |
+
Part 8.4: Winner Selection & Export
|
| 78 |
+
Winner: Logistic Regression.
|
| 79 |
+
Reason: It provided the highest Precision, making it the safest model for passenger-facing applications by minimizing the risk of missed flights.
|
| 80 |
+
Export: The model was saved as best_flight_model.pkl for deployment.
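
The export step might look like the following. The serialization library is not named above, so the use of the standard `pickle` module is an assumption (`joblib` is another common choice), and the tiny model is a placeholder for the trained winner.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression

# Placeholder model trained on toy data; stands in for the selected winner.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Write to a temporary directory here; a real project would choose its own path.
path = Path(tempfile.mkdtemp()) / "best_flight_model.pkl"
with path.open("wb") as f:
    pickle.dump(model, f)

# Reload and confirm the restored model predicts the same labels.
with path.open("rb") as f:
    restored = pickle.load(f)
print(restored.predict([[0.0], [3.0]]))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.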
|