Yoel125 committed on
Commit
cdbbd6c
·
verified ·
1 Parent(s): 1f84e2c

Create README.md

Files changed (1)
  1. README.md +80 -0
README.md ADDED
@@ -0,0 +1,80 @@
1
+ Detailed Project Workflow (Parts 1-8)
2
+ Part 1: Data Loading & Initial Inspection
3
+ Action: Loading the raw flight dataset and performing an initial check using .info() and .describe().
4
+
5
+ Goal: Understanding the data types, identifying the number of features, and getting a sense of the scale of the flight delays.
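A minimal sketch of this inspection step, assuming pandas. The inline CSV and its column names are made-up stand-ins, since the README does not name the real file or its schema.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw flight CSV (real file name and columns are not given).
csv_text = """Departure Delay,Distance,Arrival Delay
5,300,7
0,1200,
45,800,50
"""

df = pd.read_csv(io.StringIO(csv_text))

# .info() prints column dtypes and non-null counts, which exposes missing values early.
df.info()

# .describe() summarizes count, mean, spread, and range for each numeric column.
summary = df.describe()
print(summary)
```

The `count` row of `summary` is a quick way to spot columns with missing entries, such as the blank arrival delay in the second toy row.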
6
+
7
+ Part 2: Exploratory Data Analysis (EDA)
8
+ Action: Visualizing relationships between features like Departure Delay, Distance, and our target Arrival Delay.
9
+
10
+ Goal: Identifying which factors have the strongest influence on flight delays.
11
+
12
+ 📸 [PLOT REQUIRED]: Correlation Heatmap. Insert the heatmap here to show how Departure Delay is the strongest predictor of Arrival Delay.
13
+
14
+ 📸 [PLOT REQUIRED]: Histograms. Insert distributions of delay minutes to show the "Long Tail" of the data.
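The correlation heatmap is a plot of a correlation matrix. The sketch below, assuming pandas, computes that matrix on made-up numbers and reads off the column for the target. The values and column names are illustrative only and are not taken from the project data.

```python
import pandas as pd

# Made-up toy data; not the project's dataset.
df = pd.DataFrame({
    "Departure Delay": [0, 5, 10, 60, 120],
    "Distance": [500, 300, 1500, 800, 400],
    "Arrival Delay": [0, 3, 12, 55, 118],
})

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()["Arrival Delay"].drop("Arrival Delay")
print(corr_with_target.sort_values(ascending=False))
```

In this toy example the departure delay correlates far more strongly with the arrival delay than distance does, which is the pattern the heatmap is meant to show.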
15
+
16
+ Part 3: Data Cleaning
17
+ Action: Handling missing values (NaNs) and removing redundant columns that don't contribute to prediction.
18
+
19
+ Goal: Ensuring the dataset is "clean" so the models don't learn from noise or missing information.
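The README does not say how missing values were treated, so the following is only one plausible approach, assuming pandas: drop a redundant identifier column, drop rows with no target value, and fill remaining gaps with the median. The `id` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy data; "id" is a hypothetical redundant column.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Departure Delay": [5.0, np.nan, 20.0, 0.0],
    "Arrival Delay": [7.0, 3.0, np.nan, 0.0],
})

# Remove a column that carries no predictive information.
clean = df.drop(columns=["id"])

# Rows without a target value cannot be used for supervised training.
clean = clean.dropna(subset=["Arrival Delay"])

# Fill remaining feature gaps with the column median (an assumed strategy).
median_departure = clean["Departure Delay"].median()
clean["Departure Delay"] = clean["Departure Delay"].fillna(median_departure)
```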
20
+
21
+ Part 4: Handling Outliers
22
+ Action: Identifying and treating extreme values in delays and distances.
23
+
24
+ Goal: Preventing "extreme" flights (e.g., 10-hour delays) from biasing the model and ruining the predictions for the majority of standard flights.
25
+
26
+ 📸 [PLOT REQUIRED]: Boxplots. Insert "Before & After" boxplots here to show how we handled the outliers.
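The outlier treatment itself is not specified here. As an illustration only, the sketch below caps values at the usual boxplot whiskers (1.5 times the interquartile range), which is an assumed method, applied to made-up delay values.

```python
import pandas as pd

# Made-up delays in minutes; 600 plays the role of a 10-hour delay.
delays = pd.Series([0, 5, 10, 15, 20, 600], name="Arrival Delay")

q1 = delays.quantile(0.25)
q3 = delays.quantile(0.75)
iqr = q3 - q1

# Whisker bounds used by a standard boxplot.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap (rather than drop) extreme values so no rows are lost.
capped = delays.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Capping keeps every flight in the dataset while limiting the pull of extreme delays. Dropping the rows instead would be an equally valid reading of "treating" outliers.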
27
+
28
+ Part 5: Feature Engineering - The "Travel Profile"
29
+ Action: This is a core part of our research. We combined several categorical variables to create a unique Travel Profile for each passenger (e.g., combining 'Purpose of Travel' with 'Flight Distance').
30
+
31
+ Goal: To capture the specific behavior of different types of travelers, which adds a layer of "human behavior" to the technical flight data.
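One way such a profile could be built, assuming pandas: bin the numeric distance into bands, then join the band with the travel purpose. The bin edges and band labels are hypothetical, and the data is made up.

```python
import pandas as pd

# Made-up toy data using the two columns named in the example above.
df = pd.DataFrame({
    "Purpose of Travel": ["Business", "Personal", "Business"],
    "Flight Distance": [300, 2500, 1200],
})

# Hypothetical distance bands; the project's real cut-offs are not stated.
distance_band = pd.cut(
    df["Flight Distance"],
    bins=[0, 500, 1500, float("inf")],
    labels=["Short", "Medium", "Long"],
)

# Combine purpose and distance band into a single categorical feature.
df["Travel Profile"] = df["Purpose of Travel"] + "_" + distance_band.astype(str)
print(df["Travel Profile"].tolist())
```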
32
+
33
+ Part 6: Final Preprocessing (Encoding & Scaling)
34
+ Action: 1. One-Hot Encoding: Converting the Travel Profile and other categories into numerical format.
35
+ 2. Standard Scaling: Standardizing numerical features (like Distance) to zero mean and unit variance so they are on a comparable scale.
36
+
37
+ Goal: Preparing the data for the algorithms. Models like Logistic Regression typically converge faster and apply regularization more evenly when all features are on a similar scale.
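A minimal sketch of the two preprocessing steps, assuming pandas for the one-hot encoding and scikit-learn's `StandardScaler` for scaling. The data and column names are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up toy data.
df = pd.DataFrame({
    "Travel Profile": ["Business_Short", "Personal_Long", "Business_Short"],
    "Distance": [300.0, 2500.0, 800.0],
})

# One-hot encode the categorical column into indicator columns.
encoded = pd.get_dummies(df, columns=["Travel Profile"])

# Standardize the numeric column to zero mean and unit variance.
scaler = StandardScaler()
encoded["Distance"] = scaler.fit_transform(encoded[["Distance"]]).ravel()
print(encoded)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.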
38
+
39
+ Part 7.1: Defining the Classification Task
40
+ Strategy: Business Rule Threshold.
41
+ Logic: We set a threshold of 0 minutes. Any flight with an arrival delay > 0 is labeled as 1 (Delayed), and others as 0 (On Time). This aligns with passenger expectations for punctuality.
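The labeling rule can be sketched as follows, assuming pandas. The delay values are made up, and `THRESHOLD_MINUTES` is an illustrative name.

```python
import pandas as pd

# Made-up arrival delays in minutes (negative means early arrival).
arrival_delay = pd.Series([-5, 0, 1, 30], name="Arrival Delay")

THRESHOLD_MINUTES = 0  # threshold stated in the text above

# Strictly greater than the threshold -> 1 (Delayed), otherwise 0 (On Time).
is_delayed = (arrival_delay > THRESHOLD_MINUTES).astype(int)
print(is_delayed.tolist())
```

Note that a flight arriving exactly on schedule (0 minutes) is labeled On Time, because the comparison is strict.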
42
+
43
+ Part 7.2: Class Balance Check
44
+ We analyzed the distribution of our target classes:
45
+
46
+ Delayed (1): 55%
47
+
48
+ On Time (0): 45%
49
+ Conclusion: The dataset is reasonably balanced (55/45), allowing the models to learn effectively from both categories.
50
+
51
+ [Insert Plot here: Bar chart of class distribution]
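A class balance check of this kind can be done with `value_counts`, assuming pandas. The labels below are made up and do not reproduce the project's 55/45 split.

```python
import pandas as pd

# Made-up labels: 1 = Delayed, 0 = On Time.
labels = pd.Series([1, 1, 1, 0, 0], name="is_delayed")

# Share of each class as a fraction of all rows.
shares = labels.value_counts(normalize=True)
print(shares)
```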
52
+
53
+ Part 8.1: Business Logic & Metrics
54
+ Critical Error: False Positive (Predicting a delay when the flight is on time). This is critical because a passenger who trusts the predicted delay may arrive at the airport late and miss a flight that leaves on schedule.
55
+ Primary Metric: Precision, the share of predicted delays that were actual delays. Maximizing it keeps False Positives low.
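For reference, precision is TP / (TP + FP), so every False Positive lowers it directly. The helper below is an illustrative sketch, and the counts are made up.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted delays that were real delays: TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        # No delay was predicted at all; return 0.0 by convention.
        return 0.0
    return true_positives / predicted_positive


# Made-up counts: 90 correct delay predictions, 10 false alarms.
example = precision(true_positives=90, false_positives=10)
print(example)
```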
56
+
57
+ Part 8.2: Training Classification Models
58
+ We trained three different algorithms:
59
+
60
+ Logistic Regression: A fast, linear baseline model.
61
+
62
+ Decision Tree: An interpretable model based on logical splits.
63
+
64
+ Random Forest: An ensemble model for high accuracy and complex patterns.
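A sketch of how the three models could be trained and compared on precision, assuming scikit-learn. It uses a synthetic dataset from `make_classification` and default hyperparameters, neither of which is taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features come from Parts 1-6.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its precision on the held-out split.
precisions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    precisions[name] = precision_score(y_test, model.predict(X_test))

print(precisions)
```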
65
+
66
+ Part 8.3: Evaluation & Error Analysis
67
+ Each model was evaluated using a Classification Report and a Confusion Matrix.
68
+
69
+ Logistic Regression: Showed the lowest number of False Positives (686).
70
+
71
+ Random Forest: High overall performance but more False Positives (1470) than Logistic Regression.
72
+
73
+ Decision Tree: Highest number of False Positives (2974).
74
+
75
+ [Insert Plots here: The Purple Heatmaps for each model]
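The False Positive counts above are read from each confusion matrix. A minimal sketch of extracting that cell with scikit-learn, on made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions: 1 = Delayed, 0 = On Time.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

# With labels=[0, 1], ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"False Positives: {fp}")
```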
76
+
77
+ Part 8.4: Winner Selection & Export
78
+ Winner: Logistic Regression.
79
+ Reason: It provided the highest Precision, making it the safest model for passenger-facing applications by minimizing the risk of missed flights.
80
+ Export: The model was saved as best_flight_model.pkl for deployment.
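The README does not state which library wrote the `.pkl` file. The sketch below assumes Python's standard `pickle` module and a tiny model fitted on made-up data, and writes to a temporary directory rather than the project folder.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression

# Tiny made-up training set so the example is self-contained.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Save under the file name used above, inside a throwaway directory.
out_path = Path(tempfile.mkdtemp()) / "best_flight_model.pkl"
with open(out_path, "wb") as f:
    pickle.dump(model, f)

# Reload and check the restored model predicts the same labels.
with open(out_path, "rb") as f:
    restored = pickle.load(f)
print(restored.predict([[0.0], [3.0]]))
```

Pickle files should only be loaded from trusted sources, and with a scikit-learn version compatible with the one that wrote them.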