Yoel125
/

Assignment_2_data_science


Yoel125 committed on Apr 29

Commit

cdbbd6c

1 Parent(s): 1f84e2c

Create README.md


README.md +80 -0

README.md ADDED

	@@ -0,0 +1,80 @@

+Detailed Project Workflow (Parts 1-8)
+Part 1: Data Loading & Initial Inspection
+Action: Loading the raw flight dataset and performing an initial check using .info() and .describe().
+Goal: Understanding the data types, identifying the number of features, and getting a sense of the scale of the flight delays.
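A minimal sketch of the inspection step in Part 1. The README does not name the data file or its exact columns, so the small DataFrame and its column names below are stand-ins chosen for illustration only.

```python
import pandas as pd

# Stand-in for the raw flight dataset; the file name and the exact
# column names are assumptions, not taken from the project.
df = pd.DataFrame({
    "Departure Delay": [0, 15, 120, 5],
    "Arrival Delay": [0.0, 12.0, 130.0, None],
    "Flight Distance": [300, 1200, 2500, 800],
})

df.info()                # dtypes and non-null counts per column
summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```

On the real data, the `count` row of `describe()` is a quick way to spot columns with missing values before the cleaning step.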
+Part 2: Exploratory Data Analysis (EDA)
+Action: Visualizing relationships between features like Departure Delay, Distance, and our target Arrival Delay.
+Goal: Identifying which factors have the strongest influence on flight delays.
+📸 [PLOT REQUIRED]: Correlation Heatmap. Insert the heatmap here to show how Departure Delay is the strongest predictor of Arrival Delay.
+📸 [PLOT REQUIRED]: Histograms. Insert distributions of delay minutes to show the "Long Tail" of the data.
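The numbers behind a correlation heatmap can be computed as sketched below. The data is synthetic and built so that arrival delay follows departure delay, which mirrors the relationship described above but does not reproduce the project's results; column names are assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic, long-tailed departure delays (illustrative only).
dep = rng.exponential(scale=15.0, size=n)
dist = rng.uniform(100, 4000, size=n)
# Arrival delay tracks departure delay by construction.
arr = dep + rng.normal(0, 5, size=n)

df = pd.DataFrame({
    "Departure Delay": dep,
    "Flight Distance": dist,
    "Arrival Delay": arr,
})

corr = df.corr()  # Pearson correlation matrix
# Rank the features by their correlation with the target.
target_corr = (
    corr["Arrival Delay"].drop("Arrival Delay").sort_values(ascending=False)
)
print(target_corr)
```

The `corr` matrix is what a heatmap would display, for example by passing it to `seaborn.heatmap(corr, annot=True)`.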
+Part 3: Data Cleaning
+Action: Handling missing values (NaNs) and removing redundant columns that don't contribute to prediction.
+Goal: Ensuring the dataset is "clean" so the models don't learn from noise or missing information.
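One possible shape for the cleaning step, as a sketch. The README does not say whether missing values were dropped or imputed, nor which columns were removed; dropping rows with a missing target and removing a hypothetical `id` column are assumptions made here.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],  # hypothetical redundant identifier column
    "Departure Delay": [0, 15, 120, 5],
    "Arrival Delay": [0.0, 12.0, None, 3.0],
})

# Count missing values per column before deciding how to treat them.
missing_per_column = df.isna().sum()

# Assumed strategy: drop the identifier, then drop rows without a target.
clean = (
    df.drop(columns=["id"])
      .dropna(subset=["Arrival Delay"])
      .reset_index(drop=True)
)
print(clean)
```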
+Part 4: Handling Outliers
+Action: Identifying and treating extreme values in delays and distances.
+Goal: Preventing extreme flights (e.g., 10-hour delays) from biasing the model and degrading predictions for the majority of standard flights.
+📸 [PLOT REQUIRED]: Boxplots. Insert "Before & After" boxplots here to show how we handled the outliers.
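The README does not state which outlier treatment was used. The sketch below shows one common option, clipping values to the 1.5 x IQR fences (the same fences a boxplot draws); the helper name `clip_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Illustrative delays in minutes, with one extreme 10-hour value.
delays = pd.Series(
    [0.0, 5.0, 10.0, 12.0, 15.0, 20.0, 25.0, 600.0], name="Arrival Delay"
)
clipped = clip_iqr(delays)
print(clipped.tolist())
```

Clipping keeps every row and only caps the extreme value; removing the rows instead is an equally valid choice when the extremes are believed to be data errors.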
+Part 5: Feature Engineering - The "Travel Profile"
+Action: This is a core part of our research. We combined several categorical variables to create a unique Travel Profile for each passenger (e.g., combining 'Purpose of Travel' with 'Flight Distance').
+Goal: To capture the specific behavior of different types of travelers, which adds a layer of "human behavior" to the technical flight data.
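A sketch of how a combined Travel Profile feature could be built from the two columns named in the example above. Binning 'Flight Distance' into bands first is an assumption, as are the bin edges and labels; the README does not describe how the distance was categorized.

```python
import pandas as pd

df = pd.DataFrame({
    "Purpose of Travel": ["Business", "Personal", "Business", "Personal"],
    "Flight Distance": [300, 2500, 4000, 800],
})

# Assumed bin edges and labels; not taken from the project.
distance_band = pd.cut(
    df["Flight Distance"],
    bins=[0, 1000, 3000, float("inf")],
    labels=["Short", "Medium", "Long"],
)

# Concatenate the two categorical values into a single profile label.
df["Travel Profile"] = df["Purpose of Travel"] + "_" + distance_band.astype(str)
print(df["Travel Profile"].tolist())
```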
+Part 6: Final Preprocessing (Encoding & Scaling)
+Action:
+1.  One-Hot Encoding: Converting the Travel Profile and other categorical features into numerical format.
+2.  Standard Scaling: Standardizing numerical features (like Distance) to zero mean and unit variance so they are on a comparable scale.
+Goal: Preparing the data for the algorithms. Models like Logistic Regression generally train more reliably when all numerical features are on a similar scale.
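The two preprocessing steps can be combined in a single scikit-learn `ColumnTransformer`, as sketched below. The column names and values are illustrative, and whether the project used a `ColumnTransformer` or applied the steps separately is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Travel Profile": [
        "Business_Short", "Personal_Long", "Business_Short", "Personal_Short",
    ],
    "Flight Distance": [300.0, 2500.0, 800.0, 400.0],
})

preprocess = ColumnTransformer([
    # One binary column per category; unseen categories encode as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Travel Profile"]),
    # Standardize to zero mean and unit variance.
    ("scale", StandardScaler(), ["Flight Distance"]),
])

X = preprocess.fit_transform(df)
# The result may be sparse depending on its density; convert for inspection.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it on the test split, avoids leaking test statistics into the scaler.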
+Part 7.1: Defining the Classification Task
+Strategy: Business Rule Threshold.
+Logic: We set a threshold of 0 minutes. Any flight with an arrival delay > 0 is labeled as 1 (Delayed), and others as 0 (On Time). This aligns with passenger expectations for punctuality.
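The labeling rule above translates directly into a comparison, sketched here on a few made-up delay values. The constant name `THRESHOLD_MINUTES` is introduced for illustration.

```python
import pandas as pd

# Made-up arrival delays in minutes; negative means an early arrival.
arrival_delay = pd.Series([-5, 0, 1, 45], name="Arrival Delay")

THRESHOLD_MINUTES = 0  # the business-rule threshold from Part 7.1

# 1 = Delayed (delay strictly above the threshold), 0 = On Time.
is_delayed = (arrival_delay > THRESHOLD_MINUTES).astype(int)
print(is_delayed.tolist())
```

Note that a delay of exactly 0 minutes counts as On Time because the comparison is strict.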
+Part 7.2: Class Balance Check
+We analyzed the distribution of our target classes:
+Delayed (1): 55%
+On Time (0): 45%
+Conclusion: With a 55/45 split, the dataset is reasonably balanced, allowing the models to learn effectively from both categories.
+[Insert Plot here: Bar chart of class distribution]
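A sketch of the balance check. The label vector below is synthetic and constructed to match the 55/45 proportions reported above; it is not the project's data.

```python
import pandas as pd

# Synthetic labels built to mirror the reported split (55 delayed, 45 on time).
y = pd.Series([1] * 55 + [0] * 45, name="is_delayed")

# Share of each class as a fraction of all rows.
class_share = y.value_counts(normalize=True)
print(class_share)
```

Calling `class_share.plot(kind="bar")` on the result is one way to produce the bar chart of the class distribution.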
+Part 8.1: Business Logic & Metrics
+Critical Error: False Positive (predicting a delay when the flight is actually on time). This is critical because a passenger who relies on the predicted delay could arrive late and miss their flight.
+Primary Metric: Precision, TP / (TP + FP), the share of predicted delays that were actually delayed. We want delay predictions to be trustworthy.
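To make the link between False Positives and Precision concrete, the sketch below computes Precision by hand from a confusion matrix and compares it with scikit-learn's `precision_score`. The labels are a toy example, not project results.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels: 1 = Delayed, 0 = On Time.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP): of all predicted delays, how many were real.
precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true, y_pred)
print(precision_manual, precision_sklearn)
```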
+Part 8.2: Training Classification Models
+We trained three different algorithms:
+Logistic Regression: A fast, linear baseline model.
+Decision Tree: An interpretable model based on logical splits.
+Random Forest: An ensemble model for high accuracy and complex patterns.
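A sketch of training the three model types side by side. The data comes from `make_classification` as a stand-in for the preprocessed flight features, and the hyperparameters and split ratio are assumptions; the README does not list the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and binary target.
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Assumed settings, chosen only to make the example reproducible.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit every model on the same training split, then predict on the test split.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
predictions = {name: model.predict(X_test) for name, model in fitted.items()}
```

Keeping the models in a dictionary makes it easy to loop over them later with the same evaluation code.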
+Part 8.3: Evaluation & Error Analysis
+Each model was evaluated using a Classification Report and a Confusion Matrix.
+Logistic Regression: Lowest number of False Positives (686).
+Random Forest: High overall performance but more False Positives (1470).
+Decision Tree: Highest number of False Positives (2974).
+[Insert Plots here: The Purple Heatmaps for each model]
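The evaluation described above can be sketched as follows. The labels are a toy example, so the counts bear no relation to the False Positive figures reported for the real models.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels: 0 = On Time, 1 = Delayed.
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

report = classification_report(
    y_test, y_pred, target_names=["On Time", "Delayed"]
)
print(report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
# True class 0 predicted as class 1: the False Positives.
false_positives = cm[0, 1]
print(cm, false_positives)
```

The `cm` array is the matrix a confusion-matrix heatmap would display for each model.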
+Part 8.4: Winner Selection & Export
+Winner: Logistic Regression.
+Reason: It provided the highest Precision, making it the safest model for passenger-facing applications by minimizing the risk of missed flights.
+Export: The model was saved as best_flight_model.pkl for deployment.
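A sketch of saving and reloading a fitted model under the file name mentioned above. Using the standard-library `pickle` module is an assumption (the `.pkl` extension is also common with `joblib`), and the model here is trained on synthetic data and written to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Write to a temporary directory so the example leaves the project untouched.
model_path = Path(tempfile.mkdtemp()) / "best_flight_model.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)

# Reload and confirm that the restored model predicts identically.
with model_path.open("rb") as f:
    restored = pickle.load(f)
print((restored.predict(X) == model.predict(X)).all())
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.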