Yoel125
/

Assignment_2_data_science


Yoel125 committed on Apr 30

Commit

1ccd450

verified ·

1 Parent(s): 529205a

Update README.md

Files changed (1)

README.md +44 -0


@@ -112,3 +112,47 @@ This visual confirmation ensures that the groups created by the K-Means algorith
 ![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oVyGiFtEU_f4miqJwzIpN.png)
 This shows that passengers are grouped into three distinct "service profiles" based on their ratings. This creates a meaningful "travel profile" feature, allowing us to test how different passenger experiences impact flight delays in our main research question.

+# Part 5: Model Training & Evaluation
+In this section, I compared three different machine learning algorithms to determine the most accurate model for predicting flight arrival delays using the engineered features.
+### 1. Updated Linear Regression (Refined Baseline)
+I re-trained the Linear Regression model using the full set of features, including the new K-Means passenger profiles and the encoded categorical variables.
+This allowed me to see how much the additional feature engineering improved the initial baseline performance.
+### 2. Decision Tree Regressor
+I implemented a Decision Tree model to capture non-linear relationships between the features.
+To limit overfitting, I set `max_depth=5`, which caps the tree's complexity so that it generalizes better to unseen data.
+### 3. Random Forest Regressor (Ensemble Method)
+I trained a Random Forest model consisting of 100 individual trees.
+By averaging the predictions of multiple trees, this ensemble approach typically reduces error and provides a more stable R^2 score than a single decision tree.
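The three-model comparison described above could be sketched roughly as follows, assuming scikit-learn. The synthetic data from `make_regression`, the 80/20 split, and all variable names are placeholders rather than the project's actual flight dataset or code; only `max_depth=5` and the 100 trees come from the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Placeholder data standing in for the engineered flight features (assumption).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in r2_scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

The ranking on this synthetic data says nothing about the real results; the sketch only illustrates the train-then-score loop shared by all three models.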
+### 4. Performance Comparison & Visualization
+I created a comparison bar chart to visualize which model explains the highest percentage of variance in arrival delays, making it easy to identify the top-performing algorithm.
+![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/rPEwtAabowCwz5O5_FH7O.png)
+We can see that the Random Forest model is the winner. Because it aggregates multiple decision trees, it has lower variance and is less prone to overfitting than a single tree, which leads to more stable and accurate predictions of arrival delays.
+# Part 7: Regression-to-Classification
+In this section, I reframed the original regression problem (predicting the exact number of delay minutes) into a classification problem. This allows for a different strategic approach to understanding flight punctuality.
+### 7.1 Creating Classes from Numeric Target
+Conversion Strategy: I applied a Business Rule Threshold to convert the continuous target into discrete categories.
+Threshold Selection: I defined the cutoff point at 0 minutes.
+From an operational perspective, any flight arriving even one minute after its scheduled time is considered delayed.
+Therefore:
+Class 0 (On-Time/Early): Arrival delay ≤ 0 minutes.
+Class 1 (Delayed): Arrival delay > 0 minutes.
+Implementation: This transformation was applied consistently to both the training and testing sets to ensure the validity of the classification models.
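A minimal sketch of the 0-minute threshold rule, assuming NumPy. The helper name `to_delay_class` and the sample delay values are hypothetical and only illustrate that the same function is applied to both the training and the testing targets.

```python
import numpy as np


def to_delay_class(delay_minutes):
    """Map arrival delay in minutes to 0 (on time or early) or 1 (delayed)."""
    return (np.asarray(delay_minutes) > 0).astype(int)


# Hypothetical delay values in minutes; negative means an early arrival.
y_train_minutes = [-5, 0, 1, 42]
y_test_minutes = [0, 3, -12]

# The same rule is applied to both splits.
y_train_class = to_delay_class(y_train_minutes)
y_test_class = to_delay_class(y_test_minutes)

print(y_train_class.tolist())  # [0, 0, 1, 1]
print(y_test_class.tolist())   # [0, 1, 0]
```

Note that a delay of exactly 0 minutes falls into Class 0, matching the `≤ 0` definition above.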
+### 7.2 Class Balance Analysis
+The results show that the classes are reasonably well balanced: 55% of flights are delayed and 45% are not.
+Since the two groups are close in size, the model can learn from both classes effectively.
+![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/FOYjMrNglJ4SnKVfwDQ1X.png)
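A class-balance check like the one summarized above can be sketched in plain Python. The helper `class_shares` and the label list are hypothetical; the list is constructed only to reproduce the reported 45% / 55% split, not taken from the real data.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in sorted(counts.items())}


# Hypothetical labels built to match the reported 45% / 55% split.
labels = [0] * 45 + [1] * 55
shares = class_shares(labels)
print(shares)  # {0: 0.45, 1: 0.55}
```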
+# Part 8: Classification Model Evaluation & Results
+In this final analytical stage, I evaluated the three trained classifiers using Confusion Matrices to understand their prediction patterns and error types.
+### Model Performance Analysis:
+Logistic Regression: This model served as a strong baseline, achieving the highest number of True Negatives (10,380). It is highly reliable at identifying on-time flights but struggles with a significant number of False Negatives (3,079), meaning it often misses actual delays.
+Decision Tree: While it captured the highest number of True Positives (5,989), it suffered from the highest rate of False Positives (2,974). This indicates that the single tree is prone to "over-detecting" delays, leading to many false alarms.
+Random Forest: This model provided the most balanced performance. It maintained a high count of True Negatives (9,596) while successfully identifying 5,902 delayed flights with significantly fewer false alarms than the Decision Tree.
+![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/TZzzpOliTb8WGjzw8W9PM.png)
+![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/X2R8v3yyBl3Op-6KrC9Ny.png)
+![image](https://cdn-uploads.huggingface.co/production/uploads/69c79aa8f856b118f80df631/oYWz2NUXdC0qmirSp50tU.png)
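To make the link between confusion-matrix counts and precision or recall concrete, here is a small sketch. The helper functions are illustrative and not the project's code. Only the Decision Tree's TP and FP counts quoted above are used, because the text does not give a complete matrix for any model, so recall is defined but not evaluated on the reported numbers.

```python
def precision(tp, fp):
    """Share of predicted delays that were real delays."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of real delays that the model caught."""
    return tp / (tp + fn)


# Decision Tree counts quoted above: TP = 5,989 and FP = 2,974.
dt_precision = precision(tp=5989, fp=2974)
print(f"Decision Tree precision: {dt_precision:.3f}")  # 0.668
```

In other words, roughly one in three delay predictions from the single tree was a false alarm.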
+The Random Forest model is the overall winner for this task. It offers a superior trade-off between precision and recall, making it the most robust tool for predicting