sks01dev committed on
Commit 14b2c6d
· verified · 1 Parent(s): 925b9ce

Delete Week 2

Files changed (2)
  1. Week 2/readme.md +0 -99
  2. Week 2/week_2.ipynb +0 -0
Week 2/readme.md DELETED
@@ -1,99 +0,0 @@
- # 🚗 Car Fuel Efficiency Predictor (ML Zoomcamp W2)
-
- ## ✍️ Description
-
- This project develops a **Linear Regression model** to predict car fuel efficiency in Miles Per Gallon (`fuel_efficiency_mpg`) from a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization.
-
- ### Key Objectives:
-
- * **Data Preprocessing:** Analyze missing values and compare imputation strategies (mean vs. zero).
- * **Model Training:** Implement a Linear Regression model using `scikit-learn`.
- * **Model Evaluation & Stability:** Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability.
- * **Regularization:** Test the effect of Ridge (L2) regularization to prevent overfitting.
-
- ---
-
- ## ⚙️ Installation
-
- To run this notebook, you need a standard Python environment with the following dependencies.
-
- 1. **Clone the repository:**
-    ```bash
-    git clone https://github.com/sks01dev/data-science-lab/
-    cd data-science-lab
-    ```
-
- 2. **Install dependencies:**
-    ```bash
-    pip install numpy pandas scikit-learn matplotlib seaborn
-    ```
-
- 3. **Download the data:** The dataset is fetched directly within the notebook from the official source, or can be downloaded manually:
-    ```bash
-    wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
-    ```
-
- ---
-
- ## 🏃 Usage
-
- Execute the `week_2.ipynb` notebook cell by cell to follow the full machine learning workflow.
-
- 1. **Data Selection:** Only the following columns are used: `'engine_displacement'`, `'horsepower'`, `'vehicle_weight'`, `'model_year'`, and `'fuel_efficiency_mpg'`.
- 2. **Splitting:** The data is shuffled (seed 42) and split into train (60%), validation (20%), and test (20%) sets.
- 3. **Model Comparison:** Linear Regression models are trained to compare performance after filling missing `'horsepower'` values with 0 vs. the training-set mean.
- 4. **Final Evaluation:** The best configuration is tested on the final, unseen test set.
-
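The shuffle-split and imputation steps above can be sketched as follows. This is a minimal sketch, not the notebook's exact code; the column names come from the list above, and the helper names (`split_60_20_20`, `val_rmse`) are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ['engine_displacement', 'horsepower', 'vehicle_weight', 'model_year']
TARGET = 'fuel_efficiency_mpg'

def split_60_20_20(df, seed=42):
    """Shuffle row indices with the given seed, then split 60/20/20."""
    n = len(df)
    n_val = n_test = int(0.2 * n)
    n_train = n - n_val - n_test
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    parts = (idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:])
    return tuple(df.iloc[p].reset_index(drop=True) for p in parts)

def val_rmse(df_train, df_val, fill=None):
    """Fill missing values (training-set horsepower mean by default),
    fit Linear Regression, and return the validation RMSE."""
    fill = df_train['horsepower'].mean() if fill is None else fill
    model = LinearRegression()
    model.fit(df_train[FEATURES].fillna(fill), df_train[TARGET])
    pred = model.predict(df_val[FEATURES].fillna(fill))
    return float(np.sqrt(np.mean((df_val[TARGET] - pred) ** 2)))
```

Comparing `val_rmse(df_train, df_val)` (mean imputation) against `val_rmse(df_train, df_val, fill=0)` mirrors the zero-vs-mean comparison in step 3.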
- ---
-
- ## 🛠️ Technologies Used
-
- | Technology | Purpose | Badge/Icon |
- | :--- | :--- | :--- |
- | **Python** | Core programming and analysis language | [![Python](https://img.shields.io/badge/Python-3.x-blue?style=flat-square&logo=python&logoColor=white)](https://www.python.org/doc/) |
- | **NumPy** | Numerical operations and RMSE calculation | [![NumPy](https://img.shields.io/badge/NumPy-1.x-blue?style=flat-square&logo=numpy&logoColor=white)](https://numpy.org/doc/) |
- | **Pandas** | Data loading, manipulation, and missing-value handling | [![Pandas](https://img.shields.io/badge/Pandas-2.x-150458?style=flat-square&logo=pandas&logoColor=white)](https://pandas.pydata.org/docs/) |
- | **scikit-learn** | Linear Regression, Ridge, and `train_test_split` | [![Scikit-learn](https://img.shields.io/badge/Scikit--learn-1.x-orange?style=flat-square&logo=scikit-learn&logoColor=white)](https://scikit-learn.org/stable/documentation.html) |
- | **Seaborn/Matplotlib** | Data visualization and distribution checks | [![Seaborn](https://img.shields.io/badge/Seaborn-0.12-darkgreen?style=flat-square&logo=seaborn&logoColor=white)](https://seaborn.pydata.org/) |
-
- ---
-
- ## 📊 Dataset Used
-
- * **Car Fuel Efficiency Dataset:** A collection of vehicle attributes (displacement, horsepower, weight, model year) used to predict fuel consumption (`fuel_efficiency_mpg`).
- * **Source:** Alexey Grigorev's public datasets repository.
-
- ---
-
- ## 🧠 Key Learnings
-
- 1. **Imputation Strategy:** Mean imputation of missing `'horsepower'` values (RMSE: 0.46) significantly outperformed filling with zero (RMSE: 0.51). Using the mean preserves the feature's distribution better than injecting an extreme outlier (0).
- 2. **Model Stability:** By testing split sensitivity across 10 random seeds, the model was determined to be **stable** ($\text{std} \approx 0.006$), confirming the reliability of the chosen split ratio.
- 3. **Regularization Impact:** Ridge (L2) regularization had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data.
-
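The regularization check in point 3 can be sketched like this. The helper name and the alpha grid are assumptions for illustration, not necessarily the notebook's exact values:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_rmse_by_alpha(X_train, y_train, X_val, y_val,
                        alphas=(0, 0.01, 0.1, 1, 10, 100)):
    """Fit a Ridge model for each alpha and return {alpha: validation RMSE}."""
    scores = {}
    for a in alphas:
        model = Ridge(alpha=a).fit(X_train, y_train)
        pred = model.predict(X_val)
        scores[a] = float(np.sqrt(np.mean((y_val - pred) ** 2)))
    return scores
```

If the scores barely move across alphas, as reported above, the unregularized model is not overfitting much.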
- ---
-
- ## ✨ Results
-
- | Question | Task | Result |
- | :--- | :--- | :--- |
- | **Q1** | Column with missing values | `horsepower` |
- | **Q2** | Median horsepower | 149.0 |
- | **Q3** | Best imputation (validation RMSE) | **With mean** (0.46) |
- | **Q4** | Best regularization $r$ (alpha) | **0** (RMSE: 0.51) |
- | **Q5** | Standard deviation of RMSEs | **0.006** |
- | **Q6** | Final test RMSE ($r = 0.001$, seed 9) | $\approx 0.520$ (closest to 0.515) |
-
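The Q5 stability check can be sketched as below. The `train_eval` callback is hypothetical: it stands for any function that re-splits the data with the given seed, retrains, and returns the validation RMSE:

```python
import numpy as np

def rmse_std_over_seeds(train_eval, seeds=range(10)):
    """Collect validation RMSEs across shuffle seeds and return
    their standard deviation, rounded to 3 decimal places."""
    scores = [train_eval(seed) for seed in seeds]
    return float(np.round(np.std(scores), 3))
```

A small value (0.006 in the results above) means the chosen split is not unusually lucky or unlucky.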
- ---
-
- ## 🚀 Future Work
-
- * **Feature Engineering:** Engineer new features (e.g., car age from `model_year`) to potentially improve the model's predictive power.
- * **Categorical Features:** Incorporate the original categorical columns (`origin`, `fuel_type`, `drivetrain`) using one-hot encoding or other techniques.
- * **Advanced Regression:** Test other models, such as Elastic Net or Random Forest, for improved accuracy.
-
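The categorical-features idea could look like this, using `pandas.get_dummies`. The column names are taken from the list above and are assumed to exist in the full dataset:

```python
import pandas as pd

def one_hot(df, cat_cols=('origin', 'fuel_type', 'drivetrain')):
    """One-hot encode whichever of the listed categorical columns are present."""
    present = [c for c in cat_cols if c in df.columns]
    return pd.get_dummies(df, columns=present)
```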
- ---
-
- ## 📚 References
-
- * *Alexey Grigorev - ML Zoomcamp Week 2 Materials*
Week 2/week_2.ipynb DELETED
The diff for this file is too large to render.