# 🚗 Car Fuel Efficiency Predictor (ML Zoomcamp W2)

## ✍️ Description

This project develops a **Linear Regression model** to predict car fuel efficiency in Miles Per Gallon (`fuel_efficiency_mpg`) using a subset of car features. The primary focus is on mastering the fundamental machine learning workflow: handling missing data, proper train/validation/test splitting, and understanding model stability and regularization.

### Key Objectives

* **Data Preprocessing:** Analyze missing values and compare imputation strategies (mean vs. zero).
* **Model Training:** Implement a Linear Regression model using `scikit-learn`.
* **Model Evaluation & Stability:** Use RMSE to evaluate model performance and assess the impact of different random seeds on model stability.
* **Regularization:** Test the effect of Ridge regularization (L2) to prevent overfitting.

---

## ⚙️ Installation

To run this notebook, you need a standard Python environment with the following dependencies.

1. **Clone the repository:**
   ```bash
   git clone https://github.com/sks01dev/data-science-lab/
   cd data-science-lab
   ```

2. **Install dependencies:**
   ```bash
   pip install numpy pandas scikit-learn matplotlib seaborn
   ```

3. **Download Data:** The dataset is fetched directly within the notebook from the official source:
   ```bash
   wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
   ```
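
Once the CSV is in place, the notebook's first checks (which column has gaps, and that column's median) follow the pattern sketched below. The tiny DataFrame here is a synthetic stand-in for the real file, with the same column names but made-up values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for car_fuel_efficiency.csv (same columns, made-up values).
df = pd.DataFrame({
    "engine_displacement": [170.0, 130.0, 250.0, 300.0, 210.0],
    "horsepower": [130.0, np.nan, 165.0, np.nan, 140.0],
    "vehicle_weight": [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
    "model_year": [2010, 2011, 2012, 2013, 2014],
    "fuel_efficiency_mpg": [18.0, 15.0, 16.0, 17.0, 15.5],
})

# Which columns contain missing values?
missing = df.isnull().sum()
print(missing[missing > 0])

# Median of the gappy column (computed over the non-missing entries only).
print(df["horsepower"].median())
```

On the real dataset, this is how `horsepower` is identified as the column with missing values.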

---

## 🚀 Usage

Execute the `week_2.ipynb` notebook cell by cell to follow the full machine learning workflow.

1. **Data Selection:** Only the following columns are used: `engine_displacement`, `horsepower`, `vehicle_weight`, `model_year`, and `fuel_efficiency_mpg`.
2. **Splitting:** The data is shuffled (seed 42) and split into Train (60%), Validation (20%), and Test (20%) sets.
3. **Model Comparison:** Linear Regression models are trained to compare performance after filling missing `horsepower` values with 0 vs. the training set mean.
4. **Final Evaluation:** The best configuration is tested on the final, unseen test set.
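
The shuffle-split-train loop in steps 2 and 3 can be sketched roughly as follows; synthetic data stands in for the real CSV here (the column names match the notebook, the values do not):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real dataset: same columns, made-up values.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "engine_displacement": rng.uniform(80, 400, n),
    "horsepower": rng.uniform(50, 250, n),
    "vehicle_weight": rng.uniform(1500, 4500, n),
    "model_year": rng.integers(1999, 2020, n).astype(float),
})
df["fuel_efficiency_mpg"] = 40.0 - 0.005 * df["vehicle_weight"] + rng.normal(0, 0.5, n)
df.loc[df.sample(frac=0.1, random_state=1).index, "horsepower"] = np.nan  # inject gaps

# Step 2: shuffle with a fixed seed, then split 60/20/20.
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)
n_val = n_test = int(0.2 * n)
n_train = n - n_val - n_test
df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)

# Step 3: fill missing horsepower with the *training* mean, train, evaluate.
features = ["engine_displacement", "horsepower", "vehicle_weight", "model_year"]
hp_mean = df_train["horsepower"].mean()
X_train = df_train[features].fillna({"horsepower": hp_mean})
X_val = df_val[features].fillna({"horsepower": hp_mean})

model = LinearRegression().fit(X_train, df_train["fuel_efficiency_mpg"])
pred = model.predict(X_val)
rmse = float(np.sqrt(np.mean((df_val["fuel_efficiency_mpg"] - pred) ** 2)))
print(round(rmse, 3))
```

Note that the imputation statistic is computed on the training split only, so no information leaks from validation or test into training.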

---

## 🛠️ Technologies Used

| Technology | Purpose | Docs |
| :--- | :--- | :--- |
| **Python** | Core programming and analysis language | [Docs](https://www.python.org/doc/) |
| **NumPy** | Numerical operations and RMSE calculation | [Docs](https://numpy.org/doc/) |
| **Pandas** | Data loading, manipulation, and missing value handling | [Docs](https://pandas.pydata.org/docs/) |
| **scikit-learn** | Linear Regression, Ridge, and `train_test_split` | [Docs](https://scikit-learn.org/stable/documentation.html) |
| **Seaborn/Matplotlib** | Data visualization and distribution checks | [Docs](https://seaborn.pydata.org/) |

---

## 📊 Dataset Used

* **Car Fuel Efficiency Dataset:** A collection of vehicle attributes (displacement, horsepower, weight, model year) used to predict fuel consumption (`fuel_efficiency_mpg`).
* **Source:** Alexey Grigorev's public datasets repository.

---

## 🧠 Key Learnings

1. **Imputation Strategy:** **Mean imputation** of missing `horsepower` values (RMSE: 0.46) significantly outperformed filling with zero (RMSE: 0.51). Using the mean preserves the feature's distribution better than injecting an extreme outlier (0).
2. **Model Stability:** By testing split sensitivity across 10 random seeds, the model was determined to be **stable** ($\text{std} \approx 0.006$), confirming the reliability of the chosen split ratio.
3. **Regularization Impact:** Ridge regularization (L2) had a negligible effect on the final RMSE scores, suggesting the Linear Regression model was not heavily overfitting the data.
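
The imputation comparison in point 1 can be reproduced in miniature. This sketch uses synthetic data with ~15% of one feature missing (a stand-in for `horsepower`); the exact RMSE values differ from the real notebook, but the ordering, mean-fill beating zero-fill, is what the experiment demonstrates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with gaps in one feature (stand-in for horsepower).
rng = np.random.default_rng(0)
n = 300
hp = rng.uniform(50, 250, n)
weight = rng.uniform(1500, 4500, n)
y = 40.0 - 0.005 * weight + 0.02 * hp + rng.normal(0, 0.3, n)
hp_obs = hp.copy()
hp_obs[rng.random(n) < 0.15] = np.nan  # ~15% missing at random

train, val = slice(0, 200), slice(200, 300)
results = {}
for strategy in ("zero", "mean"):
    # Zero-fill plants an extreme outlier; mean-fill stays inside the distribution.
    fill = 0.0 if strategy == "zero" else np.nanmean(hp_obs[train])
    X = np.column_stack([np.where(np.isnan(hp_obs), fill, hp_obs), weight])
    model = LinearRegression().fit(X[train], y[train])
    pred = model.predict(X[val])
    results[strategy] = round(float(np.sqrt(np.mean((y[val] - pred) ** 2))), 3)

print(results)  # mean-filling wins in this sketch, mirroring the notebook's finding
```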

---

## ✨ Results

| Question | Task | Result |
| :--- | :--- | :--- |
| **Q1** | Column with missing values | `horsepower` |
| **Q2** | Median horsepower | 149.0 |
| **Q3** | Best imputation (validation RMSE) | **With mean** (0.46) |
| **Q4** | Best regularization $r$ (alpha) | **0** (RMSE: 0.51) |
| **Q5** | Standard deviation of RMSEs | **0.006** |
| **Q6** | Final test RMSE ($r = 0.001$, seed 9) | $\approx 0.520$ (closest to 0.515) |
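
The regularization sweep behind Q4 follows a simple pattern: fit a Ridge model per candidate `r`, score each on the validation set, and keep the smallest RMSE. The sketch below runs on synthetic train/validation matrices (stand-ins for the prepared feature data), so the winning `r` and scores are illustrative only:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic train/validation data standing in for the prepared feature matrices.
rng = np.random.default_rng(9)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(0, 0.1, 300)
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]

# Validation RMSE for each candidate regularization strength.
scores = {}
for r in [0, 0.001, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=r).fit(X_train, y_train)
    scores[r] = round(float(np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2))), 4)

best_r = min(scores, key=scores.get)
print(best_r, scores[best_r])
```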

---

## 📈 Future Work

* **Feature Engineering:** Engineer new features (e.g., car age derived from `model_year`) to potentially improve the model's predictive power.
* **Categorical Features:** Incorporate the original categorical columns (`origin`, `fuel_type`, `drivetrain`) using one-hot encoding or other techniques.
* **Advanced Regression:** Test other models such as Elastic Net or Random Forest for improved accuracy.
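
For the categorical-features idea, `pd.get_dummies` is the simplest route. The values below are hypothetical examples for the dataset's categorical columns, not taken from the real data:

```python
import pandas as pd

# Hypothetical values for the dataset's categorical columns.
df = pd.DataFrame({
    "origin": ["USA", "Europe", "Asia", "USA"],
    "fuel_type": ["gas", "diesel", "gas", "gas"],
    "drivetrain": ["fwd", "rwd", "awd", "fwd"],
})

# One-hot encode each categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["origin", "fuel_type", "drivetrain"])
print(list(encoded.columns))
```

The resulting indicator columns can then be concatenated with the numeric features before training.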

---

## 📚 References

* *Alexey Grigorev - ML Zoomcamp Week 2 Materials*