Create readme.md
Week 3/readme.md
ADDED
@@ -0,0 +1,100 @@
# Machine Learning Zoomcamp 2025 - Homework 3

[Python](https://www.python.org/)
[pandas](https://pandas.pydata.org/)
[scikit-learn](https://scikit-learn.org/stable/)
[Jupyter](https://jupyter.org/)

---

## Homework 3: Machine Learning for Classification

This repository contains solutions for **Homework 3** of **Machine Learning Zoomcamp 2025**, focused on **classification tasks** using the lead scoring dataset.

---

## Project Overview

- **Dataset:** [Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- **Target variable:** `converted` (whether the client signed up)
- **Objective:** Data preprocessing, exploratory analysis, feature selection, and training logistic regression models (regularized and unregularized).

**Tech Stack:**
- **Python 3.11**: core programming language
- **Pandas**: data manipulation
- **NumPy**: numerical operations
- **Scikit-Learn**: machine learning models, feature selection, evaluation
- **Jupyter Notebook**: interactive coding and documentation

---

## Questions & Answers

| Question | Task | Answer |
|----------|------|--------|
| 1 | Mode of `industry` | `retail` |
| 2 | Biggest correlation (numerical features) | `annual_income` and `interaction_count` |
| 3 | Biggest mutual information (categorical features) | `lead_source` |
| 4 | Logistic regression validation accuracy | 0.74 |
| 5 | Least useful feature (feature elimination) | `lead_score` |
| 6 | Best `C` value for regularized logistic regression | 1 |
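
As an illustration of how the first two answers in the table could be obtained, here is a minimal sketch on a tiny invented DataFrame. The column names come from the table; the values are made up for the example, so its results say nothing about the real dataset.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the lead scoring data (invented values, illustration only).
df = pd.DataFrame({
    "industry": ["retail", "finance", "retail", "education", "retail"],
    "annual_income": [50000.0, 60000.0, 70000.0, 80000.0, 90000.0],
    "interaction_count": [1, 2, 3, 4, 5],
    "lead_score": [0.9, 0.1, 0.5, 0.3, 0.7],
})

# Question 1: the most frequent value of a categorical column.
industry_mode = df["industry"].mode()[0]

# Question 2: Pearson correlation matrix for the numerical columns,
# then the pair of distinct columns with the largest absolute correlation.
numerical = ["annual_income", "interaction_count", "lead_score"]
corr = df[numerical].corr()
best_pair = max(
    combinations(numerical, 2),
    key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
)

print(industry_mode)
print(best_pair)
```

Comparing absolute values treats strong negative correlations as "big" too; drop the `abs` if only positive correlation is of interest.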

---

## Approach / Key Steps

1. **Data Cleaning & Preparation**
   - Filled missing values: categorical → `'NA'`, numerical → `0.0`
   - Verified feature types and correlations

2. **Exploratory Analysis**
   - Mode of categorical variables
   - Correlation matrix for numerical features

3. **Feature Selection**
   - Calculated mutual information for categorical variables using `mutual_info_score`
   - Identified least useful features via feature elimination

4. **Model Training**
   - Logistic Regression with one-hot encoded categorical variables
   - Regularized logistic regression with hyperparameter tuning (`C` values)
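
The steps above can be sketched end to end as follows. This is a hedged illustration, not the notebook's actual code: the data is synthetic, and the 60/20/20 split, the `liblinear` solver, `random_state=42`, and the use of `DictVectorizer` for one-hot encoding are assumptions that this README does not state.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mutual_info_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented; the real notebook reads the CSV instead).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "lead_source": rng.choice(["referral", "paid_ads", "events"], size=n),
    "industry": rng.choice(["retail", "finance"], size=n),
    "interaction_count": rng.integers(0, 10, size=n).astype(float),
    "lead_score": rng.random(n),
})
df["converted"] = (df["interaction_count"] + rng.normal(0, 2, n) > 4).astype(int)
# Knock out some values to mimic missing data.
df.loc[rng.random(n) < 0.1, "lead_source"] = None
df.loc[rng.random(n) < 0.1, "interaction_count"] = np.nan

categorical = ["lead_source", "industry"]
numerical = ["interaction_count", "lead_score"]
features = categorical + numerical

# Step 1: fill missing values ('NA' for categorical, 0.0 for numerical).
df[categorical] = df[categorical].fillna("NA")
df[numerical] = df[numerical].fillna(0.0)

# Assumed split: 60% train, 20% validation, 20% test.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
y_train = df_train["converted"].to_numpy()
y_val = df_val["converted"].to_numpy()

# Step 3: mutual information between each categorical feature and the target.
mi = {
    c: round(mutual_info_score(df_train[c].to_numpy(), y_train), 2)
    for c in categorical
}

# Step 4: one-hot encode via DictVectorizer and fit logistic regression.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
X_val = dv.transform(df_val[features].to_dict(orient="records"))

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, model.predict(X_val))

print(mi)
print(round(val_accuracy, 2))
```

Fitting the vectorizer on the training split only, and merely transforming the validation split, keeps validation data from influencing the encoding.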

---

## Results

- Baseline logistic regression accuracy: **0.74**
- Least useful feature: **`lead_score`**
- Best regularization parameter `C`: **1**
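
For readers who want to see how the last two results could be produced, below is a rough sketch of leave-one-feature-out elimination and a sweep over `C`. It runs on synthetic data, and several details are assumptions not stated in this README: the helper name `validation_accuracy`, the candidate grid `[0.01, 0.1, 1, 10, 100]`, ranking features by the smallest absolute accuracy change, and breaking ties in favour of the smallest `C`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented values, no missing entries).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "industry": rng.choice(["retail", "finance"], size=n),
    "interaction_count": rng.integers(0, 10, size=n).astype(float),
    "lead_score": rng.random(n),
})
df["converted"] = (df["interaction_count"] + rng.normal(0, 2, n) > 4).astype(int)
df_train, df_val = train_test_split(df, test_size=0.25, random_state=42)


def validation_accuracy(features, C=1.0):
    """Fit on the training split using `features`, score on the validation split."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
    X_val = dv.transform(df_val[features].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X_train, df_train["converted"])
    return accuracy_score(df_val["converted"], model.predict(X_val))


all_features = ["industry", "interaction_count", "lead_score"]
baseline = validation_accuracy(all_features)

# Feature elimination: retrain without each feature and compare to the baseline.
differences = {
    f: baseline - validation_accuracy([c for c in all_features if c != f])
    for f in all_features
}
# Assumption: "least useful" means the smallest absolute change in accuracy.
least_useful = min(differences, key=lambda f: abs(differences[f]))

# Regularization sweep: smaller C means stronger regularization in scikit-learn.
candidates = [0.01, 0.1, 1, 10, 100]
scores = {C: round(validation_accuracy(all_features, C), 3) for C in candidates}
# Highest accuracy wins; ties go to the smallest C.
best_C = max(scores, key=lambda C: (scores[C], -C))

print(least_useful, best_C)
```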

---

## How to Run

1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/ml-zoomcamp-hw3.git
   ```

2. Install requirements:
   ```bash
   pip install -r requirements.txt
   ```

3. Open the Jupyter Notebook and run cells sequentially:
   ```bash
   jupyter notebook
   ```

---

## References

- [Lead Scoring Dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/)
- [Pandas Documentation](https://pandas.pydata.org/)
- [NumPy Documentation](https://numpy.org/)
- [Jupyter Notebook Documentation](https://jupyter.org/)

---