sks01dev committed on
Commit
3d827c0
·
1 Parent(s): 4108ad2

Create readme.md

Files changed (1)
  1. Week 4/readme.md +63 -0
Week 4/readme.md ADDED
@@ -0,0 +1,63 @@
+ # Lead Scoring with Bank Marketing Dataset
+
+ [![Python](https://img.shields.io/badge/Python-3.12-blue?logo=python&logoColor=white)](https://www.python.org/)
+ [![Scikit-Learn](https://img.shields.io/badge/scikit--learn-1.3.2-orange?logo=scikit-learn&logoColor=white)](https://scikit-learn.org/)
+ [![Jupyter Notebook](https://img.shields.io/badge/Jupyter-Notebook-orange?logo=jupyter&logoColor=white)](https://jupyter.org/)
+
+ ---
+
+ ## Overview
+
+ This notebook builds a **lead scoring model** on the Bank Marketing dataset. The goal is to predict whether a client will **convert** (sign up for a service) from the features recorded for each lead.
+
+ We cover:
+
+ 1. Data preparation and handling of missing values.
+ 2. Feature importance using ROC AUC for numerical variables.
+ 3. Logistic regression modeling with **one-hot encoding**.
+ 4. Precision, recall, and F1 score analysis to select thresholds.
+ 5. 5-fold cross-validation to check model stability.
+ 6. Hyperparameter tuning to select the best regularization parameter.
+
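Step 2 above scores each numerical variable by treating its raw values as if they were model predictions and computing ROC AUC against the target. A minimal, dependency-free sketch of that idea follows. The notebook itself most likely relies on `sklearn.metrics.roc_auc_score`; the toy values, the `other_feature` column, and the helper names `roc_auc` and `rank_numeric_features` are illustrative assumptions, not taken from the notebook. Negating a feature whose AUC falls below 0.5 is a common convention and is assumed here.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_numeric_features(columns, y_true):
    """Return (name, auc) pairs sorted from most to least predictive.

    `columns` maps a feature name to its list of values. A feature with
    AUC below 0.5 is negatively related to the target, so it is scored
    again with negated values (assumed convention).
    """
    ranked = []
    for name, values in columns.items():
        auc = roc_auc(y_true, values)
        if auc < 0.5:
            auc = roc_auc(y_true, [-v for v in values])
        ranked.append((name, auc))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy data for illustration only (not from the real dataset).
y_toy = [0, 0, 1, 1, 0, 1]
columns_toy = {
    "number_of_courses_viewed": [0, 1, 3, 5, 2, 4],
    "other_feature": [5, 1, 2, 3, 4, 0],  # hypothetical column
}
ranked = rank_numeric_features(columns_toy, y_toy)
for name, auc in ranked:
    print(f"{name}: {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, which is acceptable for a sketch but slow on a full dataset.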
+ ---
+
+ ## Key Results
+
+ - **Best numerical feature (ROC AUC):** `number_of_courses_viewed`
+ - **Validation AUC:** `0.794`
+ - **Threshold where precision ≈ recall:** `0.59`
+ - **Threshold with max F1:** `0.47`
+ - **Standard deviation of AUC across folds:** `0.01`
+ - **Best regularization parameter C:** `0.001`
+
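The two thresholds listed above come from sweeping a decision threshold over the predicted probabilities and computing precision, recall, and F1 at each step. The sketch below shows one way to run that sweep in plain Python. The 0.01 step, the toy labels and probabilities, and all function names are assumptions for illustration, not the notebook's code.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Precision, recall and F1 when predicting positive for prob >= threshold."""
    predicted = [p >= threshold for p in y_prob]
    tp = sum(1 for y, hit in zip(y_true, predicted) if hit and y == 1)
    fp = sum(1 for y, hit in zip(y_true, predicted) if hit and y == 0)
    fn = sum(1 for y, hit in zip(y_true, predicted) if not hit and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep_thresholds(y_true, y_prob, step=0.01):
    """Return rows of (threshold, precision, recall, f1) from 0.0 to 1.0."""
    n_steps = int(round(1 / step))
    rows = []
    for i in range(n_steps + 1):
        t = round(i * step, 10)  # avoid float drift such as 0.30000000000000004
        rows.append((t, *precision_recall_f1(y_true, y_prob, t)))
    return rows


def best_f1_threshold(rows):
    """First threshold that reaches the maximum F1."""
    return max(rows, key=lambda row: row[3])[0]


def crossing_threshold(rows):
    """First threshold where |precision - recall| is smallest.

    Thresholds with zero precision or zero recall are ignored.
    """
    candidates = [row for row in rows if row[1] > 0 and row[2] > 0]
    return min(candidates, key=lambda row: abs(row[1] - row[2]))[0]


# Toy validation labels and predicted probabilities (illustration only).
y_val_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_val_toy = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.55, 0.9]

rows = sweep_thresholds(y_val_toy, p_val_toy)
print("max-F1 threshold:", best_f1_threshold(rows))
print("precision ~ recall threshold:", crossing_threshold(rows))
```

When several thresholds tie, both helpers return the lowest one; a different tie-break may suit a particular business goal better.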
+ ---
+
+ ## Lessons Learned
+
+ - ROC AUC can help identify predictive features even before modeling.
+ - Logistic regression combined with one-hot encoding provides a strong baseline.
+ - Threshold tuning is crucial for balancing precision and recall based on business needs.
+ - Cross-validation checks how stable the model's performance is across folds and helps detect overfitting; it does not prevent it.
+ - Tuning the regularization parameter `C` can improve validation performance and reliability.
+
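The cross-validation and tuning lessons above can be sketched as a k-fold loop that collects one score per fold and summarizes it with a mean and a standard deviation, repeated for each candidate `C`. This is a stdlib-only illustration: `toy_score` is a placeholder for fitting a model and returning validation AUC, the candidate values are hypothetical, and the tie-break rule (highest mean, then lowest standard deviation, then smallest `C`) is an assumption rather than the notebook's documented choice.

```python
import random
import statistics


def kfold_indices(n_rows, n_splits=5, seed=1):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


def cross_val_scores(fit_and_score, n_rows, n_splits=5, seed=1):
    """Call fit_and_score(train_idx, val_idx) once per fold and collect scores."""
    return [
        fit_and_score(train_idx, val_idx)
        for train_idx, val_idx in kfold_indices(n_rows, n_splits, seed)
    ]


def pick_best_c(candidates, score_for_c, n_rows):
    """Return (best_c, results) where results[c] = (mean, population std)."""
    results = {}
    for c in candidates:
        scores = cross_val_scores(
            lambda tr, va: score_for_c(c, tr, va), n_rows
        )
        results[c] = (statistics.mean(scores), statistics.pstdev(scores))
    # Assumed tie-break: best rounded mean, then lowest rounded std, then smallest C.
    best_c = min(
        candidates,
        key=lambda c: (-round(results[c][0], 3), round(results[c][1], 3), c),
    )
    return best_c, results


def toy_score(c, train_idx, val_idx):
    """Placeholder only: a real version would train with regularization C
    on train_idx and return the AUC measured on val_idx."""
    return 1.0 / (1.0 + c)


best_c, results = pick_best_c([0.01, 0.1, 1], toy_score, n_rows=10)
print("best C on the toy scorer:", best_c)
```

`statistics.pstdev` is the population standard deviation, which matches the default behavior of `numpy.std`.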
+ ---
+
+ ## Environment
+
+ - Python 3.12
+ - Jupyter Notebook
+ - Libraries: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`
+
+ ---
+
+ ## Dataset
+
+ The Bank Marketing dataset used in this project is publicly available as `course_lead_scoring.csv`:
+ [Bank Marketing Dataset CSV](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv)
+
+ ---
+
+ ## Author
+
+ Created as part of **ML Zoomcamp 2025 Homework 4**.
+