Spaces:
Running
Running
Commit Β·
854c114
1
Parent(s): 84b67b2
feat: synchronize ML module files
Browse files- ML +0 -1
- ML/01_Python_Core_Mastery.ipynb +342 -0
- ML/02_Statistics_Foundations.ipynb +226 -0
- ML/03_NumPy_Practice.ipynb +202 -0
- ML/04_Pandas_Practice.ipynb +203 -0
- ML/05_Matplotlib_Seaborn_Practice.ipynb +210 -0
- ML/06_EDA_and_Feature_Engineering.ipynb +449 -0
- ML/07_Scikit_Learn_Practice.ipynb +214 -0
- ML/08_Linear_Regression.ipynb +277 -0
- ML/09_Logistic_Regression.ipynb +228 -0
- ML/10_Support_Vector_Machines.ipynb +196 -0
- ML/11_K_Nearest_Neighbors.ipynb +201 -0
- ML/12_Naive_Bayes.ipynb +162 -0
- ML/13_Decision_Trees_and_Random_Forests.ipynb +258 -0
- ML/14_Gradient_Boosting_XGBoost.ipynb +159 -0
- ML/15_KMeans_Clustering.ipynb +195 -0
- ML/16_Dimensionality_Reduction_PCA.ipynb +168 -0
- ML/17_Neural_Networks_Deep_Learning.ipynb +166 -0
- ML/18_Time_Series_Analysis.ipynb +159 -0
- ML/19_Natural_Language_Processing_NLP.ipynb +162 -0
- ML/20_Reinforcement_Learning_Basics.ipynb +194 -0
- ML/21_Kaggle_Project_Medical_Costs.ipynb +270 -0
- ML/22_SQL_for_Data_Science.ipynb +165 -0
- ML/23_Model_Explainability_SHAP.ipynb +158 -0
- ML/24_Deep_Learning_TensorFlow.ipynb +231 -0
- ML/25_Model_Deployment_Streamlit.ipynb +176 -0
- ML/26_End_to_End_ML_Project.ipynb +298 -0
- ML/CURRICULUM_REVIEW.md +229 -0
- ML/README.md +163 -0
- ML/README_Resources.md +29 -0
- ML/requirements.txt +13 -0
ML
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
Subproject commit 2b1395d13320096ad4915405782fdba6d287b5d5
|
|
|
|
|
|
ML/01_Python_Core_Mastery.ipynb
ADDED
|
@@ -0,0 +1,342 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Python Mastery: The COMPLETE Practice Notebook\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"This is your one-stop shop for mastering Core Python. To be a professional Data Scientist, you don't just need libraries; you need to understand the language that powers them. This notebook covers every major concept from basic types to Multithreading and Software Design Patterns.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Complete Curriculum:\n",
|
| 12 |
+
"1. **Basics**: Types, Strings, F-Strings, and Slicing.\n",
|
| 13 |
+
"2. **Data Structures**: Lists, Dictionaries, Tuples, and Sets.\n",
|
| 14 |
+
"3. **Control Flow**: Loops, Conditionals, Enumerate, and Zip.\n",
|
| 15 |
+
"4. **Productivity**: List/Dict Comprehensions & Generators.\n",
|
| 16 |
+
"5. **Functions**: Args, Kwargs, Lambdas, and Decorators.\n",
|
| 17 |
+
"6. **OOP (Advanced)**: Inheritance, Dunder Methods, and Static Methods.\n",
|
| 18 |
+
"7. **High-Level Programming**: Asynchronous Python (Async/Await).\n",
|
| 19 |
+
"8. **Concurrency**: Multithreading and Multi-processing.\n",
|
| 20 |
+
"9. **Software Design Patterns**: Singleton and Factory Patterns.\n",
|
| 21 |
+
"10. **Systems**: File I/O, Error Handling, and Datetime.\n",
|
| 22 |
+
"\n",
|
| 23 |
+
"---"
|
| 24 |
+
]
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"cell_type": "markdown",
|
| 28 |
+
"metadata": {},
|
| 29 |
+
"source": [
|
| 30 |
+
"## 1. Strings, F-Strings & Slicing\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"### Task 1: Formatting & Slicing\n",
|
| 33 |
+
"1. Use f-strings to print `pi = 3.14159` to 2 decimal places.\n",
|
| 34 |
+
"2. Reverse the string `\"DataScience\"` using slicing."
|
| 35 |
+
]
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"cell_type": "code",
|
| 39 |
+
"execution_count": null,
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"outputs": [],
|
| 42 |
+
"source": [
|
| 43 |
+
"pi = 3.14159\n",
|
| 44 |
+
"s = \"DataScience\"\n",
|
| 45 |
+
"# YOUR CODE HERE"
|
| 46 |
+
]
|
| 47 |
+
},
|
| 48 |
+
{
|
| 49 |
+
"cell_type": "markdown",
|
| 50 |
+
"metadata": {},
|
| 51 |
+
"source": [
|
| 52 |
+
"<details>\n",
|
| 53 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 54 |
+
"\n",
|
| 55 |
+
"```python\n",
|
| 56 |
+
"print(f\"Pi: {pi:.2f}\")\n",
|
| 57 |
+
"print(s[::-1])\n",
|
| 58 |
+
"```\n",
|
| 59 |
+
"</details>"
|
| 60 |
+
]
|
| 61 |
+
},
|
| 62 |
+
{
|
| 63 |
+
"cell_type": "markdown",
|
| 64 |
+
"metadata": {},
|
| 65 |
+
"source": [
|
| 66 |
+
"## 2. Advanced Data Structures\n",
|
| 67 |
+
"\n",
|
| 68 |
+
"### Task 2: Dictionaries & Sets\n",
|
| 69 |
+
"1. Convert the list `[1, 2, 2, 3, 3, 3]` to a set to find unique values.\n",
|
| 70 |
+
"2. Given `d = {'a': 1, 'b': 2}`, print all keys and values using a loop and `.items()`."
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "code",
|
| 75 |
+
"execution_count": null,
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"outputs": [],
|
| 78 |
+
"source": [
|
| 79 |
+
"d = {'a': 1, 'b': 2}\n",
|
| 80 |
+
"# YOUR CODE HERE"
|
| 81 |
+
]
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"cell_type": "markdown",
|
| 85 |
+
"metadata": {},
|
| 86 |
+
"source": [
|
| 87 |
+
"<details>\n",
|
| 88 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"```python\n",
|
| 91 |
+
"unique_vals = set([1, 2, 2, 3, 3, 3])\n",
|
| 92 |
+
"for k, v in d.items():\n",
|
| 93 |
+
" print(f\"Key: {k}, Value: {v}\")\n",
|
| 94 |
+
"```\n",
|
| 95 |
+
"</details>"
|
| 96 |
+
]
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "markdown",
|
| 100 |
+
"metadata": {},
|
| 101 |
+
"source": [
|
| 102 |
+
"## 3. Control Flow: Enumerate & Zip\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"### Task 3: Pairing Data\n",
|
| 105 |
+
"Combine `names = ['Alice', 'Bob']` and `ages = [25, 30]` using `zip` and print them as pairs."
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "code",
|
| 110 |
+
"execution_count": null,
|
| 111 |
+
"metadata": {},
|
| 112 |
+
"outputs": [],
|
| 113 |
+
"source": [
|
| 114 |
+
"names = ['Alice', 'Bob']\n",
|
| 115 |
+
"ages = [25, 30]\n",
|
| 116 |
+
"# YOUR CODE HERE"
|
| 117 |
+
]
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"cell_type": "markdown",
|
| 121 |
+
"metadata": {},
|
| 122 |
+
"source": [
|
| 123 |
+
"<details>\n",
|
| 124 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"```python\n",
|
| 127 |
+
"for name, age in zip(names, ages):\n",
|
| 128 |
+
" print(f\"{name} is {age} years old\")\n",
|
| 129 |
+
"```\n",
|
| 130 |
+
"</details>"
|
| 131 |
+
]
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"cell_type": "markdown",
|
| 135 |
+
"metadata": {},
|
| 136 |
+
"source": [
|
| 137 |
+
"## 4. Advanced Functions: Decorators & Generators\n",
|
| 138 |
+
"\n",
|
| 139 |
+
"### Task 4.1: Custom Decorator\n",
|
| 140 |
+
"Create a decorator called `@timer` that prints \"Starting...\" before a function runs and \"Finished!\" after it runs."
|
| 141 |
+
]
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"cell_type": "code",
|
| 145 |
+
"execution_count": null,
|
| 146 |
+
"metadata": {},
|
| 147 |
+
"outputs": [],
|
| 148 |
+
"source": [
|
| 149 |
+
"# YOUR CODE HERE"
|
| 150 |
+
]
|
| 151 |
+
},
|
| 152 |
+
{
|
| 153 |
+
"cell_type": "markdown",
|
| 154 |
+
"metadata": {},
|
| 155 |
+
"source": [
|
| 156 |
+
"<details>\n",
|
| 157 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 158 |
+
"\n",
|
| 159 |
+
"```python\n",
|
| 160 |
+
"def timer(func):\n",
|
| 161 |
+
" def wrapper(*args, **kwargs):\n",
|
| 162 |
+
" print(\"Starting...\")\n",
|
| 163 |
+
" result = func(*args, **kwargs)\n",
|
| 164 |
+
" print(\"Finished!\")\n",
|
| 165 |
+
" return result\n",
|
| 166 |
+
" return wrapper\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"@timer\n",
|
| 169 |
+
"def say_hello():\n",
|
| 170 |
+
" print(\"Hello!\")\n",
|
| 171 |
+
"\n",
|
| 172 |
+
"say_hello()\n",
|
| 173 |
+
"```\n",
|
| 174 |
+
"</details>"
|
| 175 |
+
]
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"cell_type": "markdown",
|
| 179 |
+
"metadata": {},
|
| 180 |
+
"source": [
|
| 181 |
+
"## 5. Object-Oriented Programming (Advanced)\n",
|
| 182 |
+
"\n",
|
| 183 |
+
"### Task 5: Dunder Methods & Static Methods\n",
|
| 184 |
+
"Create a class `Book` that:\n",
|
| 185 |
+
"1. Uses `__init__` for `title` and `author`.\n",
|
| 186 |
+
"2. Uses `__str__` to return `\"[Title] by [Author]\"`.\n",
|
| 187 |
+
"3. Has a `@staticmethod` called `is_valid_isbn(isbn)` that returns True if length is 13."
|
| 188 |
+
]
|
| 189 |
+
},
|
| 190 |
+
{
|
| 191 |
+
"cell_type": "code",
|
| 192 |
+
"execution_count": null,
|
| 193 |
+
"metadata": {},
|
| 194 |
+
"outputs": [],
|
| 195 |
+
"source": [
|
| 196 |
+
"# YOUR CODE HERE"
|
| 197 |
+
]
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"cell_type": "markdown",
|
| 201 |
+
"metadata": {},
|
| 202 |
+
"source": [
|
| 203 |
+
"<details>\n",
|
| 204 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 205 |
+
"\n",
|
| 206 |
+
"```python\n",
|
| 207 |
+
"class Book:\n",
|
| 208 |
+
" def __init__(self, title, author):\n",
|
| 209 |
+
" self.title = title\n",
|
| 210 |
+
" self.author = author\n",
|
| 211 |
+
" \n",
|
| 212 |
+
" def __str__(self):\n",
|
| 213 |
+
" return f\"{self.title} by {self.author}\"\n",
|
| 214 |
+
" \n",
|
| 215 |
+
" @staticmethod\n",
|
| 216 |
+
" def is_valid_isbn(isbn):\n",
|
| 217 |
+
" return len(str(isbn)) == 13\n",
|
| 218 |
+
"\n",
|
| 219 |
+
"b = Book(\"1984\", \"George Orwell\")\n",
|
| 220 |
+
"print(b)\n",
|
| 221 |
+
"print(Book.is_valid_isbn(1234567890123))\n",
|
| 222 |
+
"```\n",
|
| 223 |
+
"</details>"
|
| 224 |
+
]
|
| 225 |
+
},
|
| 226 |
+
{
|
| 227 |
+
"cell_type": "markdown",
|
| 228 |
+
"metadata": {},
|
| 229 |
+
"source": [
|
| 230 |
+
"## 6. High-Level Concepts: Concurrency\n",
|
| 231 |
+
"\n",
|
| 232 |
+
"### Task 6: Multithreading vs Multi-processing\n",
|
| 233 |
+
"Explain in a comment why you would use `threading` for I/O tasks and `multiprocessing` for CPU-bound tasks in Python (Hint: GIL)."
|
| 234 |
+
]
|
| 235 |
+
},
|
| 236 |
+
{
|
| 237 |
+
"cell_type": "code",
|
| 238 |
+
"execution_count": null,
|
| 239 |
+
"metadata": {},
|
| 240 |
+
"outputs": [],
|
| 241 |
+
"source": [
|
| 242 |
+
"import threading\n",
|
| 243 |
+
"import multiprocessing\n",
|
| 244 |
+
"\n",
|
| 245 |
+
"# YOUR ANSWER HERE (AS A COMMENT)"
|
| 246 |
+
]
|
| 247 |
+
},
|
| 248 |
+
{
|
| 249 |
+
"cell_type": "markdown",
|
| 250 |
+
"metadata": {},
|
| 251 |
+
"source": [
|
| 252 |
+
"<details>\n",
|
| 253 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 254 |
+
"\n",
|
| 255 |
+
"```python\n",
|
| 256 |
+
"# Multithreading: Efficient for I/O-bound tasks (like waiting for a web response)\n",
|
| 257 |
+
"# because the GIL (Global Interpreter Lock) prevents multiple threads from \n",
|
| 258 |
+
"# executing Python bytecode at once, but allows waiting for I/O.\n",
|
| 259 |
+
"\n",
|
| 260 |
+
"# Multiprocessing: Efficient for CPU-bound tasks (like heavy math/ML matrix multiplication)\n",
|
| 261 |
+
"# because it creates separate memory spaces and separate GILs for each process,\n",
|
| 262 |
+
"# bypassing the GIL limitation entirely.\n",
|
| 263 |
+
"```\n",
|
| 264 |
+
"</details>"
|
| 265 |
+
]
|
| 266 |
+
},
|
| 267 |
+
{
|
| 268 |
+
"cell_type": "markdown",
|
| 269 |
+
"metadata": {},
|
| 270 |
+
"source": [
|
| 271 |
+
"## 7. Software Design Patterns\n",
|
| 272 |
+
"\n",
|
| 273 |
+
"### Task 7: The Singleton Pattern\n",
|
| 274 |
+
"Implement a Singleton class called `DatabaseConnection` that ensures only one instance of the class can ever be created."
|
| 275 |
+
]
|
| 276 |
+
},
|
| 277 |
+
{
|
| 278 |
+
"cell_type": "code",
|
| 279 |
+
"execution_count": null,
|
| 280 |
+
"metadata": {},
|
| 281 |
+
"outputs": [],
|
| 282 |
+
"source": [
|
| 283 |
+
"# YOUR CODE HERE"
|
| 284 |
+
]
|
| 285 |
+
},
|
| 286 |
+
{
|
| 287 |
+
"cell_type": "markdown",
|
| 288 |
+
"metadata": {},
|
| 289 |
+
"source": [
|
| 290 |
+
"<details>\n",
|
| 291 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 292 |
+
"\n",
|
| 293 |
+
"```python\n",
|
| 294 |
+
"class DatabaseConnection:\n",
|
| 295 |
+
" _instance = None\n",
|
| 296 |
+
" \n",
|
| 297 |
+
" def __new__(cls):\n",
|
| 298 |
+
" if cls._instance is None:\n",
|
| 299 |
+
" print(\"Initializing new database connection instance...\")\n",
|
| 300 |
+
" cls._instance = super(DatabaseConnection, cls).__new__(cls)\n",
|
| 301 |
+
" return cls._instance\n",
|
| 302 |
+
"\n",
|
| 303 |
+
"db1 = DatabaseConnection()\n",
|
| 304 |
+
"db2 = DatabaseConnection()\n",
|
| 305 |
+
"print(\"Are they the same instance?\", db1 is db2)\n",
|
| 306 |
+
"```\n",
|
| 307 |
+
"</details>"
|
| 308 |
+
]
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"cell_type": "markdown",
|
| 312 |
+
"metadata": {},
|
| 313 |
+
"source": [
|
| 314 |
+
"--- \n",
|
| 315 |
+
"### π You are now a Python Master Engineer! \n",
|
| 316 |
+
"With these additions, you have covered everything from basic variables to Singleton patterns and GIL-based concurrency. \n",
|
| 317 |
+
"You are fully prepared to build high-scale machine learning systems."
|
| 318 |
+
]
|
| 319 |
+
}
|
| 320 |
+
],
|
| 321 |
+
"metadata": {
|
| 322 |
+
"kernelspec": {
|
| 323 |
+
"display_name": "Python 3",
|
| 324 |
+
"language": "python",
|
| 325 |
+
"name": "python3"
|
| 326 |
+
},
|
| 327 |
+
"language_info": {
|
| 328 |
+
"codemirror_mode": {
|
| 329 |
+
"name": "ipython",
|
| 330 |
+
"version": 3
|
| 331 |
+
},
|
| 332 |
+
"file_extension": ".py",
|
| 333 |
+
"mimetype": "text/x-python",
|
| 334 |
+
"name": "python",
|
| 335 |
+
"nbconvert_exporter": "python",
|
| 336 |
+
"pygments_lexer": "ipython3",
|
| 337 |
+
"version": "3.12.7"
|
| 338 |
+
}
|
| 339 |
+
},
|
| 340 |
+
"nbformat": 4,
|
| 341 |
+
"nbformat_minor": 4
|
| 342 |
+
}
|
ML/02_Statistics_Foundations.ipynb
ADDED
|
@@ -0,0 +1,226 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 02 - Statistical Foundations\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Before diving into Machine Learning, it's essential to understand the data through **Statistics**. This module covers the foundational concepts you'll need for data analysis.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Complete Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)** on your hub for interactive demos on Population vs. Sample, Central Tendency, and Dispersion.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Central Tendency**: Mean, Median, and Mode.\n",
|
| 16 |
+
"2. **Dispersion**: Standard Deviation, Variance, and IQR.\n",
|
| 17 |
+
"3. **Probability Distributions**: Normal Distribution and Z-Scores.\n",
|
| 18 |
+
"4. **Hypothesis Testing**: Understanding p-values.\n",
|
| 19 |
+
"5. **Correlation**: Relationship between variables.\n",
|
| 20 |
+
"\n",
|
| 21 |
+
"---"
|
| 22 |
+
]
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"cell_type": "markdown",
|
| 26 |
+
"metadata": {},
|
| 27 |
+
"source": [
|
| 28 |
+
"## 1. Setup"
|
| 29 |
+
]
|
| 30 |
+
},
|
| 31 |
+
{
|
| 32 |
+
"cell_type": "code",
|
| 33 |
+
"execution_count": null,
|
| 34 |
+
"metadata": {},
|
| 35 |
+
"outputs": [],
|
| 36 |
+
"source": [
|
| 37 |
+
"import pandas as pd\n",
|
| 38 |
+
"import numpy as np\n",
|
| 39 |
+
"import matplotlib.pyplot as plt\n",
|
| 40 |
+
"import seaborn as sns\n",
|
| 41 |
+
"from scipy import stats\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"np.random.seed(42)\n",
|
| 44 |
+
"data = np.random.normal(loc=100, scale=15, size=1000)\n",
|
| 45 |
+
"df = pd.DataFrame(data, columns=['Score'])\n",
|
| 46 |
+
"df.head()"
|
| 47 |
+
]
|
| 48 |
+
},
|
| 49 |
+
{
|
| 50 |
+
"cell_type": "markdown",
|
| 51 |
+
"metadata": {},
|
| 52 |
+
"source": [
|
| 53 |
+
"## 2. Central Tendency & Dispersion\n",
|
| 54 |
+
"\n",
|
| 55 |
+
"### Task 1: Basic Stats\n",
|
| 56 |
+
"Calculate the Mean, Median, and Standard Deviation of the `Score` column."
|
| 57 |
+
]
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"cell_type": "code",
|
| 61 |
+
"execution_count": null,
|
| 62 |
+
"metadata": {},
|
| 63 |
+
"outputs": [],
|
| 64 |
+
"source": [
|
| 65 |
+
"# YOUR CODE HERE"
|
| 66 |
+
]
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"cell_type": "markdown",
|
| 70 |
+
"metadata": {},
|
| 71 |
+
"source": [
|
| 72 |
+
"<details>\n",
|
| 73 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"```python\n",
|
| 76 |
+
"print(f\"Mean: {df['Score'].mean()}\")\n",
|
| 77 |
+
"print(f\"Median: {df['Score'].median()}\")\n",
|
| 78 |
+
"print(f\"Std Dev: {df['Score'].std()}\")\n",
|
| 79 |
+
"```\n",
|
| 80 |
+
"</details>"
|
| 81 |
+
]
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"cell_type": "markdown",
|
| 85 |
+
"metadata": {},
|
| 86 |
+
"source": [
|
| 87 |
+
"## 3. Z-Scores & Outliers\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"### Task 2: Finding Outliers\n",
|
| 90 |
+
"A point is often considered an outlier if its Z-score is greater than 3 or less than -3. Help identify any outliers in the dataset.\n",
|
| 91 |
+
"\n",
|
| 92 |
+
"*Web Reference: [Outlier Detection Demo](https://aashishgarg13.github.io/DataScience/feature-engineering/)*"
|
| 93 |
+
]
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"cell_type": "code",
|
| 97 |
+
"execution_count": null,
|
| 98 |
+
"metadata": {},
|
| 99 |
+
"outputs": [],
|
| 100 |
+
"source": [
|
| 101 |
+
"# YOUR CODE HERE"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "markdown",
|
| 106 |
+
"metadata": {},
|
| 107 |
+
"source": [
|
| 108 |
+
"<details>\n",
|
| 109 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 110 |
+
"\n",
|
| 111 |
+
"```python\n",
|
| 112 |
+
"df['z_score'] = stats.zscore(df['Score'])\n",
|
| 113 |
+
"outliers = df[df['z_score'].abs() > 3]\n",
|
| 114 |
+
"print(f\"Number of outliers: {len(outliers)}\")\n",
|
| 115 |
+
"print(outliers)\n",
|
| 116 |
+
"```\n",
|
| 117 |
+
"</details>"
|
| 118 |
+
]
|
| 119 |
+
},
|
| 120 |
+
{
|
| 121 |
+
"cell_type": "markdown",
|
| 122 |
+
"metadata": {},
|
| 123 |
+
"source": [
|
| 124 |
+
"## 4. Correlation\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"### Task 3: Correlation Matrix\n",
|
| 127 |
+
"Generate a second column `StudyTime` that is correlated with `Score` and calculate the Pearson correlation coefficient."
|
| 128 |
+
]
|
| 129 |
+
},
|
| 130 |
+
{
|
| 131 |
+
"cell_type": "code",
|
| 132 |
+
"execution_count": null,
|
| 133 |
+
"metadata": {},
|
| 134 |
+
"outputs": [],
|
| 135 |
+
"source": [
|
| 136 |
+
"df['StudyTime'] = df['Score'] * 0.5 + np.random.normal(0, 5, 1000)\n",
|
| 137 |
+
"# YOUR CODE HERE"
|
| 138 |
+
]
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"cell_type": "markdown",
|
| 142 |
+
"metadata": {},
|
| 143 |
+
"source": [
|
| 144 |
+
"<details>\n",
|
| 145 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"```python\n",
|
| 148 |
+
"correlation = df.corr()\n",
|
| 149 |
+
"print(correlation)\n",
|
| 150 |
+
"sns.heatmap(correlation, annot=True, cmap='coolwarm')\n",
|
| 151 |
+
"plt.show()\n",
|
| 152 |
+
"```\n",
|
| 153 |
+
"</details>"
|
| 154 |
+
]
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"cell_type": "markdown",
|
| 158 |
+
"metadata": {},
|
| 159 |
+
"source": [
|
| 160 |
+
"## 5. Hypothesis Testing (p-values)\n",
|
| 161 |
+
"\n",
|
| 162 |
+
"### Task 4: T-Test\n",
|
| 163 |
+
"Test if the mean of our `Score` is significantly different from 100 using a 1-sample T-test. What is the p-value?"
|
| 164 |
+
]
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"cell_type": "code",
|
| 168 |
+
"execution_count": null,
|
| 169 |
+
"metadata": {},
|
| 170 |
+
"outputs": [],
|
| 171 |
+
"source": [
|
| 172 |
+
"# YOUR CODE HERE"
|
| 173 |
+
]
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"cell_type": "markdown",
|
| 177 |
+
"metadata": {},
|
| 178 |
+
"source": [
|
| 179 |
+
"<details>\n",
|
| 180 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"```python\n",
|
| 183 |
+
"t_stat, p_val = stats.ttest_1samp(df['Score'], 100)\n",
|
| 184 |
+
"print(f\"T-statistic: {t_stat}\")\n",
|
| 185 |
+
"print(f\"P-value: {p_val}\")\n",
|
| 186 |
+
"if p_val < 0.05:\n",
|
| 187 |
+
" print(\"Statistically significant difference!\")\n",
|
| 188 |
+
"else:\n",
|
| 189 |
+
" print(\"No significant difference.\")\n",
|
| 190 |
+
"```\n",
|
| 191 |
+
"</details>"
|
| 192 |
+
]
|
| 193 |
+
},
|
| 194 |
+
{
|
| 195 |
+
"cell_type": "markdown",
|
| 196 |
+
"metadata": {},
|
| 197 |
+
"source": [
|
| 198 |
+
"--- \n",
|
| 199 |
+
"### Foundational Knowledge Unlocked! \n",
|
| 200 |
+
"You have now mastered the mathematical core of data analysis.\n",
|
| 201 |
+
"Next: **NumPy Mastery**."
|
| 202 |
+
]
|
| 203 |
+
}
|
| 204 |
+
],
|
| 205 |
+
"metadata": {
|
| 206 |
+
"kernelspec": {
|
| 207 |
+
"display_name": "Python 3",
|
| 208 |
+
"language": "python",
|
| 209 |
+
"name": "python3"
|
| 210 |
+
},
|
| 211 |
+
"language_info": {
|
| 212 |
+
"codemirror_mode": {
|
| 213 |
+
"name": "ipython",
|
| 214 |
+
"version": 3
|
| 215 |
+
},
|
| 216 |
+
"file_extension": ".py",
|
| 217 |
+
"mimetype": "text/x-python",
|
| 218 |
+
"name": "python",
|
| 219 |
+
"nbconvert_exporter": "python",
|
| 220 |
+
"pygments_lexer": "ipython3",
|
| 221 |
+
"version": "3.12.7"
|
| 222 |
+
}
|
| 223 |
+
},
|
| 224 |
+
"nbformat": 4,
|
| 225 |
+
"nbformat_minor": 4
|
| 226 |
+
}
|
ML/03_NumPy_Practice.ipynb
ADDED
|
@@ -0,0 +1,202 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Python Library Practice: NumPy\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"NumPy is the fundamental package for scientific computing in Python. It provides high-performance multidimensional array objects and tools for working with them.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for Linear Algebra concepts that use NumPy.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Array Creation**: Create arrays from lists and using built-in functions.\n",
|
| 16 |
+
"2. **Array Operations**: Element-wise math and broadcasting.\n",
|
| 17 |
+
"3. **Indexing & Slicing**: Selecting specific data points.\n",
|
| 18 |
+
"4. **Linear Algebra**: Matrix multiplication and dot products.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. Array Creation\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"### Task 1: Create Basics\n",
|
| 30 |
+
"1. Create a 1D array of numbers from 0 to 9.\n",
|
| 31 |
+
"2. Create a 3x3 identity matrix."
|
| 32 |
+
]
|
| 33 |
+
},
|
| 34 |
+
{
|
| 35 |
+
"cell_type": "code",
|
| 36 |
+
"execution_count": null,
|
| 37 |
+
"metadata": {},
|
| 38 |
+
"outputs": [],
|
| 39 |
+
"source": [
|
| 40 |
+
"import numpy as np\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# YOUR CODE HERE\n"
|
| 43 |
+
]
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"cell_type": "markdown",
|
| 47 |
+
"metadata": {},
|
| 48 |
+
"source": [
|
| 49 |
+
"<details>\n",
|
| 50 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 51 |
+
"\n",
|
| 52 |
+
"```python\n",
|
| 53 |
+
"arr1 = np.arange(10)\n",
|
| 54 |
+
"identity = np.eye(3)\n",
|
| 55 |
+
"print(arr1)\n",
|
| 56 |
+
"print(identity)\n",
|
| 57 |
+
"```\n",
|
| 58 |
+
"</details>"
|
| 59 |
+
]
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"cell_type": "markdown",
|
| 63 |
+
"metadata": {},
|
| 64 |
+
"source": [
|
| 65 |
+
"## 2. Array Operations\n",
|
| 66 |
+
"\n",
|
| 67 |
+
"### Task 2: Vector Math\n",
|
| 68 |
+
"Given two arrays `a = [10, 20, 30]` and `b = [1, 2, 3]`, perform addition, subtraction, and element-wise multiplication."
|
| 69 |
+
]
|
| 70 |
+
},
|
| 71 |
+
{
|
| 72 |
+
"cell_type": "code",
|
| 73 |
+
"execution_count": null,
|
| 74 |
+
"metadata": {},
|
| 75 |
+
"outputs": [],
|
| 76 |
+
"source": [
|
| 77 |
+
"a = np.array([10, 20, 30])\n",
|
| 78 |
+
"b = np.array([1, 2, 3])\n",
|
| 79 |
+
"\n",
|
| 80 |
+
"# YOUR CODE HERE\n"
|
| 81 |
+
]
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"cell_type": "markdown",
|
| 85 |
+
"metadata": {},
|
| 86 |
+
"source": [
|
| 87 |
+
"<details>\n",
|
| 88 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"```python\n",
|
| 91 |
+
"print(\"Add:\", a + b)\n",
|
| 92 |
+
"print(\"Sub:\", a - b)\n",
|
| 93 |
+
"print(\"Mul:\", a * b)\n",
|
| 94 |
+
"```\n",
|
| 95 |
+
"</details>"
|
| 96 |
+
]
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "markdown",
|
| 100 |
+
"metadata": {},
|
| 101 |
+
"source": [
|
| 102 |
+
"## 3. Indexing and Slicing\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"### Task 3: Select Subsets\n",
|
| 105 |
+
"Create a 4x4 matrix and extract the middle 2x2 square."
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "code",
|
| 110 |
+
"execution_count": null,
|
| 111 |
+
"metadata": {},
|
| 112 |
+
"outputs": [],
|
| 113 |
+
"source": [
|
| 114 |
+
"mat = np.arange(16).reshape(4, 4)\n",
|
| 115 |
+
"print(\"Original:\\n\", mat)\n",
|
| 116 |
+
"\n",
|
| 117 |
+
"# YOUR CODE HERE\n"
|
| 118 |
+
]
|
| 119 |
+
},
|
| 120 |
+
{
|
| 121 |
+
"cell_type": "markdown",
|
| 122 |
+
"metadata": {},
|
| 123 |
+
"source": [
|
| 124 |
+
"<details>\n",
|
| 125 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 126 |
+
"\n",
|
| 127 |
+
"```python\n",
|
| 128 |
+
"middle = mat[1:3, 1:3]\n",
|
| 129 |
+
"print(\"Middle 2x2:\\n\", middle)\n",
|
| 130 |
+
"```\n",
|
| 131 |
+
"</details>"
|
| 132 |
+
]
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"cell_type": "markdown",
|
| 136 |
+
"metadata": {},
|
| 137 |
+
"source": [
|
| 138 |
+
"## 4. Statistics with NumPy\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"### Task 4: Aggregations\n",
|
| 141 |
+
"Calculate the mean, standard deviation, and sum of a random 100-element array."
|
| 142 |
+
]
|
| 143 |
+
},
|
| 144 |
+
{
|
| 145 |
+
"cell_type": "code",
|
| 146 |
+
"execution_count": null,
|
| 147 |
+
"metadata": {},
|
| 148 |
+
"outputs": [],
|
| 149 |
+
"source": [
|
| 150 |
+
"data = np.random.randn(100)\n",
|
| 151 |
+
"\n",
|
| 152 |
+
"# YOUR CODE HERE\n"
|
| 153 |
+
]
|
| 154 |
+
},
|
| 155 |
+
{
|
| 156 |
+
"cell_type": "markdown",
|
| 157 |
+
"metadata": {},
|
| 158 |
+
"source": [
|
| 159 |
+
"<details>\n",
|
| 160 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 161 |
+
"\n",
|
| 162 |
+
"```python\n",
|
| 163 |
+
"print(\"Mean:\", np.mean(data))\n",
|
| 164 |
+
"print(\"Std:\", np.std(data))\n",
|
| 165 |
+
"print(\"Sum:\", np.sum(data))\n",
|
| 166 |
+
"```\n",
|
| 167 |
+
"</details>"
|
| 168 |
+
]
|
| 169 |
+
},
|
| 170 |
+
{
|
| 171 |
+
"cell_type": "markdown",
|
| 172 |
+
"metadata": {},
|
| 173 |
+
"source": [
|
| 174 |
+
"--- \n",
|
| 175 |
+
"### Great NumPy Practice! \n",
|
| 176 |
+
"NumPy is the engine behind Pandas and Scikit-Learn. Mastering it makes everything else easier.\n",
|
| 177 |
+
"Next: **Pandas Practice**."
|
| 178 |
+
]
|
| 179 |
+
}
|
| 180 |
+
],
|
| 181 |
+
"metadata": {
|
| 182 |
+
"kernelspec": {
|
| 183 |
+
"display_name": "Python 3",
|
| 184 |
+
"language": "python",
|
| 185 |
+
"name": "python3"
|
| 186 |
+
},
|
| 187 |
+
"language_info": {
|
| 188 |
+
"codemirror_mode": {
|
| 189 |
+
"name": "ipython",
|
| 190 |
+
"version": 3
|
| 191 |
+
},
|
| 192 |
+
"file_extension": ".py",
|
| 193 |
+
"mimetype": "text/x-python",
|
| 194 |
+
"name": "python",
|
| 195 |
+
"nbconvert_exporter": "python",
|
| 196 |
+
"pygments_lexer": "ipython3",
|
| 197 |
+
"version": "3.12.7"
|
| 198 |
+
}
|
| 199 |
+
},
|
| 200 |
+
"nbformat": 4,
|
| 201 |
+
"nbformat_minor": 4
|
| 202 |
+
}
|
ML/04_Pandas_Practice.ipynb
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Python Library Practice: Pandas\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Pandas is the primary tool for data manipulation and analysis in Python. It provides data structures like `DataFrame` and `Series` that make working with tabular data easy.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/)** on your hub for data cleaning and transformation concepts using Pandas.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **DataFrame Creation**: Building dataframes from dictionaries.\n",
|
| 16 |
+
"2. **Selection & Filtering**: Querying data.\n",
|
| 17 |
+
"3. **Grouping & Aggregation**: Summarizing data.\n",
|
| 18 |
+
"4. **Handling Missing Data**: Methods to clean datasets.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. DataFrame Basics\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"### Task 1: Create a DataFrame\n",
|
| 30 |
+
"Create a DataFrame from a dictionary with columns: `Name`, `Age`, and `City` for 5 people."
|
| 31 |
+
]
|
| 32 |
+
},
|
| 33 |
+
{
|
| 34 |
+
"cell_type": "code",
|
| 35 |
+
"execution_count": null,
|
| 36 |
+
"metadata": {},
|
| 37 |
+
"outputs": [],
|
| 38 |
+
"source": [
|
| 39 |
+
"import pandas as pd\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"# YOUR CODE HERE\n"
|
| 42 |
+
]
|
| 43 |
+
},
|
| 44 |
+
{
|
| 45 |
+
"cell_type": "markdown",
|
| 46 |
+
"metadata": {},
|
| 47 |
+
"source": [
|
| 48 |
+
"<details>\n",
|
| 49 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"```python\n",
|
| 52 |
+
"data = {\n",
|
| 53 |
+
" 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],\n",
|
| 54 |
+
" 'Age': [24, 30, 22, 35, 29],\n",
|
| 55 |
+
" 'City': ['NY', 'LA', 'Chicago', 'Houston', 'Miami']\n",
|
| 56 |
+
"}\n",
|
| 57 |
+
"df = pd.DataFrame(data)\n",
|
| 58 |
+
"print(df)\n",
|
| 59 |
+
"```\n",
|
| 60 |
+
"</details>"
|
| 61 |
+
]
|
| 62 |
+
},
|
| 63 |
+
{
|
| 64 |
+
"cell_type": "markdown",
|
| 65 |
+
"metadata": {},
|
| 66 |
+
"source": [
|
| 67 |
+
"## 2. Selection and Filtering\n",
|
| 68 |
+
"\n",
|
| 69 |
+
"### Task 2: Conditional Selection\n",
|
| 70 |
+
"Using the DataFrame from Task 1, select all rows where `Age` is greater than 25."
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "code",
|
| 75 |
+
"execution_count": null,
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"outputs": [],
|
| 78 |
+
"source": [
|
| 79 |
+
"# YOUR CODE HERE\n"
|
| 80 |
+
]
|
| 81 |
+
},
|
| 82 |
+
{
|
| 83 |
+
"cell_type": "markdown",
|
| 84 |
+
"metadata": {},
|
| 85 |
+
"source": [
|
| 86 |
+
"<details>\n",
|
| 87 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"```python\n",
|
| 90 |
+
"filtered_df = df[df['Age'] > 25]\n",
|
| 91 |
+
"print(filtered_df)\n",
|
| 92 |
+
"```\n",
|
| 93 |
+
"</details>"
|
| 94 |
+
]
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"cell_type": "markdown",
|
| 98 |
+
"metadata": {},
|
| 99 |
+
"source": [
|
| 100 |
+
"## 3. GroupBy and Aggregation\n",
|
| 101 |
+
"\n",
|
| 102 |
+
"### Task 3: Grouping Data\n",
|
| 103 |
+
"Create a DataFrame with `Category` and `Sales`. Group by `Category` and calculate the average `Sales`."
|
| 104 |
+
]
|
| 105 |
+
},
|
| 106 |
+
{
|
| 107 |
+
"cell_type": "code",
|
| 108 |
+
"execution_count": null,
|
| 109 |
+
"metadata": {},
|
| 110 |
+
"outputs": [],
|
| 111 |
+
"source": [
|
| 112 |
+
"sales_data = {\n",
|
| 113 |
+
" 'Category': ['Electronics', 'Clothing', 'Electronics', 'Home', 'Clothing', 'Home'],\n",
|
| 114 |
+
" 'Sales': [100, 50, 200, 300, 40, 150]\n",
|
| 115 |
+
"}\n",
|
| 116 |
+
"sales_df = pd.DataFrame(sales_data)\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"# YOUR CODE HERE\n"
|
| 119 |
+
]
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"cell_type": "markdown",
|
| 123 |
+
"metadata": {},
|
| 124 |
+
"source": [
|
| 125 |
+
"<details>\n",
|
| 126 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 127 |
+
"\n",
|
| 128 |
+
"```python\n",
|
| 129 |
+
"result = sales_df.groupby('Category').mean()\n",
|
| 130 |
+
"print(result)\n",
|
| 131 |
+
"```\n",
|
| 132 |
+
"</details>"
|
| 133 |
+
]
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"cell_type": "markdown",
|
| 137 |
+
"metadata": {},
|
| 138 |
+
"source": [
|
| 139 |
+
"## 4. Merging and Joining\n",
|
| 140 |
+
"\n",
|
| 141 |
+
"### Task 4: Merge DataFrames\n",
|
| 142 |
+
"Merge two DataFrames on a common `ID` column."
|
| 143 |
+
]
|
| 144 |
+
},
|
| 145 |
+
{
|
| 146 |
+
"cell_type": "code",
|
| 147 |
+
"execution_count": null,
|
| 148 |
+
"metadata": {},
|
| 149 |
+
"outputs": [],
|
| 150 |
+
"source": [
|
| 151 |
+
"df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})\n",
|
| 152 |
+
"df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"# YOUR CODE HERE\n"
|
| 155 |
+
]
|
| 156 |
+
},
|
| 157 |
+
{
|
| 158 |
+
"cell_type": "markdown",
|
| 159 |
+
"metadata": {},
|
| 160 |
+
"source": [
|
| 161 |
+
"<details>\n",
|
| 162 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 163 |
+
"\n",
|
| 164 |
+
"```python\n",
|
| 165 |
+
"merged = pd.merge(df1, df2, on='ID', how='inner')\n",
|
| 166 |
+
"print(merged)\n",
|
| 167 |
+
"```\n",
|
| 168 |
+
"</details>"
|
| 169 |
+
]
|
| 170 |
+
},
|
| 171 |
+
{
|
| 172 |
+
"cell_type": "markdown",
|
| 173 |
+
"metadata": {},
|
| 174 |
+
"source": [
|
| 175 |
+
"--- \n",
|
| 176 |
+
"### Excellent Pandas Practice! \n",
|
| 177 |
+
"You're becoming a data manipulator pro.\n",
|
| 178 |
+
"Next: **Matplotlib & Seaborn Practice**."
|
| 179 |
+
]
|
| 180 |
+
}
|
| 181 |
+
],
|
| 182 |
+
"metadata": {
|
| 183 |
+
"kernelspec": {
|
| 184 |
+
"display_name": "Python 3",
|
| 185 |
+
"language": "python",
|
| 186 |
+
"name": "python3"
|
| 187 |
+
},
|
| 188 |
+
"language_info": {
|
| 189 |
+
"codemirror_mode": {
|
| 190 |
+
"name": "ipython",
|
| 191 |
+
"version": 3
|
| 192 |
+
},
|
| 193 |
+
"file_extension": ".py",
|
| 194 |
+
"mimetype": "text/x-python",
|
| 195 |
+
"name": "python",
|
| 196 |
+
"nbconvert_exporter": "python",
|
| 197 |
+
"pygments_lexer": "ipython3",
|
| 198 |
+
"version": "3.12.7"
|
| 199 |
+
}
|
| 200 |
+
},
|
| 201 |
+
"nbformat": 4,
|
| 202 |
+
"nbformat_minor": 4
|
| 203 |
+
}
|
ML/05_Matplotlib_Seaborn_Practice.ipynb
ADDED
|
@@ -0,0 +1,210 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Python Library Practice: Matplotlib & Seaborn\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Data visualization is the key to understanding complex datasets. Matplotlib provides the low-level building blocks, while Seaborn offers beautiful high-level statistical plots.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/)** section on your hub for examples of interactive charts and best practices.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Line & Scatter Plots**: Basic time series and correlation visuals.\n",
|
| 16 |
+
"2. **Distribution Plots**: Histograms and Box plots.\n",
|
| 17 |
+
"3. **Categorical Plots**: Bar charts and Count plots.\n",
|
| 18 |
+
"4. **Customization**: Adding titles, labels, and styles.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. Line and Scatter Plots\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"### Task 1: Basic Line Plot\n",
|
| 30 |
+
"Plot the function $y = x^2$ for $x$ values between -10 and 10."
|
| 31 |
+
]
|
| 32 |
+
},
|
| 33 |
+
{
|
| 34 |
+
"cell_type": "code",
|
| 35 |
+
"execution_count": null,
|
| 36 |
+
"metadata": {},
|
| 37 |
+
"outputs": [],
|
| 38 |
+
"source": [
|
| 39 |
+
"import matplotlib.pyplot as plt\n",
|
| 40 |
+
"import numpy as np\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# YOUR CODE HERE\n"
|
| 43 |
+
]
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"cell_type": "markdown",
|
| 47 |
+
"metadata": {},
|
| 48 |
+
"source": [
|
| 49 |
+
"<details>\n",
|
| 50 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 51 |
+
"\n",
|
| 52 |
+
"```python\n",
|
| 53 |
+
"x = np.linspace(-10, 10, 100)\n",
|
| 54 |
+
"y = x**2\n",
|
| 55 |
+
"plt.plot(x, y)\n",
|
| 56 |
+
"plt.title(\"Plot of $y=x^2$\")\n",
|
| 57 |
+
"plt.xlabel(\"x\")\n",
|
| 58 |
+
"plt.ylabel(\"y\")\n",
|
| 59 |
+
"plt.show()\n",
|
| 60 |
+
"```\n",
|
| 61 |
+
"</details>"
|
| 62 |
+
]
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "markdown",
|
| 66 |
+
"metadata": {},
|
| 67 |
+
"source": [
|
| 68 |
+
"## 2. Statistical Distributions\n",
|
| 69 |
+
"\n",
|
| 70 |
+
"### Task 2: Histogram and BoxPlot\n",
|
| 71 |
+
"Generate 500 random points from a normal distribution and plot their histogram and boxplot side-by-side using Seaborn."
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "code",
|
| 76 |
+
"execution_count": null,
|
| 77 |
+
"metadata": {},
|
| 78 |
+
"outputs": [],
|
| 79 |
+
"source": [
|
| 80 |
+
"import seaborn as sns\n",
|
| 81 |
+
"data = np.random.normal(0, 1, 500)\n",
|
| 82 |
+
"\n",
|
| 83 |
+
"# YOUR CODE HERE\n"
|
| 84 |
+
]
|
| 85 |
+
},
|
| 86 |
+
{
|
| 87 |
+
"cell_type": "markdown",
|
| 88 |
+
"metadata": {},
|
| 89 |
+
"source": [
|
| 90 |
+
"<details>\n",
|
| 91 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 92 |
+
"\n",
|
| 93 |
+
"```python\n",
|
| 94 |
+
"plt.figure(figsize=(12, 5))\n",
|
| 95 |
+
"plt.subplot(1, 2, 1)\n",
|
| 96 |
+
"sns.histplot(data, kde=True)\n",
|
| 97 |
+
"plt.title(\"Histogram\")\n",
|
| 98 |
+
"\n",
|
| 99 |
+
"plt.subplot(1, 2, 2)\n",
|
| 100 |
+
"sns.boxplot(y=data)\n",
|
| 101 |
+
"plt.title(\"Boxplot\")\n",
|
| 102 |
+
"plt.show()\n",
|
| 103 |
+
"```\n",
|
| 104 |
+
"</details>"
|
| 105 |
+
]
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"cell_type": "markdown",
|
| 109 |
+
"metadata": {},
|
| 110 |
+
"source": [
|
| 111 |
+
"## 3. Categorical Data Visuals\n",
|
| 112 |
+
"\n",
|
| 113 |
+
"### Task 3: Bar Chart\n",
|
| 114 |
+
"Using the `tips` dataset from Seaborn, plot the average total bill for each day of the week."
|
| 115 |
+
]
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"cell_type": "code",
|
| 119 |
+
"execution_count": null,
|
| 120 |
+
"metadata": {},
|
| 121 |
+
"outputs": [],
|
| 122 |
+
"source": [
|
| 123 |
+
"tips = sns.load_dataset('tips')\n",
|
| 124 |
+
"\n",
|
| 125 |
+
"# YOUR CODE HERE\n"
|
| 126 |
+
]
|
| 127 |
+
},
|
| 128 |
+
{
|
| 129 |
+
"cell_type": "markdown",
|
| 130 |
+
"metadata": {},
|
| 131 |
+
"source": [
|
| 132 |
+
"<details>\n",
|
| 133 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"```python\n",
|
| 136 |
+
"sns.barplot(x='day', y='total_bill', data=tips)\n",
|
| 137 |
+
"plt.title(\"Average Total Bill by Day\")\n",
|
| 138 |
+
"plt.show()\n",
|
| 139 |
+
"```\n",
|
| 140 |
+
"</details>"
|
| 141 |
+
]
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"cell_type": "markdown",
|
| 145 |
+
"metadata": {},
|
| 146 |
+
"source": [
|
| 147 |
+
"## 4. Relationship Exploration\n",
|
| 148 |
+
"\n",
|
| 149 |
+
"### Task 4: Pair Plot\n",
|
| 150 |
+
"Plot pairwise relationships in the `iris` dataset, colored by species."
|
| 151 |
+
]
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"cell_type": "code",
|
| 155 |
+
"execution_count": null,
|
| 156 |
+
"metadata": {},
|
| 157 |
+
"outputs": [],
|
| 158 |
+
"source": [
|
| 159 |
+
"iris = sns.load_dataset('iris')\n",
|
| 160 |
+
"\n",
|
| 161 |
+
"# YOUR CODE HERE\n"
|
| 162 |
+
]
|
| 163 |
+
},
|
| 164 |
+
{
|
| 165 |
+
"cell_type": "markdown",
|
| 166 |
+
"metadata": {},
|
| 167 |
+
"source": [
|
| 168 |
+
"<details>\n",
|
| 169 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 170 |
+
"\n",
|
| 171 |
+
"```python\n",
|
| 172 |
+
"sns.pairplot(iris, hue='species')\n",
|
| 173 |
+
"plt.show()\n",
|
| 174 |
+
"```\n",
|
| 175 |
+
"</details>"
|
| 176 |
+
]
|
| 177 |
+
},
|
| 178 |
+
{
|
| 179 |
+
"cell_type": "markdown",
|
| 180 |
+
"metadata": {},
|
| 181 |
+
"source": [
|
| 182 |
+
"--- \n",
|
| 183 |
+
"### Great Visualization Practice! \n",
|
| 184 |
+
"A picture is worth a thousand rows. \n",
|
| 185 |
+
"Next: **Scikit-Learn practice**."
|
| 186 |
+
]
|
| 187 |
+
}
|
| 188 |
+
],
|
| 189 |
+
"metadata": {
|
| 190 |
+
"kernelspec": {
|
| 191 |
+
"display_name": "Python 3",
|
| 192 |
+
"language": "python",
|
| 193 |
+
"name": "python3"
|
| 194 |
+
},
|
| 195 |
+
"language_info": {
|
| 196 |
+
"codemirror_mode": {
|
| 197 |
+
"name": "ipython",
|
| 198 |
+
"version": 3
|
| 199 |
+
},
|
| 200 |
+
"file_extension": ".py",
|
| 201 |
+
"mimetype": "text/x-python",
|
| 202 |
+
"name": "python",
|
| 203 |
+
"nbconvert_exporter": "python",
|
| 204 |
+
"pygments_lexer": "ipython3",
|
| 205 |
+
"version": "3.12.7"
|
| 206 |
+
}
|
| 207 |
+
},
|
| 208 |
+
"nbformat": 4,
|
| 209 |
+
"nbformat_minor": 4
|
| 210 |
+
}
|
ML/06_EDA_and_Feature_Engineering.ipynb
ADDED
|
@@ -0,0 +1,449 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 01 - EDA & Feature Engineering\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to the first module of your Machine Learning practice! \n",
|
| 10 |
+
"\n",
|
| 11 |
+
"In this notebook, we will focus on the most critical part of the ML pipeline: **Understanding and Preparing your data.**\n",
|
| 12 |
+
"\n",
|
| 13 |
+
"### Resources:\n",
|
| 14 |
+
"This practice guide is integrated with your [DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/). Specifically, you can refer to the **Feature Engineering Guide** section on the website for interactive visual explanations of these concepts.\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"### Objectives:\n",
|
| 17 |
+
"1. **EDA**: Visualize distributions, correlations, and outliers.\n",
|
| 18 |
+
"2. **Data Cleaning**: Handle missing values and data inconsistencies.\n",
|
| 19 |
+
"3. **Feature Engineering**: Create new features and transform existing ones (Encoding, Scaling).\n",
|
| 20 |
+
"\n",
|
| 21 |
+
"---"
|
| 22 |
+
]
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"cell_type": "markdown",
|
| 26 |
+
"metadata": {},
|
| 27 |
+
"source": [
|
| 28 |
+
"## 1. Environment Setup\n",
|
| 29 |
+
"First, let's load the necessary libraries and the dataset. We'll use the **Titanic Dataset** for this exercise."
|
| 30 |
+
]
|
| 31 |
+
},
|
| 32 |
+
{
|
| 33 |
+
"cell_type": "code",
|
| 34 |
+
"execution_count": 1,
|
| 35 |
+
"metadata": {},
|
| 36 |
+
"outputs": [
|
| 37 |
+
{
|
| 38 |
+
"name": "stdout",
|
| 39 |
+
"output_type": "stream",
|
| 40 |
+
"text": [
|
| 41 |
+
"Dataset Shape: (891, 15)\n"
|
| 42 |
+
]
|
| 43 |
+
},
|
| 44 |
+
{
|
| 45 |
+
"data": {
|
| 46 |
+
"text/html": [
|
| 47 |
+
"<div>\n",
|
| 48 |
+
"<style scoped>\n",
|
| 49 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 50 |
+
" vertical-align: middle;\n",
|
| 51 |
+
" }\n",
|
| 52 |
+
"\n",
|
| 53 |
+
" .dataframe tbody tr th {\n",
|
| 54 |
+
" vertical-align: top;\n",
|
| 55 |
+
" }\n",
|
| 56 |
+
"\n",
|
| 57 |
+
" .dataframe thead th {\n",
|
| 58 |
+
" text-align: right;\n",
|
| 59 |
+
" }\n",
|
| 60 |
+
"</style>\n",
|
| 61 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 62 |
+
" <thead>\n",
|
| 63 |
+
" <tr style=\"text-align: right;\">\n",
|
| 64 |
+
" <th></th>\n",
|
| 65 |
+
" <th>survived</th>\n",
|
| 66 |
+
" <th>pclass</th>\n",
|
| 67 |
+
" <th>sex</th>\n",
|
| 68 |
+
" <th>age</th>\n",
|
| 69 |
+
" <th>sibsp</th>\n",
|
| 70 |
+
" <th>parch</th>\n",
|
| 71 |
+
" <th>fare</th>\n",
|
| 72 |
+
" <th>embarked</th>\n",
|
| 73 |
+
" <th>class</th>\n",
|
| 74 |
+
" <th>who</th>\n",
|
| 75 |
+
" <th>adult_male</th>\n",
|
| 76 |
+
" <th>deck</th>\n",
|
| 77 |
+
" <th>embark_town</th>\n",
|
| 78 |
+
" <th>alive</th>\n",
|
| 79 |
+
" <th>alone</th>\n",
|
| 80 |
+
" </tr>\n",
|
| 81 |
+
" </thead>\n",
|
| 82 |
+
" <tbody>\n",
|
| 83 |
+
" <tr>\n",
|
| 84 |
+
" <th>0</th>\n",
|
| 85 |
+
" <td>0</td>\n",
|
| 86 |
+
" <td>3</td>\n",
|
| 87 |
+
" <td>male</td>\n",
|
| 88 |
+
" <td>22.0</td>\n",
|
| 89 |
+
" <td>1</td>\n",
|
| 90 |
+
" <td>0</td>\n",
|
| 91 |
+
" <td>7.2500</td>\n",
|
| 92 |
+
" <td>S</td>\n",
|
| 93 |
+
" <td>Third</td>\n",
|
| 94 |
+
" <td>man</td>\n",
|
| 95 |
+
" <td>True</td>\n",
|
| 96 |
+
" <td>NaN</td>\n",
|
| 97 |
+
" <td>Southampton</td>\n",
|
| 98 |
+
" <td>no</td>\n",
|
| 99 |
+
" <td>False</td>\n",
|
| 100 |
+
" </tr>\n",
|
| 101 |
+
" <tr>\n",
|
| 102 |
+
" <th>1</th>\n",
|
| 103 |
+
" <td>1</td>\n",
|
| 104 |
+
" <td>1</td>\n",
|
| 105 |
+
" <td>female</td>\n",
|
| 106 |
+
" <td>38.0</td>\n",
|
| 107 |
+
" <td>1</td>\n",
|
| 108 |
+
" <td>0</td>\n",
|
| 109 |
+
" <td>71.2833</td>\n",
|
| 110 |
+
" <td>C</td>\n",
|
| 111 |
+
" <td>First</td>\n",
|
| 112 |
+
" <td>woman</td>\n",
|
| 113 |
+
" <td>False</td>\n",
|
| 114 |
+
" <td>C</td>\n",
|
| 115 |
+
" <td>Cherbourg</td>\n",
|
| 116 |
+
" <td>yes</td>\n",
|
| 117 |
+
" <td>False</td>\n",
|
| 118 |
+
" </tr>\n",
|
| 119 |
+
" <tr>\n",
|
| 120 |
+
" <th>2</th>\n",
|
| 121 |
+
" <td>1</td>\n",
|
| 122 |
+
" <td>3</td>\n",
|
| 123 |
+
" <td>female</td>\n",
|
| 124 |
+
" <td>26.0</td>\n",
|
| 125 |
+
" <td>0</td>\n",
|
| 126 |
+
" <td>0</td>\n",
|
| 127 |
+
" <td>7.9250</td>\n",
|
| 128 |
+
" <td>S</td>\n",
|
| 129 |
+
" <td>Third</td>\n",
|
| 130 |
+
" <td>woman</td>\n",
|
| 131 |
+
" <td>False</td>\n",
|
| 132 |
+
" <td>NaN</td>\n",
|
| 133 |
+
" <td>Southampton</td>\n",
|
| 134 |
+
" <td>yes</td>\n",
|
| 135 |
+
" <td>True</td>\n",
|
| 136 |
+
" </tr>\n",
|
| 137 |
+
" <tr>\n",
|
| 138 |
+
" <th>3</th>\n",
|
| 139 |
+
" <td>1</td>\n",
|
| 140 |
+
" <td>1</td>\n",
|
| 141 |
+
" <td>female</td>\n",
|
| 142 |
+
" <td>35.0</td>\n",
|
| 143 |
+
" <td>1</td>\n",
|
| 144 |
+
" <td>0</td>\n",
|
| 145 |
+
" <td>53.1000</td>\n",
|
| 146 |
+
" <td>S</td>\n",
|
| 147 |
+
" <td>First</td>\n",
|
| 148 |
+
" <td>woman</td>\n",
|
| 149 |
+
" <td>False</td>\n",
|
| 150 |
+
" <td>C</td>\n",
|
| 151 |
+
" <td>Southampton</td>\n",
|
| 152 |
+
" <td>yes</td>\n",
|
| 153 |
+
" <td>False</td>\n",
|
| 154 |
+
" </tr>\n",
|
| 155 |
+
" <tr>\n",
|
| 156 |
+
" <th>4</th>\n",
|
| 157 |
+
" <td>0</td>\n",
|
| 158 |
+
" <td>3</td>\n",
|
| 159 |
+
" <td>male</td>\n",
|
| 160 |
+
" <td>35.0</td>\n",
|
| 161 |
+
" <td>0</td>\n",
|
| 162 |
+
" <td>0</td>\n",
|
| 163 |
+
" <td>8.0500</td>\n",
|
| 164 |
+
" <td>S</td>\n",
|
| 165 |
+
" <td>Third</td>\n",
|
| 166 |
+
" <td>man</td>\n",
|
| 167 |
+
" <td>True</td>\n",
|
| 168 |
+
" <td>NaN</td>\n",
|
| 169 |
+
" <td>Southampton</td>\n",
|
| 170 |
+
" <td>no</td>\n",
|
| 171 |
+
" <td>True</td>\n",
|
| 172 |
+
" </tr>\n",
|
| 173 |
+
" </tbody>\n",
|
| 174 |
+
"</table>\n",
|
| 175 |
+
"</div>"
|
| 176 |
+
],
|
| 177 |
+
"text/plain": [
|
| 178 |
+
" survived pclass sex age sibsp parch fare embarked class \\\n",
|
| 179 |
+
"0 0 3 male 22.0 1 0 7.2500 S Third \n",
|
| 180 |
+
"1 1 1 female 38.0 1 0 71.2833 C First \n",
|
| 181 |
+
"2 1 3 female 26.0 0 0 7.9250 S Third \n",
|
| 182 |
+
"3 1 1 female 35.0 1 0 53.1000 S First \n",
|
| 183 |
+
"4 0 3 male 35.0 0 0 8.0500 S Third \n",
|
| 184 |
+
"\n",
|
| 185 |
+
" who adult_male deck embark_town alive alone \n",
|
| 186 |
+
"0 man True NaN Southampton no False \n",
|
| 187 |
+
"1 woman False C Cherbourg yes False \n",
|
| 188 |
+
"2 woman False NaN Southampton yes True \n",
|
| 189 |
+
"3 woman False C Southampton yes False \n",
|
| 190 |
+
"4 man True NaN Southampton no True "
|
| 191 |
+
]
|
| 192 |
+
},
|
| 193 |
+
"execution_count": 1,
|
| 194 |
+
"metadata": {},
|
| 195 |
+
"output_type": "execute_result"
|
| 196 |
+
}
|
| 197 |
+
],
|
| 198 |
+
"source": [
|
| 199 |
+
"import pandas as pd\n",
|
| 200 |
+
"import numpy as np\n",
|
| 201 |
+
"import matplotlib.pyplot as plt\n",
|
| 202 |
+
"import seaborn as sns\n",
|
| 203 |
+
"\n",
|
| 204 |
+
"# Load dataset\n",
|
| 205 |
+
"df = sns.load_dataset('titanic')\n",
|
| 206 |
+
"print(\"Dataset Shape:\", df.shape)\n",
|
| 207 |
+
"df.head()"
|
| 208 |
+
]
|
| 209 |
+
},
|
| 210 |
+
{
|
| 211 |
+
"cell_type": "markdown",
|
| 212 |
+
"metadata": {},
|
| 213 |
+
"source": [
|
| 214 |
+
"## 2. Part 1: Exploratory Data Analysis (EDA)\n",
|
| 215 |
+
"\n",
|
| 216 |
+
"### Task 1: Basic Statistics and Info\n",
|
| 217 |
+
"Check the data types, non-null counts, and summary statistics."
|
| 218 |
+
]
|
| 219 |
+
},
|
| 220 |
+
{
|
| 221 |
+
"cell_type": "code",
|
| 222 |
+
"execution_count": null,
|
| 223 |
+
"metadata": {},
|
| 224 |
+
"outputs": [],
|
| 225 |
+
"source": [
|
| 226 |
+
"# YOUR CODE HERE\n"
|
| 227 |
+
]
|
| 228 |
+
},
|
| 229 |
+
{
|
| 230 |
+
"cell_type": "markdown",
|
| 231 |
+
"metadata": {},
|
| 232 |
+
"source": [
|
| 233 |
+
"<details>\n",
|
| 234 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 235 |
+
"\n",
|
| 236 |
+
"```python\n",
|
| 237 |
+
"print(df.info())\n",
|
| 238 |
+
"print(df.describe())\n",
|
| 239 |
+
"```\n",
|
| 240 |
+
"</details>"
|
| 241 |
+
]
|
| 242 |
+
},
|
| 243 |
+
{
|
| 244 |
+
"cell_type": "markdown",
|
| 245 |
+
"metadata": {},
|
| 246 |
+
"source": [
|
| 247 |
+
"### Task 2: Missing Value Analysis\n",
|
| 248 |
+
"Find the percentage of missing values in each column."
|
| 249 |
+
]
|
| 250 |
+
},
|
| 251 |
+
{
|
| 252 |
+
"cell_type": "code",
|
| 253 |
+
"execution_count": null,
|
| 254 |
+
"metadata": {},
|
| 255 |
+
"outputs": [],
|
| 256 |
+
"source": [
|
| 257 |
+
"# YOUR CODE HERE\n"
|
| 258 |
+
]
|
| 259 |
+
},
|
| 260 |
+
{
|
| 261 |
+
"cell_type": "markdown",
|
| 262 |
+
"metadata": {},
|
| 263 |
+
"source": [
|
| 264 |
+
"<details>\n",
|
| 265 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 266 |
+
"\n",
|
| 267 |
+
"```python\n",
|
| 268 |
+
"missing_pct = (df.isnull().sum() / len(df)) * 100\n",
|
| 269 |
+
"print(missing_pct)\n",
|
| 270 |
+
"```\n",
|
| 271 |
+
"</details>"
|
| 272 |
+
]
|
| 273 |
+
},
|
| 274 |
+
{
|
| 275 |
+
"cell_type": "markdown",
|
| 276 |
+
"metadata": {},
|
| 277 |
+
"source": [
|
| 278 |
+
"### Task 3: Visualizing Distributions\n",
|
| 279 |
+
"Plot the distribution of `age` and the count of `survived`."
|
| 280 |
+
]
|
| 281 |
+
},
|
| 282 |
+
{
|
| 283 |
+
"cell_type": "code",
|
| 284 |
+
"execution_count": null,
|
| 285 |
+
"metadata": {},
|
| 286 |
+
"outputs": [],
|
| 287 |
+
"source": [
|
| 288 |
+
"# YOUR CODE HERE\n"
|
| 289 |
+
]
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"cell_type": "markdown",
|
| 293 |
+
"metadata": {},
|
| 294 |
+
"source": [
|
| 295 |
+
"<details>\n",
|
| 296 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 297 |
+
"\n",
|
| 298 |
+
"```python\n",
|
| 299 |
+
"plt.figure(figsize=(12, 5))\n",
|
| 300 |
+
"plt.subplot(1, 2, 1)\n",
|
| 301 |
+
"sns.histplot(df['age'].dropna(), kde=True)\n",
|
| 302 |
+
"plt.title('Age Distribution')\n",
|
| 303 |
+
"\n",
|
| 304 |
+
"plt.subplot(1, 2, 2)\n",
|
| 305 |
+
"sns.countplot(x='survived', data=df)\n",
|
| 306 |
+
"plt.title('Survival Count')\n",
|
| 307 |
+
"plt.show()\n",
|
| 308 |
+
"```\n",
|
| 309 |
+
"</details>"
|
| 310 |
+
]
|
| 311 |
+
},
|
| 312 |
+
{
|
| 313 |
+
"cell_type": "markdown",
|
| 314 |
+
"metadata": {},
|
| 315 |
+
"source": [
|
| 316 |
+
"## 3. Part 2: Data Cleaning\n",
|
| 317 |
+
"\n",
|
| 318 |
+
"### Task 4: Handling Missing Values\n",
|
| 319 |
+
"1. Fill missing `age` values with the median.\n",
|
| 320 |
+
"2. Fill missing `embarked` values with the mode.\n",
|
| 321 |
+
"3. Drop the `deck` column as it has too many missing values.\n",
|
| 322 |
+
"\n",
|
| 323 |
+
"*Hint: Visit the [Feature Engineering Guide - Missing Data](https://aashishgarg13.github.io/DataScience/feature-engineering/#missing-data) to see visual differences between Mean, Median, and KNN imputation.*"
|
| 324 |
+
]
|
| 325 |
+
},
|
| 326 |
+
{
|
| 327 |
+
"cell_type": "code",
|
| 328 |
+
"execution_count": null,
|
| 329 |
+
"metadata": {},
|
| 330 |
+
"outputs": [],
|
| 331 |
+
"source": [
|
| 332 |
+
"# YOUR CODE HERE\n"
|
| 333 |
+
]
|
| 334 |
+
},
|
| 335 |
+
{
|
| 336 |
+
"cell_type": "markdown",
|
| 337 |
+
"metadata": {},
|
| 338 |
+
"source": [
|
| 339 |
+
"<details>\n",
|
| 340 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 341 |
+
"\n",
|
| 342 |
+
"```python\n",
|
| 343 |
+
"df['age'] = df['age'].fillna(df['age'].median())\n",
|
| 344 |
+
"df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])\n",
|
| 345 |
+
"df.drop('deck', axis=1, inplace=True)\n",
|
| 346 |
+
"print(\"Missing values after cleaning:\\n\", df.isnull().sum())\n",
|
| 347 |
+
"```\n",
|
| 348 |
+
"</details>"
|
| 349 |
+
]
|
| 350 |
+
},
|
| 351 |
+
{
|
| 352 |
+
"cell_type": "markdown",
|
| 353 |
+
"metadata": {},
|
| 354 |
+
"source": [
|
| 355 |
+
"## 4. Part 3: Feature Engineering\n",
|
| 356 |
+
"\n",
|
| 357 |
+
"### Task 5: Creating New Features\n",
|
| 358 |
+
"Create a new column `family_size` by adding `sibsp` and `parch` (plus 1 for the passenger themselves)."
|
| 359 |
+
]
|
| 360 |
+
},
|
| 361 |
+
{
|
| 362 |
+
"cell_type": "code",
|
| 363 |
+
"execution_count": null,
|
| 364 |
+
"metadata": {},
|
| 365 |
+
"outputs": [],
|
| 366 |
+
"source": [
|
| 367 |
+
"# YOUR CODE HERE\n"
|
| 368 |
+
]
|
| 369 |
+
},
|
| 370 |
+
{
|
| 371 |
+
"cell_type": "markdown",
|
| 372 |
+
"metadata": {},
|
| 373 |
+
"source": [
|
| 374 |
+
"<details>\n",
|
| 375 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 376 |
+
"\n",
|
| 377 |
+
"```python\n",
|
| 378 |
+
"df['family_size'] = df['sibsp'] + df['parch'] + 1\n",
|
| 379 |
+
"df[['sibsp', 'parch', 'family_size']].head()\n",
|
| 380 |
+
"```\n",
|
| 381 |
+
"</details>"
|
| 382 |
+
]
|
| 383 |
+
},
|
| 384 |
+
{
|
| 385 |
+
"cell_type": "markdown",
|
| 386 |
+
"metadata": {},
|
| 387 |
+
"source": [
|
| 388 |
+
"### Task 6: Encoding Categorical Variables\n",
|
| 389 |
+
"Convert `sex` and `embarked` into numerical values using One-Hot Encoding.\n",
|
| 390 |
+
"\n",
|
| 391 |
+
"*Hint: Learn about Label vs One-Hot Encoding in the [Encoding Section](https://aashishgarg13.github.io/DataScience/feature-engineering/#encoding) of your learning hub.*"
|
| 392 |
+
]
|
| 393 |
+
},
|
| 394 |
+
{
|
| 395 |
+
"cell_type": "code",
|
| 396 |
+
"execution_count": null,
|
| 397 |
+
"metadata": {},
|
| 398 |
+
"outputs": [],
|
| 399 |
+
"source": [
|
| 400 |
+
"# YOUR CODE HERE\n"
|
| 401 |
+
]
|
| 402 |
+
},
|
| 403 |
+
{
|
| 404 |
+
"cell_type": "markdown",
|
| 405 |
+
"metadata": {},
|
| 406 |
+
"source": [
|
| 407 |
+
"<details>\n",
|
| 408 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 409 |
+
"\n",
|
| 410 |
+
"```python\n",
|
| 411 |
+
"df = pd.get_dummies(df, columns=['sex', 'embarked'], drop_first=True)\n",
|
| 412 |
+
"df.head()\n",
|
| 413 |
+
"```\n",
|
| 414 |
+
"</details>"
|
| 415 |
+
]
|
| 416 |
+
},
|
| 417 |
+
{
|
| 418 |
+
"cell_type": "markdown",
|
| 419 |
+
"metadata": {},
|
| 420 |
+
"source": [
|
| 421 |
+
"--- \n",
|
| 422 |
+
"### Great Job! \n",
|
| 423 |
+
"You have completed the EDA and Feature Engineering module. \n",
|
| 424 |
+
"In the next module, we will apply **Linear Regression** to predict a continuous variable."
|
| 425 |
+
]
|
| 426 |
+
}
|
| 427 |
+
],
|
| 428 |
+
"metadata": {
|
| 429 |
+
"kernelspec": {
|
| 430 |
+
"display_name": "base",
|
| 431 |
+
"language": "python",
|
| 432 |
+
"name": "python3"
|
| 433 |
+
},
|
| 434 |
+
"language_info": {
|
| 435 |
+
"codemirror_mode": {
|
| 436 |
+
"name": "ipython",
|
| 437 |
+
"version": 3
|
| 438 |
+
},
|
| 439 |
+
"file_extension": ".py",
|
| 440 |
+
"mimetype": "text/x-python",
|
| 441 |
+
"name": "python",
|
| 442 |
+
"nbconvert_exporter": "python",
|
| 443 |
+
"pygments_lexer": "ipython3",
|
| 444 |
+
"version": "3.12.7"
|
| 445 |
+
}
|
| 446 |
+
},
|
| 447 |
+
"nbformat": 4,
|
| 448 |
+
"nbformat_minor": 4
|
| 449 |
+
}
|
ML/07_Scikit_Learn_Practice.ipynb
ADDED
|
@@ -0,0 +1,214 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Python Library Practice: Scikit-Learn (Utilities)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"While we've covered many algorithms, Scikit-Learn also provides vital utilities for data splitting, pipelines, and hyperparameter tuning.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Machine Learning Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for conceptual workflows of cross-validation and preprocessing.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Train-Test Split**: Dividing data for validation.\n",
|
| 16 |
+
"2. **Pipelines**: Chaining preprocessing and modeling.\n",
|
| 17 |
+
"3. **Cross-Validation**: Robust model evaluation.\n",
|
| 18 |
+
"4. **Grid Search**: Automated hyperparameter tuning.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. Data Splitting\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"### Task 1: Scaled Split\n",
|
| 30 |
+
"Using the provided data, split it into 70% train and 30% test, ensuring the split is reproducible."
|
| 31 |
+
]
|
| 32 |
+
},
|
| 33 |
+
{
|
| 34 |
+
"cell_type": "code",
|
| 35 |
+
"execution_count": null,
|
| 36 |
+
"metadata": {},
|
| 37 |
+
"outputs": [],
|
| 38 |
+
"source": [
|
| 39 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 40 |
+
"from sklearn.datasets import make_classification\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"X, y = make_classification(n_samples=1000, n_features=10, random_state=42)\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"# YOUR CODE HERE\n"
|
| 45 |
+
]
|
| 46 |
+
},
|
| 47 |
+
{
|
| 48 |
+
"cell_type": "markdown",
|
| 49 |
+
"metadata": {},
|
| 50 |
+
"source": [
|
| 51 |
+
"<details>\n",
|
| 52 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 53 |
+
"\n",
|
| 54 |
+
"```python\n",
|
| 55 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
|
| 56 |
+
"print(f\"Train size: {len(X_train)}, Test size: {len(X_test)}\")\n",
|
| 57 |
+
"```\n",
|
| 58 |
+
"</details>"
|
| 59 |
+
]
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"cell_type": "markdown",
|
| 63 |
+
"metadata": {},
|
| 64 |
+
"source": [
|
| 65 |
+
"## 2. Model Pipelines\n",
|
| 66 |
+
"\n",
|
| 67 |
+
"### Task 2: Create a Pipeline\n",
|
| 68 |
+
"Build a pipeline that combines `StandardScaler` and `LogisticRegression`."
|
| 69 |
+
]
|
| 70 |
+
},
|
| 71 |
+
{
|
| 72 |
+
"cell_type": "code",
|
| 73 |
+
"execution_count": null,
|
| 74 |
+
"metadata": {},
|
| 75 |
+
"outputs": [],
|
| 76 |
+
"source": [
|
| 77 |
+
"from sklearn.pipeline import Pipeline\n",
|
| 78 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 79 |
+
"from sklearn.linear_model import LogisticRegression\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"# YOUR CODE HERE\n"
|
| 82 |
+
]
|
| 83 |
+
},
|
| 84 |
+
{
|
| 85 |
+
"cell_type": "markdown",
|
| 86 |
+
"metadata": {},
|
| 87 |
+
"source": [
|
| 88 |
+
"<details>\n",
|
| 89 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 90 |
+
"\n",
|
| 91 |
+
"```python\n",
|
| 92 |
+
"pipeline = Pipeline([\n",
|
| 93 |
+
" ('scaler', StandardScaler()),\n",
|
| 94 |
+
" ('model', LogisticRegression())\n",
|
| 95 |
+
"])\n",
|
| 96 |
+
"pipeline.fit(X_train, y_train)\n",
|
| 97 |
+
"print(\"Model Score:\", pipeline.score(X_test, y_test))\n",
|
| 98 |
+
"```\n",
|
| 99 |
+
"</details>"
|
| 100 |
+
]
|
| 101 |
+
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "markdown",
|
| 104 |
+
"metadata": {},
|
| 105 |
+
"source": [
|
| 106 |
+
"## 3. Cross-Validation\n",
|
| 107 |
+
"\n",
|
| 108 |
+
"### Task 3: 5-Fold Evaluation\n",
|
| 109 |
+
"Evaluate a `RandomForestClassifier` using 5-fold cross-validation."
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "code",
|
| 114 |
+
"execution_count": null,
|
| 115 |
+
"metadata": {},
|
| 116 |
+
"outputs": [],
|
| 117 |
+
"source": [
|
| 118 |
+
"from sklearn.model_selection import cross_val_score\n",
|
| 119 |
+
"from sklearn.ensemble import RandomForestClassifier\n",
|
| 120 |
+
"\n",
|
| 121 |
+
"rf = RandomForestClassifier(n_estimators=100)\n",
|
| 122 |
+
"\n",
|
| 123 |
+
"# YOUR CODE HERE\n"
|
| 124 |
+
]
|
| 125 |
+
},
|
| 126 |
+
{
|
| 127 |
+
"cell_type": "markdown",
|
| 128 |
+
"metadata": {},
|
| 129 |
+
"source": [
|
| 130 |
+
"<details>\n",
|
| 131 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 132 |
+
"\n",
|
| 133 |
+
"```python\n",
|
| 134 |
+
"scores = cross_val_score(rf, X, y, cv=5)\n",
|
| 135 |
+
"print(\"Cross-validation scores:\", scores)\n",
|
| 136 |
+
"print(\"Mean accuracy:\", scores.mean())\n",
|
| 137 |
+
"```\n",
|
| 138 |
+
"</details>"
|
| 139 |
+
]
|
| 140 |
+
},
|
| 141 |
+
{
|
| 142 |
+
"cell_type": "markdown",
|
| 143 |
+
"metadata": {},
|
| 144 |
+
"source": [
|
| 145 |
+
"## 4. Hyperparameter Tuning\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"### Task 4: Grid Search\n",
|
| 148 |
+
"Use `GridSearchCV` to find the best `max_depth` (3, 5, 10, None) for a Decision Tree."
|
| 149 |
+
]
|
| 150 |
+
},
|
| 151 |
+
{
|
| 152 |
+
"cell_type": "code",
|
| 153 |
+
"execution_count": null,
|
| 154 |
+
"metadata": {},
|
| 155 |
+
"outputs": [],
|
| 156 |
+
"source": [
|
| 157 |
+
"from sklearn.model_selection import GridSearchCV\n",
|
| 158 |
+
"from sklearn.tree import DecisionTreeClassifier\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"dt = DecisionTreeClassifier()\n",
|
| 161 |
+
"params = {'max_depth': [3, 5, 10, None]}\n",
|
| 162 |
+
"\n",
|
| 163 |
+
"# YOUR CODE HERE\n"
|
| 164 |
+
]
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"cell_type": "markdown",
|
| 168 |
+
"metadata": {},
|
| 169 |
+
"source": [
|
| 170 |
+
"<details>\n",
|
| 171 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 172 |
+
"\n",
|
| 173 |
+
"```python\n",
|
| 174 |
+
"grid = GridSearchCV(dt, params, cv=5)\n",
|
| 175 |
+
"grid.fit(X, y)\n",
|
| 176 |
+
"print(\"Best parameters:\", grid.best_params_)\n",
|
| 177 |
+
"print(\"Best score:\", grid.best_score_)\n",
|
| 178 |
+
"```\n",
|
| 179 |
+
"</details>"
|
| 180 |
+
]
|
| 181 |
+
},
|
| 182 |
+
{
|
| 183 |
+
"cell_type": "markdown",
|
| 184 |
+
"metadata": {},
|
| 185 |
+
"source": [
|
| 186 |
+
"--- \n",
|
| 187 |
+
"### Excellent Utility Practice! \n",
|
| 188 |
+
"Using these tools ensures your ML experiments are robust and organized. \n",
|
| 189 |
+
"You have now covered all the core libraries!"
|
| 190 |
+
]
|
| 191 |
+
}
|
| 192 |
+
],
|
| 193 |
+
"metadata": {
|
| 194 |
+
"kernelspec": {
|
| 195 |
+
"display_name": "Python 3",
|
| 196 |
+
"language": "python",
|
| 197 |
+
"name": "python3"
|
| 198 |
+
},
|
| 199 |
+
"language_info": {
|
| 200 |
+
"codemirror_mode": {
|
| 201 |
+
"name": "ipython",
|
| 202 |
+
"version": 3
|
| 203 |
+
},
|
| 204 |
+
"file_extension": ".py",
|
| 205 |
+
"mimetype": "text/x-python",
|
| 206 |
+
"name": "python",
|
| 207 |
+
"nbconvert_exporter": "python",
|
| 208 |
+
"pygments_lexer": "ipython3",
|
| 209 |
+
"version": "3.12.7"
|
| 210 |
+
}
|
| 211 |
+
},
|
| 212 |
+
"nbformat": 4,
|
| 213 |
+
"nbformat_minor": 4
|
| 214 |
+
}
|
ML/08_Linear_Regression.ipynb
ADDED
|
@@ -0,0 +1,277 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 02 - Linear Regression\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"In this module, we will explore **Linear Regression**, one of the most fundamental algorithms in Machine Learning used for predicting continuous values.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Check out the [Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/) section on your hub to understand the Linear Algebra and Optimization (Gradient Descent) behind Linear Regression.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Preprocessing**: Prepare numeric and categorical features.\n",
|
| 16 |
+
"2. **Splitting**: Divide data into training and testing sets.\n",
|
| 17 |
+
"3. **Training**: Fit a Linear Regression model.\n",
|
| 18 |
+
"4. **Evaluation**: Use metrics like R-squared and Root Mean Squared Error (RMSE).\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. Setup\n",
|
| 28 |
+
"We will use the `diamonds` dataset to predict the `price` of diamonds based on their features."
|
| 29 |
+
]
|
| 30 |
+
},
|
| 31 |
+
{
|
| 32 |
+
"cell_type": "code",
|
| 33 |
+
"execution_count": null,
|
| 34 |
+
"metadata": {},
|
| 35 |
+
"outputs": [],
|
| 36 |
+
"source": [
|
| 37 |
+
"import pandas as pd\n",
|
| 38 |
+
"import numpy as np\n",
|
| 39 |
+
"import matplotlib.pyplot as plt\n",
|
| 40 |
+
"import seaborn as sns\n",
|
| 41 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 42 |
+
"from sklearn.linear_model import LinearRegression\n",
|
| 43 |
+
"from sklearn.metrics import mean_squared_error, r2_score\n",
|
| 44 |
+
"\n",
|
| 45 |
+
"# Load dataset\n",
|
| 46 |
+
"df = sns.load_dataset('diamonds')\n",
|
| 47 |
+
"print(\"Dataset Shape:\", df.shape)\n",
|
| 48 |
+
"df.head()"
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "markdown",
|
| 53 |
+
"metadata": {},
|
| 54 |
+
"source": [
|
| 55 |
+
"## 2. Preprocessing\n",
|
| 56 |
+
"\n",
|
| 57 |
+
"### Task 1: Encode Categorical Variables\n",
|
| 58 |
+
"The columns `cut`, `color`, and `clarity` are categorical. Use One-Hot Encoding to convert them."
|
| 59 |
+
]
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"cell_type": "code",
|
| 63 |
+
"execution_count": null,
|
| 64 |
+
"metadata": {},
|
| 65 |
+
"outputs": [],
|
| 66 |
+
"source": [
|
| 67 |
+
"# YOUR CODE HERE\n"
|
| 68 |
+
]
|
| 69 |
+
},
|
| 70 |
+
{
|
| 71 |
+
"cell_type": "markdown",
|
| 72 |
+
"metadata": {},
|
| 73 |
+
"source": [
|
| 74 |
+
"<details>\n",
|
| 75 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 76 |
+
"\n",
|
| 77 |
+
"```python\n",
|
| 78 |
+
"df_encoded = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)\n",
|
| 79 |
+
"df_encoded.head()\n",
|
| 80 |
+
"```\n",
|
| 81 |
+
"</details>"
|
| 82 |
+
]
|
| 83 |
+
},
|
| 84 |
+
{
|
| 85 |
+
"cell_type": "markdown",
|
| 86 |
+
"metadata": {},
|
| 87 |
+
"source": [
|
| 88 |
+
"### Task 2: Features and Target Selection\n",
|
| 89 |
+
"Define `X` (features) and `y` (target: 'price')."
|
| 90 |
+
]
|
| 91 |
+
},
|
| 92 |
+
{
|
| 93 |
+
"cell_type": "code",
|
| 94 |
+
"execution_count": null,
|
| 95 |
+
"metadata": {},
|
| 96 |
+
"outputs": [],
|
| 97 |
+
"source": [
|
| 98 |
+
"# YOUR CODE HERE\n"
|
| 99 |
+
]
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"cell_type": "markdown",
|
| 103 |
+
"metadata": {},
|
| 104 |
+
"source": [
|
| 105 |
+
"<details>\n",
|
| 106 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 107 |
+
"\n",
|
| 108 |
+
"```python\n",
|
| 109 |
+
"X = df_encoded.drop('price', axis=1)\n",
|
| 110 |
+
"y = df_encoded['price']\n",
|
| 111 |
+
"```\n",
|
| 112 |
+
"</details>"
|
| 113 |
+
]
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"cell_type": "markdown",
|
| 117 |
+
"metadata": {},
|
| 118 |
+
"source": [
|
| 119 |
+
"### Task 3: Train-Test Split\n",
|
| 120 |
+
"Split the data into 80% training and 20% testing."
|
| 121 |
+
]
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"cell_type": "code",
|
| 125 |
+
"execution_count": null,
|
| 126 |
+
"metadata": {},
|
| 127 |
+
"outputs": [],
|
| 128 |
+
"source": [
|
| 129 |
+
"# YOUR CODE HERE\n"
|
| 130 |
+
]
|
| 131 |
+
},
|
| 132 |
+
{
|
| 133 |
+
"cell_type": "markdown",
|
| 134 |
+
"metadata": {},
|
| 135 |
+
"source": [
|
| 136 |
+
"<details>\n",
|
| 137 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 138 |
+
"\n",
|
| 139 |
+
"```python\n",
|
| 140 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 141 |
+
"print(f\"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}\")\n",
|
| 142 |
+
"```\n",
|
| 143 |
+
"</details>"
|
| 144 |
+
]
|
| 145 |
+
},
|
| 146 |
+
{
|
| 147 |
+
"cell_type": "markdown",
|
| 148 |
+
"metadata": {},
|
| 149 |
+
"source": [
|
| 150 |
+
"## 3. Modeling\n",
|
| 151 |
+
"\n",
|
| 152 |
+
"### Task 4: Training the Model\n",
|
| 153 |
+
"Create a LinearRegression object and fit it on the training data."
|
| 154 |
+
]
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"cell_type": "code",
|
| 158 |
+
"execution_count": null,
|
| 159 |
+
"metadata": {},
|
| 160 |
+
"outputs": [],
|
| 161 |
+
"source": [
|
| 162 |
+
"# YOUR CODE HERE\n"
|
| 163 |
+
]
|
| 164 |
+
},
|
| 165 |
+
{
|
| 166 |
+
"cell_type": "markdown",
|
| 167 |
+
"metadata": {},
|
| 168 |
+
"source": [
|
| 169 |
+
"<details>\n",
|
| 170 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 171 |
+
"\n",
|
| 172 |
+
"```python\n",
|
| 173 |
+
"model = LinearRegression()\n",
|
| 174 |
+
"model.fit(X_train, y_train)\n",
|
| 175 |
+
"```\n",
|
| 176 |
+
"</details>"
|
| 177 |
+
]
|
| 178 |
+
},
|
| 179 |
+
{
|
| 180 |
+
"cell_type": "markdown",
|
| 181 |
+
"metadata": {},
|
| 182 |
+
"source": [
|
| 183 |
+
"### Task 5: Making Predictions\n",
|
| 184 |
+
"Predict the values for the test set."
|
| 185 |
+
]
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"cell_type": "code",
|
| 189 |
+
"execution_count": null,
|
| 190 |
+
"metadata": {},
|
| 191 |
+
"outputs": [],
|
| 192 |
+
"source": [
|
| 193 |
+
"# YOUR CODE HERE\n"
|
| 194 |
+
]
|
| 195 |
+
},
|
| 196 |
+
{
|
| 197 |
+
"cell_type": "markdown",
|
| 198 |
+
"metadata": {},
|
| 199 |
+
"source": [
|
| 200 |
+
"<details>\n",
|
| 201 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 202 |
+
"\n",
|
| 203 |
+
"```python\n",
|
| 204 |
+
"y_pred = model.predict(X_test)\n",
|
| 205 |
+
"```\n",
|
| 206 |
+
"</details>"
|
| 207 |
+
]
|
| 208 |
+
},
|
| 209 |
+
{
|
| 210 |
+
"cell_type": "markdown",
|
| 211 |
+
"metadata": {},
|
| 212 |
+
"source": [
|
| 213 |
+
"## 4. Evaluation\n",
|
| 214 |
+
"\n",
|
| 215 |
+
"### Task 6: Error Metrics\n",
|
| 216 |
+
"Calculate R2 Score and RMSE."
|
| 217 |
+
]
|
| 218 |
+
},
|
| 219 |
+
{
|
| 220 |
+
"cell_type": "code",
|
| 221 |
+
"execution_count": null,
|
| 222 |
+
"metadata": {},
|
| 223 |
+
"outputs": [],
|
| 224 |
+
"source": [
|
| 225 |
+
"# YOUR CODE HERE\n"
|
| 226 |
+
]
|
| 227 |
+
},
|
| 228 |
+
{
|
| 229 |
+
"cell_type": "markdown",
|
| 230 |
+
"metadata": {},
|
| 231 |
+
"source": [
|
| 232 |
+
"<details>\n",
|
| 233 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 234 |
+
"\n",
|
| 235 |
+
"```python\n",
|
| 236 |
+
"r2 = r2_score(y_test, y_pred)\n",
|
| 237 |
+
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
|
| 238 |
+
"\n",
|
| 239 |
+
"print(f\"R2 Score: {r2:.4f}\")\n",
|
| 240 |
+
"print(f\"RMSE: {rmse:.2f}\")\n",
|
| 241 |
+
"```\n",
|
| 242 |
+
"</details>"
|
| 243 |
+
]
|
| 244 |
+
},
|
| 245 |
+
{
|
| 246 |
+
"cell_type": "markdown",
|
| 247 |
+
"metadata": {},
|
| 248 |
+
"source": [
|
| 249 |
+
"--- \n",
|
| 250 |
+
"### Well Done! \n",
|
| 251 |
+
"You have successfully built and evaluated a Linear Regression model. \n",
|
| 252 |
+
"Next module: **Logistic Regression** for classification!"
|
| 253 |
+
]
|
| 254 |
+
}
|
| 255 |
+
],
|
| 256 |
+
"metadata": {
|
| 257 |
+
"kernelspec": {
|
| 258 |
+
"display_name": "Python 3",
|
| 259 |
+
"language": "python",
|
| 260 |
+
"name": "python3"
|
| 261 |
+
},
|
| 262 |
+
"language_info": {
|
| 263 |
+
"codemirror_mode": {
|
| 264 |
+
"name": "ipython",
|
| 265 |
+
"version": 3
|
| 266 |
+
},
|
| 267 |
+
"file_extension": ".py",
|
| 268 |
+
"mimetype": "text/x-python",
|
| 269 |
+
"name": "python",
|
| 270 |
+
"nbconvert_exporter": "python",
|
| 271 |
+
"pygments_lexer": "ipython3",
|
| 272 |
+
"version": "3.8.0"
|
| 273 |
+
}
|
| 274 |
+
},
|
| 275 |
+
"nbformat": 4,
|
| 276 |
+
"nbformat_minor": 4
|
| 277 |
+
}
|
ML/09_Logistic_Regression.ipynb
ADDED
|
@@ -0,0 +1,228 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 03 - Logistic Regression\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 03! Today we dive into **Logistic Regression**, the go-to algorithm for binary classification.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Logistic Regression Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub to understand the Sigmoid function and how probability thresholds work.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Scaling**: Understand why feature scaling is important.\n",
|
| 16 |
+
"2. **Classification**: Distinguish between regression and classification.\n",
|
| 17 |
+
"3. **Performance Metrics**: Learn how to interpret a Confusion Matrix and ROC Curve.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Setup\n",
|
| 27 |
+
"We will use the **Breast Cancer Wisconsin** dataset from Scikit-Learn."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd\n",
|
| 37 |
+
"import numpy as np\n",
|
| 38 |
+
"import matplotlib.pyplot as plt\n",
|
| 39 |
+
"import seaborn as sns\n",
|
| 40 |
+
"from sklearn.datasets import load_breast_cancer\n",
|
| 41 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 42 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 43 |
+
"from sklearn.linear_model import LogisticRegression\n",
|
| 44 |
+
"from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, auc\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"# Load dataset\n",
|
| 47 |
+
"data = load_breast_cancer()\n",
|
| 48 |
+
"df = pd.DataFrame(data.data, columns=data.feature_names)\n",
|
| 49 |
+
"df['target'] = data.target\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"print(\"Dataset Shape:\", df.shape)\n",
|
| 52 |
+
"df.head()"
|
| 53 |
+
]
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"cell_type": "markdown",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"source": [
|
| 59 |
+
"## 2. Preprocessing\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"### Task 1: Train-Test Split\n",
|
| 62 |
+
"Split the data (X, y) with a test size of 0.25."
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": null,
|
| 68 |
+
"metadata": {},
|
| 69 |
+
"outputs": [],
|
| 70 |
+
"source": [
|
| 71 |
+
"# YOUR CODE HERE\n"
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "markdown",
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"source": [
|
| 78 |
+
"<details>\n",
|
| 79 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"```python\n",
|
| 82 |
+
"X = df.drop('target', axis=1)\n",
|
| 83 |
+
"y = df['target']\n",
|
| 84 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)\n",
|
| 85 |
+
"```\n",
|
| 86 |
+
"</details>"
|
| 87 |
+
]
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"cell_type": "markdown",
|
| 91 |
+
"metadata": {},
|
| 92 |
+
"source": [
|
| 93 |
+
"### Task 2: Standard Scaling\n",
|
| 94 |
+
"Scale the features using `StandardScaler`.\n",
|
| 95 |
+
"\n",
|
| 96 |
+
"*Web Reference: Check the [Scaling Demo](https://aashishgarg13.github.io/DataScience/feature-engineering/) to see visual differences between Standard and MinMax scalers.*"
|
| 97 |
+
]
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"cell_type": "code",
|
| 101 |
+
"execution_count": null,
|
| 102 |
+
"metadata": {},
|
| 103 |
+
"outputs": [],
|
| 104 |
+
"source": [
|
| 105 |
+
"# YOUR CODE HERE\n"
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "markdown",
|
| 110 |
+
"metadata": {},
|
| 111 |
+
"source": [
|
| 112 |
+
"<details>\n",
|
| 113 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 114 |
+
"\n",
|
| 115 |
+
"```python\n",
|
| 116 |
+
"scaler = StandardScaler()\n",
|
| 117 |
+
"X_train_scaled = scaler.fit_transform(X_train)\n",
|
| 118 |
+
"X_test_scaled = scaler.transform(X_test)\n",
|
| 119 |
+
"```\n",
|
| 120 |
+
"</details>"
|
| 121 |
+
]
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"cell_type": "markdown",
|
| 125 |
+
"metadata": {},
|
| 126 |
+
"source": [
|
| 127 |
+
"## 3. Modeling\n",
|
| 128 |
+
"\n",
|
| 129 |
+
"### Task 3: Training\n",
|
| 130 |
+
"Initialize and fit the `LogisticRegression` model."
|
| 131 |
+
]
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"cell_type": "code",
|
| 135 |
+
"execution_count": null,
|
| 136 |
+
"metadata": {},
|
| 137 |
+
"outputs": [],
|
| 138 |
+
"source": [
|
| 139 |
+
"# YOUR CODE HERE\n"
|
| 140 |
+
]
|
| 141 |
+
},
|
| 142 |
+
{
|
| 143 |
+
"cell_type": "markdown",
|
| 144 |
+
"metadata": {},
|
| 145 |
+
"source": [
|
| 146 |
+
"<details>\n",
|
| 147 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 148 |
+
"\n",
|
| 149 |
+
"```python\n",
|
| 150 |
+
"model = LogisticRegression()\n",
|
| 151 |
+
"model.fit(X_train_scaled, y_train)\n",
|
| 152 |
+
"```\n",
|
| 153 |
+
"</details>"
|
| 154 |
+
]
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"cell_type": "markdown",
|
| 158 |
+
"metadata": {},
|
| 159 |
+
"source": [
|
| 160 |
+
"## 4. Evaluation\n",
|
| 161 |
+
"\n",
|
| 162 |
+
"### Task 4: Confusion Matrix & ROC Curve\n",
|
| 163 |
+
"Plot the confusion matrix and calculate the ROC-AUC score.\n",
|
| 164 |
+
"\n",
|
| 165 |
+
"*Web Reference: [Model Evaluation Interactive](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
|
| 166 |
+
]
|
| 167 |
+
},
|
| 168 |
+
{
|
| 169 |
+
"cell_type": "code",
|
| 170 |
+
"execution_count": null,
|
| 171 |
+
"metadata": {},
|
| 172 |
+
"outputs": [],
|
| 173 |
+
"source": [
|
| 174 |
+
"# YOUR CODE HERE\n"
|
| 175 |
+
]
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"cell_type": "markdown",
|
| 179 |
+
"metadata": {},
|
| 180 |
+
"source": [
|
| 181 |
+
"<details>\n",
|
| 182 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"```python\n",
|
| 185 |
+
"y_pred = model.predict(X_test_scaled)\n",
|
| 186 |
+
"cm = confusion_matrix(y_test, y_pred)\n",
|
| 187 |
+
"sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n",
|
| 188 |
+
"plt.title('Confusion Matrix')\n",
|
| 189 |
+
"plt.show()\n",
|
| 190 |
+
"\n",
|
| 191 |
+
"print(classification_report(y_test, y_pred))\n",
|
| 192 |
+
"```\n",
|
| 193 |
+
"</details>"
|
| 194 |
+
]
|
| 195 |
+
},
|
| 196 |
+
{
|
| 197 |
+
"cell_type": "markdown",
|
| 198 |
+
"metadata": {},
|
| 199 |
+
"source": [
|
| 200 |
+
"--- \n",
|
| 201 |
+
"### Excellent Work! \n",
|
| 202 |
+
"You've mastered Logistic Regression basics and integrated it with your website resources.\n",
|
| 203 |
+
"In the next module, we move to non-linear models: **Decision Trees and Random Forests**."
|
| 204 |
+
]
|
| 205 |
+
}
|
| 206 |
+
],
|
| 207 |
+
"metadata": {
|
| 208 |
+
"kernelspec": {
|
| 209 |
+
"display_name": "Python 3",
|
| 210 |
+
"language": "python",
|
| 211 |
+
"name": "python3"
|
| 212 |
+
},
|
| 213 |
+
"language_info": {
|
| 214 |
+
"codemirror_mode": {
|
| 215 |
+
"name": "ipython",
|
| 216 |
+
"version": 3
|
| 217 |
+
},
|
| 218 |
+
"file_extension": ".py",
|
| 219 |
+
"mimetype": "text/x-python",
|
| 220 |
+
"name": "python",
|
| 221 |
+
"nbconvert_exporter": "python",
|
| 222 |
+
"pygments_lexer": "ipython3",
|
| 223 |
+
"version": "3.8.0"
|
| 224 |
+
}
|
| 225 |
+
},
|
| 226 |
+
"nbformat": 4,
|
| 227 |
+
"nbformat_minor": 4
|
| 228 |
+
}
|
ML/10_Support_Vector_Machines.ipynb
ADDED
|
@@ -0,0 +1,196 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 06 - Support Vector Machines (SVM)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 06! We're exploring **Support Vector Machines**, a powerful algorithm for both linear and non-linear classification.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Visit the **[Machine Learning Guide - SVM Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub to see interactive demos of how the margin changes and how kernels project data into higher dimensions.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Maximum Margin**: Understanding support vectors.\n",
|
| 16 |
+
"2. **The Kernel Trick**: Handling non-linear data.\n",
|
| 17 |
+
"3. **Regularization (C Parameter)**: Hard vs Soft margins.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Environment Setup"
|
| 27 |
+
]
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"cell_type": "code",
|
| 31 |
+
"execution_count": null,
|
| 32 |
+
"metadata": {},
|
| 33 |
+
"outputs": [],
|
| 34 |
+
"source": [
|
| 35 |
+
"import pandas as pd\n",
|
| 36 |
+
"import numpy as np\n",
|
| 37 |
+
"import matplotlib.pyplot as plt\n",
|
| 38 |
+
"import seaborn as sns\n",
|
| 39 |
+
"from sklearn.svm import SVC\n",
|
| 40 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 41 |
+
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
|
| 42 |
+
"from sklearn.datasets import make_moons\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"# Generate non-linear data (Moons)\n",
|
| 45 |
+
"X, y = make_moons(n_samples=200, noise=0.15, random_state=42)\n",
|
| 46 |
+
"plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis')\n",
|
| 47 |
+
"plt.title(\"Non-Linearly Separable Data\")\n",
|
| 48 |
+
"plt.show()"
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "markdown",
|
| 53 |
+
"metadata": {},
|
| 54 |
+
"source": [
|
| 55 |
+
"## 2. Linear SVM\n",
|
| 56 |
+
"\n",
|
| 57 |
+
"### Task 1: Training a Linear SVM\n",
|
| 58 |
+
"Try fitting a linear SVM to this non-linear data and check the accuracy."
|
| 59 |
+
]
|
| 60 |
+
},
|
| 61 |
+
{
|
| 62 |
+
"cell_type": "code",
|
| 63 |
+
"execution_count": null,
|
| 64 |
+
"metadata": {},
|
| 65 |
+
"outputs": [],
|
| 66 |
+
"source": [
|
| 67 |
+
"# YOUR CODE HERE\n"
|
| 68 |
+
]
|
| 69 |
+
},
|
| 70 |
+
{
|
| 71 |
+
"cell_type": "markdown",
|
| 72 |
+
"metadata": {},
|
| 73 |
+
"source": [
|
| 74 |
+
"<details>\n",
|
| 75 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 76 |
+
"\n",
|
| 77 |
+
"```python\n",
|
| 78 |
+
"svm_linear = SVC(kernel='linear')\n",
|
| 79 |
+
"svm_linear.fit(X, y)\n",
|
| 80 |
+
"y_pred = svm_linear.predict(X)\n",
|
| 81 |
+
"print(f\"Linear SVM Accuracy: {accuracy_score(y, y_pred):.4f}\")\n",
|
| 82 |
+
"```\n",
|
| 83 |
+
"</details>"
|
| 84 |
+
]
|
| 85 |
+
},
|
| 86 |
+
{
|
| 87 |
+
"cell_type": "markdown",
|
| 88 |
+
"metadata": {},
|
| 89 |
+
"source": [
|
| 90 |
+
"## 3. The Kernel Trick\n",
|
| 91 |
+
"\n",
|
| 92 |
+
"### Task 2: Polynomial and RBF Kernels\n",
|
| 93 |
+
"Train SVM with `poly` and `rbf` kernels. Which one performs better?\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"*Web Reference: Check the [SVM Kernel Demo](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) to see how kernels transform data.*"
|
| 96 |
+
]
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "code",
|
| 100 |
+
"execution_count": null,
|
| 101 |
+
"metadata": {},
|
| 102 |
+
"outputs": [],
|
| 103 |
+
"source": [
|
| 104 |
+
"# YOUR CODE HERE\n"
|
| 105 |
+
]
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"cell_type": "markdown",
|
| 109 |
+
"metadata": {},
|
| 110 |
+
"source": [
|
| 111 |
+
"<details>\n",
|
| 112 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 113 |
+
"\n",
|
| 114 |
+
"```python\n",
|
| 115 |
+
"svm_rbf = SVC(kernel='rbf', gamma=1)\n",
|
| 116 |
+
"svm_rbf.fit(X, y)\n",
|
| 117 |
+
"y_pred_rbf = svm_rbf.predict(X)\n",
|
| 118 |
+
"print(f\"RBF SVM Accuracy: {accuracy_score(y, y_pred_rbf):.4f}\")\n",
|
| 119 |
+
"```\n",
|
| 120 |
+
"</details>"
|
| 121 |
+
]
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"cell_type": "markdown",
|
| 125 |
+
"metadata": {},
|
| 126 |
+
"source": [
|
| 127 |
+
"## 4. Tuning the C Parameter\n",
|
| 128 |
+
"\n",
|
| 129 |
+
"### Task 3: Impact of C\n",
|
| 130 |
+
"Experiment with very small C (e.g., 0.01) and very large C (e.g., 1000). Monitor the change in decision boundaries.\n",
|
| 131 |
+
"\n",
|
| 132 |
+
"*Hint: Use the [C-Parameter Visualization](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) on your site to see hard vs soft margin.*"
|
| 133 |
+
]
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"cell_type": "code",
|
| 137 |
+
"execution_count": null,
|
| 138 |
+
"metadata": {},
|
| 139 |
+
"outputs": [],
|
| 140 |
+
"source": [
|
| 141 |
+
"# YOUR CODE HERE\n"
|
| 142 |
+
]
|
| 143 |
+
},
|
| 144 |
+
{
|
| 145 |
+
"cell_type": "markdown",
|
| 146 |
+
"metadata": {},
|
| 147 |
+
"source": [
|
| 148 |
+
"<details>\n",
|
| 149 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"```python\n",
|
| 152 |
+
"def plot_svm_boundary(C_val):\n",
|
| 153 |
+
" model = SVC(kernel='rbf', C=C_val)\n",
|
| 154 |
+
" model.fit(X, y)\n",
|
| 155 |
+
" # (Standard boundary plotting code would go here)\n",
|
| 156 |
+
" print(f\"SVM trained with C={C_val}\")\n",
|
| 157 |
+
"\n",
|
| 158 |
+
"plot_svm_boundary(0.01)\n",
|
| 159 |
+
"plot_svm_boundary(1000)\n",
|
| 160 |
+
"```\n",
|
| 161 |
+
"</details>"
|
| 162 |
+
]
|
| 163 |
+
},
|
| 164 |
+
{
|
| 165 |
+
"cell_type": "markdown",
|
| 166 |
+
"metadata": {},
|
| 167 |
+
"source": [
|
| 168 |
+
"--- \n",
|
| 169 |
+
"### Great work! \n",
|
| 170 |
+
"SVM is a classic example of how high-dimensional projection can solve complex problems.\n",
|
| 171 |
+
"Next module: **Advanced Ensemble Methods (XGBoost & Boosting)**."
|
| 172 |
+
]
|
| 173 |
+
}
|
| 174 |
+
],
|
| 175 |
+
"metadata": {
|
| 176 |
+
"kernelspec": {
|
| 177 |
+
"display_name": "Python 3",
|
| 178 |
+
"language": "python",
|
| 179 |
+
"name": "python3"
|
| 180 |
+
},
|
| 181 |
+
"language_info": {
|
| 182 |
+
"codemirror_mode": {
|
| 183 |
+
"name": "ipython",
|
| 184 |
+
"version": 3
|
| 185 |
+
},
|
| 186 |
+
"file_extension": ".py",
|
| 187 |
+
"mimetype": "text/x-python",
|
| 188 |
+
"name": "python",
|
| 189 |
+
"nbconvert_exporter": "python",
|
| 190 |
+
"pygments_lexer": "ipython3",
|
| 191 |
+
"version": "3.8.0"
|
| 192 |
+
}
|
| 193 |
+
},
|
| 194 |
+
"nbformat": 4,
|
| 195 |
+
"nbformat_minor": 4
|
| 196 |
+
}
|
ML/11_K_Nearest_Neighbors.ipynb
ADDED
|
@@ -0,0 +1,201 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 12 - K-Nearest Neighbors (KNN)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 12! We're exploring **KNN**, a simple yet powerful instance-based learning algorithm used for both classification and regression.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Visit the **[KNN Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub to see how the decision boundary changes as you increase $K$ and how different distance metrics (Euclidean vs Manhattan) affect the results.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Instance-based Learning**: Understanding that KNN doesn't \"learn\" a model but stores training data.\n",
|
| 16 |
+
"2. **Feature Scaling**: Why it's absolutely critical for distance-based models.\n",
|
| 17 |
+
"3. **The Elbow Method for K**: Choosing the optimal number of neighbors.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Setup\n",
|
| 27 |
+
"We will use the **Iris** dataset for this classification task."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd\n",
|
| 37 |
+
"import numpy as np\n",
|
| 38 |
+
"import matplotlib.pyplot as plt\n",
|
| 39 |
+
"import seaborn as sns\n",
|
| 40 |
+
"from sklearn.datasets import load_iris\n",
|
| 41 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 42 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 43 |
+
"from sklearn.neighbors import KNeighborsClassifier\n",
|
| 44 |
+
"from sklearn.metrics import classification_report, accuracy_score\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"# Load dataset\n",
|
| 47 |
+
"iris = load_iris()\n",
|
| 48 |
+
"X = iris.data\n",
|
| 49 |
+
"y = iris.target\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"print(\"Features:\", iris.feature_names)\n",
|
| 52 |
+
"print(\"Classes:\", iris.target_names)"
|
| 53 |
+
]
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"cell_type": "markdown",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"source": [
|
| 59 |
+
"## 2. Preprocessing\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"### Task 1: Scaling is Mandatory\n",
|
| 62 |
+
"Split the data (20% test) and scale it using `StandardScaler`."
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": null,
|
| 68 |
+
"metadata": {},
|
| 69 |
+
"outputs": [],
|
| 70 |
+
"source": [
|
| 71 |
+
"# YOUR CODE HERE\n"
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "markdown",
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"source": [
|
| 78 |
+
"<details>\n",
|
| 79 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"```python\n",
|
| 82 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 83 |
+
"scaler = StandardScaler()\n",
|
| 84 |
+
"X_train = scaler.fit_transform(X_train)\n",
|
| 85 |
+
"X_test = scaler.transform(X_test)\n",
|
| 86 |
+
"```\n",
|
| 87 |
+
"</details>"
|
| 88 |
+
]
|
| 89 |
+
},
|
| 90 |
+
{
|
| 91 |
+
"cell_type": "markdown",
|
| 92 |
+
"metadata": {},
|
| 93 |
+
"source": [
|
| 94 |
+
"## 3. Training & Tuning\n",
|
| 95 |
+
"\n",
|
| 96 |
+
"### Task 2: Choosing K\n",
|
| 97 |
+
"Loop through values of $K$ from 1 to 20 and plot the error rate to find the \"elbow\"."
|
| 98 |
+
]
|
| 99 |
+
},
|
| 100 |
+
{
|
| 101 |
+
"cell_type": "code",
|
| 102 |
+
"execution_count": null,
|
| 103 |
+
"metadata": {},
|
| 104 |
+
"outputs": [],
|
| 105 |
+
"source": [
|
| 106 |
+
"# YOUR CODE HERE\n"
|
| 107 |
+
]
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"cell_type": "markdown",
|
| 111 |
+
"metadata": {},
|
| 112 |
+
"source": [
|
| 113 |
+
"<details>\n",
|
| 114 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 115 |
+
"\n",
|
| 116 |
+
"```python\n",
|
| 117 |
+
"error_rate = []\n",
|
| 118 |
+
"for i in range(1, 21):\n",
|
| 119 |
+
" knn = KNeighborsClassifier(n_neighbors=i)\n",
|
| 120 |
+
" knn.fit(X_train, y_train)\n",
|
| 121 |
+
" pred_i = knn.predict(X_test)\n",
|
| 122 |
+
" error_rate.append(np.mean(pred_i != y_test))\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"plt.figure(figsize=(10,6))\n",
|
| 125 |
+
"plt.plot(range(1,21), error_rate, color='blue', linestyle='dashed', marker='o')\n",
|
| 126 |
+
"plt.title('Error Rate vs. K Value')\n",
|
| 127 |
+
"plt.xlabel('K')\n",
|
| 128 |
+
"plt.ylabel('Error Rate')\n",
|
| 129 |
+
"plt.show()\n",
|
| 130 |
+
"```\n",
|
| 131 |
+
"</details>"
|
| 132 |
+
]
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"cell_type": "markdown",
|
| 136 |
+
"metadata": {},
|
| 137 |
+
"source": [
|
| 138 |
+
"## 4. Final Evaluation\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"### Task 3: Train Final Model\n",
|
| 141 |
+
"Based on your plot, choose the best $K$ and print the classification report."
|
| 142 |
+
]
|
| 143 |
+
},
|
| 144 |
+
{
|
| 145 |
+
"cell_type": "code",
|
| 146 |
+
"execution_count": null,
|
| 147 |
+
"metadata": {},
|
| 148 |
+
"outputs": [],
|
| 149 |
+
"source": [
|
| 150 |
+
"# YOUR CODE HERE\n"
|
| 151 |
+
]
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"cell_type": "markdown",
|
| 155 |
+
"metadata": {},
|
| 156 |
+
"source": [
|
| 157 |
+
"<details>\n",
|
| 158 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"```python\n",
|
| 161 |
+
"knn = KNeighborsClassifier(n_neighbors=3)\n",
|
| 162 |
+
"knn.fit(X_train, y_train)\n",
|
| 163 |
+
"y_pred = knn.predict(X_test)\n",
|
| 164 |
+
"print(classification_report(y_test, y_pred))\n",
|
| 165 |
+
"```\n",
|
| 166 |
+
"</details>"
|
| 167 |
+
]
|
| 168 |
+
},
|
| 169 |
+
{
|
| 170 |
+
"cell_type": "markdown",
|
| 171 |
+
"metadata": {},
|
| 172 |
+
"source": [
|
| 173 |
+
"--- \n",
|
| 174 |
+
"### Great Job! \n",
|
| 175 |
+
"You've mastered one of the most intuitive algorithms in ML.\n",
|
| 176 |
+
"Next: **Naive Bayes**."
|
| 177 |
+
]
|
| 178 |
+
}
|
| 179 |
+
],
|
| 180 |
+
"metadata": {
|
| 181 |
+
"kernelspec": {
|
| 182 |
+
"display_name": "Python 3",
|
| 183 |
+
"language": "python",
|
| 184 |
+
"name": "python3"
|
| 185 |
+
},
|
| 186 |
+
"language_info": {
|
| 187 |
+
"codemirror_mode": {
|
| 188 |
+
"name": "ipython",
|
| 189 |
+
"version": 3
|
| 190 |
+
},
|
| 191 |
+
"file_extension": ".py",
|
| 192 |
+
"mimetype": "text/x-python",
|
| 193 |
+
"name": "python",
|
| 194 |
+
"nbconvert_exporter": "python",
|
| 195 |
+
"pygments_lexer": "ipython3",
|
| 196 |
+
"version": "3.12.7"
|
| 197 |
+
}
|
| 198 |
+
},
|
| 199 |
+
"nbformat": 4,
|
| 200 |
+
"nbformat_minor": 4
|
| 201 |
+
}
|
ML/12_Naive_Bayes.ipynb
ADDED
|
@@ -0,0 +1,162 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 13 - Naive Bayes\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 13! We're exploring **Naive Bayes**, a probabilistic classifier based on Bayes' Theorem with the \"naive\" assumption of independence between features.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Naive Bayes Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for the mathematical derivation of $P(A|B)$ and how it's used in spam filtering.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Bayes Theorem**: Calculating posterior probability.\n",
|
| 16 |
+
"2. **Different Variants**: Gaussian vs Multinomial vs Bernoulli.\n",
|
| 17 |
+
"3. **Text Classification**: Using Naive Bayes for NLP tasks.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Setup\n",
|
| 27 |
+
"We will use a small text dataset for **Spam detection**."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd \n",
|
| 37 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 38 |
+
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
| 39 |
+
"from sklearn.naive_bayes import MultinomialNB\n",
|
| 40 |
+
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# Sample Text Data\n",
|
| 43 |
+
"data = {\n",
|
| 44 |
+
" 'text': [\n",
|
| 45 |
+
" 'Free money now!', \n",
|
| 46 |
+
" 'Hi, how are you?', \n",
|
| 47 |
+
" 'Limited offer, buy now!', \n",
|
| 48 |
+
" 'Meeting at 5pm', \n",
|
| 49 |
+
" 'Win a prize today!', \n",
|
| 50 |
+
" 'Review the documents'\n",
|
| 51 |
+
" ],\n",
|
| 52 |
+
" 'label': [1, 0, 1, 0, 1, 0] # 1 = Spam, 0 = Ham\n",
|
| 53 |
+
"}\n",
|
| 54 |
+
"df = pd.DataFrame(data)\n",
|
| 55 |
+
"df"
|
| 56 |
+
]
|
| 57 |
+
},
|
| 58 |
+
{
|
| 59 |
+
"cell_type": "markdown",
|
| 60 |
+
"metadata": {},
|
| 61 |
+
"source": [
|
| 62 |
+
"## 2. Text Preprocessing\n",
|
| 63 |
+
"\n",
|
| 64 |
+
"### Task 1: Vectorization\n",
|
| 65 |
+
"Machine learning models can't read text directly. Use `CountVectorizer` to convert text into a matrix of token counts."
|
| 66 |
+
]
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"cell_type": "code",
|
| 70 |
+
"execution_count": null,
|
| 71 |
+
"metadata": {},
|
| 72 |
+
"outputs": [],
|
| 73 |
+
"source": [
|
| 74 |
+
"# YOUR CODE HERE\n"
|
| 75 |
+
]
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"cell_type": "markdown",
|
| 79 |
+
"metadata": {},
|
| 80 |
+
"source": [
|
| 81 |
+
"<details>\n",
|
| 82 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 83 |
+
"\n",
|
| 84 |
+
"```python\n",
|
| 85 |
+
"cv = CountVectorizer(stop_words='english')\n",
|
| 86 |
+
"X = cv.fit_transform(df['text'])\n",
|
| 87 |
+
"y = df['label']\n",
|
| 88 |
+
"```\n",
|
| 89 |
+
"</details>"
|
| 90 |
+
]
|
| 91 |
+
},
|
| 92 |
+
{
|
| 93 |
+
"cell_type": "markdown",
|
| 94 |
+
"metadata": {},
|
| 95 |
+
"source": [
|
| 96 |
+
"## 3. Training & Prediction\n",
|
| 97 |
+
"\n",
|
| 98 |
+
"### Task 2: Multinomial NB\n",
|
| 99 |
+
"Fit a `MultinomialNB` model and predict the class for a new message: \"Win money buy now\"."
|
| 100 |
+
]
|
| 101 |
+
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "code",
|
| 104 |
+
"execution_count": null,
|
| 105 |
+
"metadata": {},
|
| 106 |
+
"outputs": [],
|
| 107 |
+
"source": [
|
| 108 |
+
"# YOUR CODE HERE\n"
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"cell_type": "markdown",
|
| 113 |
+
"metadata": {},
|
| 114 |
+
"source": [
|
| 115 |
+
"<details>\n",
|
| 116 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"```python\n",
|
| 119 |
+
"nb = MultinomialNB()\n",
|
| 120 |
+
"nb.fit(X, y)\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"new_msg = [\"Win money buy now\"]\n",
|
| 123 |
+
"new_vec = cv.transform(new_msg)\n",
|
| 124 |
+
"prediction = nb.predict(new_vec)\n",
|
| 125 |
+
"print(\"Spam\" if prediction[0] == 1 else \"Ham\")\n",
|
| 126 |
+
"```\n",
|
| 127 |
+
"</details>"
|
| 128 |
+
]
|
| 129 |
+
},
|
| 130 |
+
{
|
| 131 |
+
"cell_type": "markdown",
|
| 132 |
+
"metadata": {},
|
| 133 |
+
"source": [
|
| 134 |
+
"--- \n",
|
| 135 |
+
"### Excellent Probabilistic Thinking! \n",
|
| 136 |
+
"Naive Bayes is often the baseline for NLP projects because it's fast and effective.\n",
|
| 137 |
+
"Next: **Gradient Boosting & XGBoost**."
|
| 138 |
+
]
|
| 139 |
+
}
|
| 140 |
+
],
|
| 141 |
+
"metadata": {
|
| 142 |
+
"kernelspec": {
|
| 143 |
+
"display_name": "Python 3",
|
| 144 |
+
"language": "python",
|
| 145 |
+
"name": "python3"
|
| 146 |
+
},
|
| 147 |
+
"language_info": {
|
| 148 |
+
"codemirror_mode": {
|
| 149 |
+
"name": "ipython",
|
| 150 |
+
"version": 3
|
| 151 |
+
},
|
| 152 |
+
"file_extension": ".py",
|
| 153 |
+
"mimetype": "text/x-python",
|
| 154 |
+
"name": "python",
|
| 155 |
+
"nbconvert_exporter": "python",
|
| 156 |
+
"pygments_lexer": "ipython3",
|
| 157 |
+
"version": "3.12.7"
|
| 158 |
+
}
|
| 159 |
+
},
|
| 160 |
+
"nbformat": 4,
|
| 161 |
+
"nbformat_minor": 4
|
| 162 |
+
}
|
ML/13_Decision_Trees_and_Random_Forests.ipynb
ADDED
|
@@ -0,0 +1,258 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 04 - Decision Trees & Random Forests\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 04! We are moving into the world of **Tree-Based Models**. These are powerful, interpretable, and form the basis for state-of-the-art algorithms.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Decision Trees**: Understand how models split data.\n",
|
| 13 |
+
"2. **Random Forests**: Learn about Ensembles and Bagging.\n",
|
| 14 |
+
"3. **Interpretability**: Analyze Feature Importance.\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"---"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"metadata": {},
|
| 22 |
+
"source": [
|
| 23 |
+
"## 1. Setup\n",
|
| 24 |
+
"We will use the **Penguins** dataset to classify penguin species."
|
| 25 |
+
]
|
| 26 |
+
},
|
| 27 |
+
{
|
| 28 |
+
"cell_type": "code",
|
| 29 |
+
"execution_count": null,
|
| 30 |
+
"metadata": {},
|
| 31 |
+
"outputs": [],
|
| 32 |
+
"source": [
|
| 33 |
+
"import pandas as pd\n",
|
| 34 |
+
"import numpy as np\n",
|
| 35 |
+
"import matplotlib.pyplot as plt\n",
|
| 36 |
+
"import seaborn as sns\n",
|
| 37 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 38 |
+
"from sklearn.tree import DecisionTreeClassifier, plot_tree\n",
|
| 39 |
+
"from sklearn.ensemble import RandomForestClassifier\n",
|
| 40 |
+
"from sklearn.metrics import classification_report, accuracy_score\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# Load dataset\n",
|
| 43 |
+
"df = sns.load_dataset('penguins')\n",
|
| 44 |
+
"print(\"Dataset Shape:\", df.shape)\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"# Quick clean-up (dropping missing values for this exercise)\n",
|
| 47 |
+
"df.dropna(inplace=True)\n",
|
| 48 |
+
"df.head()"
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "markdown",
|
| 53 |
+
"metadata": {},
|
| 54 |
+
"source": [
|
| 55 |
+
"## 2. Preprocessing\n",
|
| 56 |
+
"\n",
|
| 57 |
+
"### Task 1: Label Encoding and One-Hot Encoding\n",
|
| 58 |
+
"1. Convert target `species` into codes.\n",
|
| 59 |
+
"2. One-Hot Encode `island` and `sex`."
|
| 60 |
+
]
|
| 61 |
+
},
|
| 62 |
+
{
|
| 63 |
+
"cell_type": "code",
|
| 64 |
+
"execution_count": null,
|
| 65 |
+
"metadata": {},
|
| 66 |
+
"outputs": [],
|
| 67 |
+
"source": [
|
| 68 |
+
"# YOUR CODE HERE\n"
|
| 69 |
+
]
|
| 70 |
+
},
|
| 71 |
+
{
|
| 72 |
+
"cell_type": "markdown",
|
| 73 |
+
"metadata": {},
|
| 74 |
+
"source": [
|
| 75 |
+
"<details>\n",
|
| 76 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"```python\n",
|
| 79 |
+
"from sklearn.preprocessing import LabelEncoder\n",
|
| 80 |
+
"le = LabelEncoder()\n",
|
| 81 |
+
"df['species'] = le.fit_transform(df['species'])\n",
|
| 82 |
+
"\n",
|
| 83 |
+
"df = pd.get_dummies(df, columns=['island', 'sex'], drop_first=True)\n",
|
| 84 |
+
"df.head()\n",
|
| 85 |
+
"```\n",
|
| 86 |
+
"</details>"
|
| 87 |
+
]
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"cell_type": "markdown",
|
| 91 |
+
"metadata": {},
|
| 92 |
+
"source": [
|
| 93 |
+
"### Task 2: Split Data\n",
|
| 94 |
+
"Set `species` as target `y` and others as `X`. Split (test_size=0.2)."
|
| 95 |
+
]
|
| 96 |
+
},
|
| 97 |
+
{
|
| 98 |
+
"cell_type": "code",
|
| 99 |
+
"execution_count": null,
|
| 100 |
+
"metadata": {},
|
| 101 |
+
"outputs": [],
|
| 102 |
+
"source": [
|
| 103 |
+
"# YOUR CODE HERE\n"
|
| 104 |
+
]
|
| 105 |
+
},
|
| 106 |
+
{
|
| 107 |
+
"cell_type": "markdown",
|
| 108 |
+
"metadata": {},
|
| 109 |
+
"source": [
|
| 110 |
+
"<details>\n",
|
| 111 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 112 |
+
"\n",
|
| 113 |
+
"```python\n",
|
| 114 |
+
"X = df.drop('species', axis=1)\n",
|
| 115 |
+
"y = df['species']\n",
|
| 116 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 117 |
+
"```\n",
|
| 118 |
+
"</details>"
|
| 119 |
+
]
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"cell_type": "markdown",
|
| 123 |
+
"metadata": {},
|
| 124 |
+
"source": [
|
| 125 |
+
"## 3. Decision Tree\n",
|
| 126 |
+
"\n",
|
| 127 |
+
"### Task 3: Training and Visualizing\n",
|
| 128 |
+
"Train a `DecisionTreeClassifier` and plot the tree structure."
|
| 129 |
+
]
|
| 130 |
+
},
|
| 131 |
+
{
|
| 132 |
+
"cell_type": "code",
|
| 133 |
+
"execution_count": null,
|
| 134 |
+
"metadata": {},
|
| 135 |
+
"outputs": [],
|
| 136 |
+
"source": [
|
| 137 |
+
"# YOUR CODE HERE\n"
|
| 138 |
+
]
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"cell_type": "markdown",
|
| 142 |
+
"metadata": {},
|
| 143 |
+
"source": [
|
| 144 |
+
"<details>\n",
|
| 145 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"```python\n",
|
| 148 |
+
"dt = DecisionTreeClassifier(max_depth=3)\n",
|
| 149 |
+
"dt.fit(X_train, y_train)\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"plt.figure(figsize=(20,10))\n",
|
| 152 |
+
"plot_tree(dt, feature_names=X.columns, class_names=le.classes_, filled=True)\n",
|
| 153 |
+
"plt.show()\n",
|
| 154 |
+
"```\n",
|
| 155 |
+
"</details>"
|
| 156 |
+
]
|
| 157 |
+
},
|
| 158 |
+
{
|
| 159 |
+
"cell_type": "markdown",
|
| 160 |
+
"metadata": {},
|
| 161 |
+
"source": [
|
| 162 |
+
"## 4. Random Forest (Ensemble)\n",
|
| 163 |
+
"\n",
|
| 164 |
+
"### Task 4: Random Forest Classifier\n",
|
| 165 |
+
"Initialize `RandomForestClassifier` with 100 estimators and fit it."
|
| 166 |
+
]
|
| 167 |
+
},
|
| 168 |
+
{
|
| 169 |
+
"cell_type": "code",
|
| 170 |
+
"execution_count": null,
|
| 171 |
+
"metadata": {},
|
| 172 |
+
"outputs": [],
|
| 173 |
+
"source": [
|
| 174 |
+
"# YOUR CODE HERE\n"
|
| 175 |
+
]
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"cell_type": "markdown",
|
| 179 |
+
"metadata": {},
|
| 180 |
+
"source": [
|
| 181 |
+
"<details>\n",
|
| 182 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"```python\n",
|
| 185 |
+
"rf = RandomForestClassifier(n_estimators=100, random_state=42)\n",
|
| 186 |
+
"rf.fit(X_train, y_train)\n",
|
| 187 |
+
"y_pred = rf.predict(X_test)\n",
|
| 188 |
+
"print(f\"Accuracy: {accuracy_score(y_test, y_pred):.4f}\")\n",
|
| 189 |
+
"```\n",
|
| 190 |
+
"</details>"
|
| 191 |
+
]
|
| 192 |
+
},
|
| 193 |
+
{
|
| 194 |
+
"cell_type": "markdown",
|
| 195 |
+
"metadata": {},
|
| 196 |
+
"source": [
|
| 197 |
+
"### Task 5: Feature Importance\n",
|
| 198 |
+
"Visualize which features contributed most to the Random Forest model."
|
| 199 |
+
]
|
| 200 |
+
},
|
| 201 |
+
{
|
| 202 |
+
"cell_type": "code",
|
| 203 |
+
"execution_count": null,
|
| 204 |
+
"metadata": {},
|
| 205 |
+
"outputs": [],
|
| 206 |
+
"source": [
|
| 207 |
+
"# YOUR CODE HERE\n"
|
| 208 |
+
]
|
| 209 |
+
},
|
| 210 |
+
{
|
| 211 |
+
"cell_type": "markdown",
|
| 212 |
+
"metadata": {},
|
| 213 |
+
"source": [
|
| 214 |
+
"<details>\n",
|
| 215 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 216 |
+
"\n",
|
| 217 |
+
"```python\n",
|
| 218 |
+
"importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)\n",
|
| 219 |
+
"sns.barplot(x=importances, y=importances.index)\n",
|
| 220 |
+
"plt.title('Feature Importances')\n",
|
| 221 |
+
"plt.show()\n",
|
| 222 |
+
"```\n",
|
| 223 |
+
"</details>"
|
| 224 |
+
]
|
| 225 |
+
},
|
| 226 |
+
{
|
| 227 |
+
"cell_type": "markdown",
|
| 228 |
+
"metadata": {},
|
| 229 |
+
"source": [
|
| 230 |
+
"--- \n",
|
| 231 |
+
"### Amazing! \n",
|
| 232 |
+
"You've learned how ensembles can improve performance and how to interpret them. \n",
|
| 233 |
+
"Next module: **Unsupervised Learning (K-Means Clustering)**."
|
| 234 |
+
]
|
| 235 |
+
}
|
| 236 |
+
],
|
| 237 |
+
"metadata": {
|
| 238 |
+
"kernelspec": {
|
| 239 |
+
"display_name": "Python 3",
|
| 240 |
+
"language": "python",
|
| 241 |
+
"name": "python3"
|
| 242 |
+
},
|
| 243 |
+
"language_info": {
|
| 244 |
+
"codemirror_mode": {
|
| 245 |
+
"name": "ipython",
|
| 246 |
+
"version": 3
|
| 247 |
+
},
|
| 248 |
+
"file_extension": ".py",
|
| 249 |
+
"mimetype": "text/x-python",
|
| 250 |
+
"name": "python",
|
| 251 |
+
"nbconvert_exporter": "python",
|
| 252 |
+
"pygments_lexer": "ipython3",
|
| 253 |
+
"version": "3.8.0"
|
| 254 |
+
}
|
| 255 |
+
},
|
| 256 |
+
"nbformat": 4,
|
| 257 |
+
"nbformat_minor": 4
|
| 258 |
+
}
|
ML/14_Gradient_Boosting_XGBoost.ipynb
ADDED
|
@@ -0,0 +1,159 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 14 - Gradient Boosting & XGBoost\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 14! We're moving into **Boosting**, where we train models sequentially to correct previous errors. This includes **Gradient Boosting** and its optimized version, **XGBoost**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Boosting Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for a comparison of Bagging vs. Boosting and interactive diagrams of residual refinement.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Boosting Principle**: How weak learners become strong learners.\n",
|
| 16 |
+
"2. **XGBoost**: Extreme Gradient Boosting and its hardware efficiency.\n",
|
| 17 |
+
"3. **Tuning**: Learning rates, tree depth, and subsampling.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Setup\n",
|
| 27 |
+
"We will use the **Wine Quality** dataset from Scikit-Learn (regression)."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd\n",
|
| 37 |
+
"import numpy as np\n",
|
| 38 |
+
"from sklearn.datasets import load_wine\n",
|
| 39 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 40 |
+
"from sklearn.ensemble import GradientBoostingClassifier\n",
|
| 41 |
+
"from sklearn.metrics import accuracy_score, classification_report\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"# For XGBoost, you'll need the library installed\n",
|
| 44 |
+
"# (pip install xgboost)\n",
|
| 45 |
+
"import xgboost as xgb\n",
|
| 46 |
+
"\n",
|
| 47 |
+
"# Load dataset\n",
|
| 48 |
+
"wine = load_wine()\n",
|
| 49 |
+
"X = wine.data\n",
|
| 50 |
+
"y = wine.target\n",
|
| 51 |
+
"\n",
|
| 52 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
|
| 53 |
+
]
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"cell_type": "markdown",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"source": [
|
| 59 |
+
"## 2. Gradient Boosting\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"### Task 1: Scikit-Learn Gradient Boosting\n",
|
| 62 |
+
"Train a `GradientBoostingClassifier` and evaluate it on the test set."
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": null,
|
| 68 |
+
"metadata": {},
|
| 69 |
+
"outputs": [],
|
| 70 |
+
"source": [
|
| 71 |
+
"# YOUR CODE HERE\n"
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "markdown",
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"source": [
|
| 78 |
+
"<details>\n",
|
| 79 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"```python\n",
|
| 82 |
+
"gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)\n",
|
| 83 |
+
"gb.fit(X_train, y_train)\n",
|
| 84 |
+
"y_pred = gb.predict(X_test)\n",
|
| 85 |
+
"print(\"GB Accuracy:\", accuracy_score(y_test, y_pred))\n",
|
| 86 |
+
"```\n",
|
| 87 |
+
"</details>"
|
| 88 |
+
]
|
| 89 |
+
},
|
| 90 |
+
{
|
| 91 |
+
"cell_type": "markdown",
|
| 92 |
+
"metadata": {},
|
| 93 |
+
"source": [
|
| 94 |
+
"## 3. XGBoost (The Kaggle Champion)\n",
|
| 95 |
+
"\n",
|
| 96 |
+
"### Task 2: Training XGBoost\n",
|
| 97 |
+
"Use the `XGBClassifier` to train a model and check its performance. Notice the speed advantage.\n",
|
| 98 |
+
"\n",
|
| 99 |
+
"*Web Reference: [XGBoost Section on your site](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
|
| 100 |
+
]
|
| 101 |
+
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "code",
|
| 104 |
+
"execution_count": null,
|
| 105 |
+
"metadata": {},
|
| 106 |
+
"outputs": [],
|
| 107 |
+
"source": [
|
| 108 |
+
"# YOUR CODE HERE\n"
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"cell_type": "markdown",
|
| 113 |
+
"metadata": {},
|
| 114 |
+
"source": [
|
| 115 |
+
"<details>\n",
|
| 116 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"```python\n",
|
| 119 |
+
"xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, use_label_encoder=False, eval_metric='mlogloss')\n",
|
| 120 |
+
"xgb_model.fit(X_train, y_train)\n",
|
| 121 |
+
"y_pred_xgb = xgb_model.predict(X_test)\n",
|
| 122 |
+
"print(\"XGB Accuracy:\", accuracy_score(y_test, y_pred_xgb))\n",
|
| 123 |
+
"```\n",
|
| 124 |
+
"</details>"
|
| 125 |
+
]
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"cell_type": "markdown",
|
| 129 |
+
"metadata": {},
|
| 130 |
+
"source": [
|
| 131 |
+
"--- \n",
|
| 132 |
+
"### Power Move! \n",
|
| 133 |
+
"You've learned how to harness Gradient Boosting. These models are often the most accurate for structured data.\n",
|
| 134 |
+
"Next: **Dimensionality Reduction (PCA)**."
|
| 135 |
+
]
|
| 136 |
+
}
|
| 137 |
+
],
|
| 138 |
+
"metadata": {
|
| 139 |
+
"kernelspec": {
|
| 140 |
+
"display_name": "Python 3",
|
| 141 |
+
"language": "python",
|
| 142 |
+
"name": "python3"
|
| 143 |
+
},
|
| 144 |
+
"language_info": {
|
| 145 |
+
"codemirror_mode": {
|
| 146 |
+
"name": "ipython",
|
| 147 |
+
"version": 3
|
| 148 |
+
},
|
| 149 |
+
"file_extension": ".py",
|
| 150 |
+
"mimetype": "text/x-python",
|
| 151 |
+
"name": "python",
|
| 152 |
+
"nbconvert_exporter": "python",
|
| 153 |
+
"pygments_lexer": "ipython3",
|
| 154 |
+
"version": "3.12.7"
|
| 155 |
+
}
|
| 156 |
+
},
|
| 157 |
+
"nbformat": 4,
|
| 158 |
+
"nbformat_minor": 4
|
| 159 |
+
}
|
ML/15_KMeans_Clustering.ipynb
ADDED
|
@@ -0,0 +1,195 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 05 - K-Means Clustering\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to the final module of this basic series! We are exploring **Unsupervised Learning** with **K-Means Clustering**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Unsupervised Learning**: Pattern discovery without labels.\n",
|
| 13 |
+
"2. **K-Means**: How the algorithm groups data.\n",
|
| 14 |
+
"3. **Elbow Method**: Deciding the number of clusters (K).\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"---"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"metadata": {},
|
| 22 |
+
"source": [
|
| 23 |
+
"## 1. Setup\n",
|
| 24 |
+
"We will generate a synthetic dataset for this exercise to clearly see the clusters."
|
| 25 |
+
]
|
| 26 |
+
},
|
| 27 |
+
{
|
| 28 |
+
"cell_type": "code",
|
| 29 |
+
"execution_count": null,
|
| 30 |
+
"metadata": {},
|
| 31 |
+
"outputs": [],
|
| 32 |
+
"source": [
|
| 33 |
+
"import pandas as pd\n",
|
| 34 |
+
"import numpy as np\n",
|
| 35 |
+
"import matplotlib.pyplot as plt\n",
|
| 36 |
+
"import seaborn as sns\n",
|
| 37 |
+
"from sklearn.cluster import KMeans\n",
|
| 38 |
+
"from sklearn.datasets import make_blobs\n",
|
| 39 |
+
"\n",
|
| 40 |
+
"# Generate synthetic data\n",
|
| 41 |
+
"X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)\n",
|
| 42 |
+
"df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"plt.scatter(df['Feature 1'], df['Feature 2'], s=30, alpha=0.5)\n",
|
| 45 |
+
"plt.title(\"Original Data (Unlabeled)\")\n",
|
| 46 |
+
"plt.show()"
|
| 47 |
+
]
|
| 48 |
+
},
|
| 49 |
+
{
|
| 50 |
+
"cell_type": "markdown",
|
| 51 |
+
"metadata": {},
|
| 52 |
+
"source": [
|
| 53 |
+
"## 2. K-Means Implementation\n",
|
| 54 |
+
"\n",
|
| 55 |
+
"### Task 1: Find Optimal K (Elbow Method)\n",
|
| 56 |
+
"Calculate inertia (Within-Cluster Sum of Squares) for K values from 1 to 10."
|
| 57 |
+
]
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"cell_type": "code",
|
| 61 |
+
"execution_count": null,
|
| 62 |
+
"metadata": {},
|
| 63 |
+
"outputs": [],
|
| 64 |
+
"source": [
|
| 65 |
+
"# YOUR CODE HERE\n"
|
| 66 |
+
]
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"cell_type": "markdown",
|
| 70 |
+
"metadata": {},
|
| 71 |
+
"source": [
|
| 72 |
+
"<details>\n",
|
| 73 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"```python\n",
|
| 76 |
+
"inertia = []\n",
|
| 77 |
+
"for k in range(1, 11):\n",
|
| 78 |
+
" kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)\n",
|
| 79 |
+
" kmeans.fit(X)\n",
|
| 80 |
+
" inertia.append(kmeans.inertia_)\n",
|
| 81 |
+
"\n",
|
| 82 |
+
"plt.plot(range(1, 11), inertia, 'bx-')\n",
|
| 83 |
+
"plt.xlabel('K values')\n",
|
| 84 |
+
"plt.ylabel('Inertia')\n",
|
| 85 |
+
"plt.title('Elbow Method')\n",
|
| 86 |
+
"plt.show()\n",
|
| 87 |
+
"```\n",
|
| 88 |
+
"</details>"
|
| 89 |
+
]
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"cell_type": "markdown",
|
| 93 |
+
"metadata": {},
|
| 94 |
+
"source": [
|
| 95 |
+
"### Task 2: Fit K-Means\n",
|
| 96 |
+
"From the elbow plot, choose the best K (looks like 4) and fit the model."
|
| 97 |
+
]
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"cell_type": "code",
|
| 101 |
+
"execution_count": null,
|
| 102 |
+
"metadata": {},
|
| 103 |
+
"outputs": [],
|
| 104 |
+
"source": [
|
| 105 |
+
"# YOUR CODE HERE\n"
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "markdown",
|
| 110 |
+
"metadata": {},
|
| 111 |
+
"source": [
|
| 112 |
+
"<details>\n",
|
| 113 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 114 |
+
"\n",
|
| 115 |
+
"```python\n",
|
| 116 |
+
"kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)\n",
|
| 117 |
+
"df['cluster'] = kmeans.fit_predict(X)\n",
|
| 118 |
+
"```\n",
|
| 119 |
+
"</details>"
|
| 120 |
+
]
|
| 121 |
+
},
|
| 122 |
+
{
|
| 123 |
+
"cell_type": "markdown",
|
| 124 |
+
"metadata": {},
|
| 125 |
+
"source": [
|
| 126 |
+
"### Task 3: Visualize Clusters\n",
|
| 127 |
+
"Scatter plot again, but color points by their assigned cluster."
|
| 128 |
+
]
|
| 129 |
+
},
|
| 130 |
+
{
|
| 131 |
+
"cell_type": "code",
|
| 132 |
+
"execution_count": null,
|
| 133 |
+
"metadata": {},
|
| 134 |
+
"outputs": [],
|
| 135 |
+
"source": [
|
| 136 |
+
"# YOUR CODE HERE\n"
|
| 137 |
+
]
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"cell_type": "markdown",
|
| 141 |
+
"metadata": {},
|
| 142 |
+
"source": [
|
| 143 |
+
"<details>\n",
|
| 144 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 145 |
+
"\n",
|
| 146 |
+
"```python\n",
|
| 147 |
+
"plt.scatter(df['Feature 1'], df['Feature 2'], c=df['cluster'], cmap='viridis', s=30)\n",
|
| 148 |
+
"plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, marker='X', label='Centroids')\n",
|
| 149 |
+
"plt.legend()\n",
|
| 150 |
+
"plt.title(\"Clustered Data\")\n",
|
| 151 |
+
"plt.show()\n",
|
| 152 |
+
"```\n",
|
| 153 |
+
"</details>"
|
| 154 |
+
]
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"cell_type": "markdown",
|
| 158 |
+
"metadata": {},
|
| 159 |
+
"source": [
|
| 160 |
+
"--- \n",
|
| 161 |
+
"### Congratulations! \n",
|
| 162 |
+
"You've completed the foundational Machine Learning practice series. \n",
|
| 163 |
+
"You now have hands-on experience with:\n",
|
| 164 |
+
"1. EDA & Feature Engineering\n",
|
| 165 |
+
"2. Linear Regression\n",
|
| 166 |
+
"3. Logistic Regression\n",
|
| 167 |
+
"4. Decision Trees & Random Forests\n",
|
| 168 |
+
"5. K-Means Clustering\n",
|
| 169 |
+
"\n",
|
| 170 |
+
"Keep practicing with new datasets!"
|
| 171 |
+
]
|
| 172 |
+
}
|
| 173 |
+
],
|
| 174 |
+
"metadata": {
|
| 175 |
+
"kernelspec": {
|
| 176 |
+
"display_name": "Python 3",
|
| 177 |
+
"language": "python",
|
| 178 |
+
"name": "python3"
|
| 179 |
+
},
|
| 180 |
+
"language_info": {
|
| 181 |
+
"codemirror_mode": {
|
| 182 |
+
"name": "ipython",
|
| 183 |
+
"version": 3
|
| 184 |
+
},
|
| 185 |
+
"file_extension": ".py",
|
| 186 |
+
"mimetype": "text/x-python",
|
| 187 |
+
"name": "python",
|
| 188 |
+
"nbconvert_exporter": "python",
|
| 189 |
+
"pygments_lexer": "ipython3",
|
| 190 |
+
"version": "3.8.0"
|
| 191 |
+
}
|
| 192 |
+
},
|
| 193 |
+
"nbformat": 4,
|
| 194 |
+
"nbformat_minor": 4
|
| 195 |
+
}
|
ML/16_Dimensionality_Reduction_PCA.ipynb
ADDED
|
@@ -0,0 +1,168 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 15 - Dimensionality Reduction (PCA)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 15! We're exploring **PCA (Principal Component Analysis)**, a technique for reducing the number of variables in your data while preserving as much information as possible.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for the Linear Algebra (Eigenvalues/Eigenvectors) behind PCA.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Information Compression**: Reducing features without losing pattern labels.\n",
|
| 16 |
+
"2. **Visualization**: Plotting high-dimensional data in 2D or 3D.\n",
|
| 17 |
+
"3. **Explained Variance**: Understanding how many components we actually need.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Setup\n",
|
| 27 |
+
"We will use the **Digits** dataset (8x8 images of handwritten digits) which flattened has 64 features."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd\n",
|
| 37 |
+
"import numpy as np\n",
|
| 38 |
+
"import matplotlib.pyplot as plt\n",
|
| 39 |
+
"import seaborn as sns\n",
|
| 40 |
+
"from sklearn.datasets import load_digits\n",
|
| 41 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 42 |
+
"from sklearn.decomposition import PCA\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"# Load dataset\n",
|
| 45 |
+
"digits = load_digits()\n",
|
| 46 |
+
"X = digits.data\n",
|
| 47 |
+
"y = digits.target\n",
|
| 48 |
+
"\n",
|
| 49 |
+
"print(\"Original Shape:\", X.shape)"
|
| 50 |
+
]
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"cell_type": "markdown",
|
| 54 |
+
"metadata": {},
|
| 55 |
+
"source": [
|
| 56 |
+
"## 2. Visualization via PCA\n",
|
| 57 |
+
"\n",
|
| 58 |
+
"### Task 1: 2D Projection\n",
|
| 59 |
+
"Reduce the 64 features down to 2 and visualize the digits on a scatter plot.\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"*Web Reference: Check [Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/) for how to present these results.*"
|
| 62 |
+
]
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "code",
|
| 66 |
+
"execution_count": null,
|
| 67 |
+
"metadata": {},
|
| 68 |
+
"outputs": [],
|
| 69 |
+
"source": [
|
| 70 |
+
"# YOUR CODE HERE\n"
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "markdown",
|
| 75 |
+
"metadata": {},
|
| 76 |
+
"source": [
|
| 77 |
+
"<details>\n",
|
| 78 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 79 |
+
"\n",
|
| 80 |
+
"```python\n",
|
| 81 |
+
"scaler = StandardScaler()\n",
|
| 82 |
+
"X_scaled = scaler.fit_transform(X)\n",
|
| 83 |
+
"\n",
|
| 84 |
+
"pca = PCA(n_components=2)\n",
|
| 85 |
+
"X_pca = pca.fit_transform(X_scaled)\n",
|
| 86 |
+
"\n",
|
| 87 |
+
"plt.figure(figsize=(10, 8))\n",
|
| 88 |
+
"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)\n",
|
| 89 |
+
"plt.colorbar(label='Digit Label')\n",
|
| 90 |
+
"plt.title('Digits Dataset: 64D flattened to 2D via PCA')\n",
|
| 91 |
+
"plt.xlabel('PC1')\n",
|
| 92 |
+
"plt.ylabel('PC2')\n",
|
| 93 |
+
"plt.show()\n",
|
| 94 |
+
"```\n",
|
| 95 |
+
"</details>"
|
| 96 |
+
]
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "markdown",
|
| 100 |
+
"metadata": {},
|
| 101 |
+
"source": [
|
| 102 |
+
"## 3. Selecting Components\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"### Task 2: Scree Plot\n",
|
| 105 |
+
"Calculate the cumulative explained variance for all components and identify how many are needed to keep 95% of the information."
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "code",
|
| 110 |
+
"execution_count": null,
|
| 111 |
+
"metadata": {},
|
| 112 |
+
"outputs": [],
|
| 113 |
+
"source": [
|
| 114 |
+
"# YOUR CODE HERE\n"
|
| 115 |
+
]
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"cell_type": "markdown",
|
| 119 |
+
"metadata": {},
|
| 120 |
+
"source": [
|
| 121 |
+
"<details>\n",
|
| 122 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"```python\n",
|
| 125 |
+
"pca_full = PCA().fit(X_scaled)\n",
|
| 126 |
+
"plt.plot(np.cumsum(pca_full.explained_variance_ratio_))\n",
|
| 127 |
+
"plt.xlabel('Number of Components')\n",
|
| 128 |
+
"plt.ylabel('Cumulative Explained Variance')\n",
|
| 129 |
+
"plt.axhline(y=0.95, color='r', linestyle='--')\n",
|
| 130 |
+
"plt.title('Scree Plot: Finding the Elbow')\n",
|
| 131 |
+
"plt.show()\n",
|
| 132 |
+
"```\n",
|
| 133 |
+
"</details>"
|
| 134 |
+
]
|
| 135 |
+
},
|
| 136 |
+
{
|
| 137 |
+
"cell_type": "markdown",
|
| 138 |
+
"metadata": {},
|
| 139 |
+
"source": [
|
| 140 |
+
"--- \n",
|
| 141 |
+
"### Excellent Compression! \n",
|
| 142 |
+
"You've learned how to simplify complex data without losing the big picture.\n",
|
| 143 |
+
"Next: **Advanced Clustering (DBSCAN & Hierarchical)**."
|
| 144 |
+
]
|
| 145 |
+
}
|
| 146 |
+
],
|
| 147 |
+
"metadata": {
|
| 148 |
+
"kernelspec": {
|
| 149 |
+
"display_name": "Python 3",
|
| 150 |
+
"language": "python",
|
| 151 |
+
"name": "python3"
|
| 152 |
+
},
|
| 153 |
+
"language_info": {
|
| 154 |
+
"codemirror_mode": {
|
| 155 |
+
"name": "ipython",
|
| 156 |
+
"version": 3
|
| 157 |
+
},
|
| 158 |
+
"file_extension": ".py",
|
| 159 |
+
"mimetype": "text/x-python",
|
| 160 |
+
"name": "python",
|
| 161 |
+
"nbconvert_exporter": "python",
|
| 162 |
+
"pygments_lexer": "ipython3",
|
| 163 |
+
"version": "3.12.7"
|
| 164 |
+
}
|
| 165 |
+
},
|
| 166 |
+
"nbformat": 4,
|
| 167 |
+
"nbformat_minor": 4
|
| 168 |
+
}
|
ML/17_Neural_Networks_Deep_Learning.ipynb
ADDED
|
@@ -0,0 +1,166 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 16 - Neural Networks (Deep Learning Foundations)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 16! We are entering the world of **Deep Learning**. We'll start with the building block of all neural networks: the **Perceptron** and the **Multi-Layer Perceptron (MLP)**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Visit your hub's **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section to review Calculus (Backpropagation/Partial Derivatives) which is the engine of Deep Learning.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Neural Network Architecture**: Inputs, Hidden Layers, and Outputs.\n",
|
| 16 |
+
"2. **Activation Functions**: Sigmoid, ReLU, and Softmax.\n",
|
| 17 |
+
"3. **Training Process**: Forward Propagation & Backpropagation.\n",
|
| 18 |
+
"4. **Optimization**: Stochastic Gradient Descent (SGD) and Adam.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"---"
|
| 21 |
+
]
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"cell_type": "markdown",
|
| 25 |
+
"metadata": {},
|
| 26 |
+
"source": [
|
| 27 |
+
"## 1. Setup\n",
|
| 28 |
+
"We will use the **MNIST** dataset (Handwritten digits) but via Scikit-Learn's easy-to-use MLP interface for this foundation module."
|
| 29 |
+
]
|
| 30 |
+
},
|
| 31 |
+
{
|
| 32 |
+
"cell_type": "code",
|
| 33 |
+
"execution_count": null,
|
| 34 |
+
"metadata": {},
|
| 35 |
+
"outputs": [],
|
| 36 |
+
"source": [
|
| 37 |
+
"import matplotlib.pyplot as plt\n",
|
| 38 |
+
"from sklearn.datasets import fetch_openml\n",
|
| 39 |
+
"from sklearn.neural_network import MLPClassifier\n",
|
| 40 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 41 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 42 |
+
"from sklearn.metrics import classification_report, confusion_matrix\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"# Load digits (MNIST small version)\n",
|
| 45 |
+
"X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False, parser='auto')\n",
|
| 46 |
+
"\n",
|
| 47 |
+
"# Use a subset for speed in practice\n",
|
| 48 |
+
"X = X[:5000] / 255.0\n",
|
| 49 |
+
"y = y[:5000]\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 52 |
+
"print(\"Training Shape:\", X_train.shape)"
|
| 53 |
+
]
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"cell_type": "markdown",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"source": [
|
| 59 |
+
"## 2. Multi-Layer Perceptron (MLP)\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"### Task 1: Building the Network\n",
|
| 62 |
+
"Configure an `MLPClassifier` with:\n",
|
| 63 |
+
"1. Two hidden layers (size 50 each).\n",
|
| 64 |
+
"2. 'relu' activation function.\n",
|
| 65 |
+
"3. 'adam' solver.\n",
|
| 66 |
+
"4. Max 20 iterations to start."
|
| 67 |
+
]
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"cell_type": "code",
|
| 71 |
+
"execution_count": null,
|
| 72 |
+
"metadata": {},
|
| 73 |
+
"outputs": [],
|
| 74 |
+
"source": [
|
| 75 |
+
"# YOUR CODE HERE\n"
|
| 76 |
+
]
|
| 77 |
+
},
|
| 78 |
+
{
|
| 79 |
+
"cell_type": "markdown",
|
| 80 |
+
"metadata": {},
|
| 81 |
+
"source": [
|
| 82 |
+
"<details>\n",
|
| 83 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 84 |
+
"\n",
|
| 85 |
+
"```python\n",
|
| 86 |
+
"mlp = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=20, alpha=1e-4,\n",
|
| 87 |
+
" solver='adam', verbose=10, random_state=1, \n",
|
| 88 |
+
" learning_rate_init=.1)\n",
|
| 89 |
+
"mlp.fit(X_train, y_train)\n",
|
| 90 |
+
"```\n",
|
| 91 |
+
"</details>"
|
| 92 |
+
]
|
| 93 |
+
},
|
| 94 |
+
{
|
| 95 |
+
"cell_type": "markdown",
|
| 96 |
+
"metadata": {},
|
| 97 |
+
"source": [
|
| 98 |
+
"## 3. Detailed Evaluation\n",
|
| 99 |
+
"\n",
|
| 100 |
+
"### Task 2: Confusion Matrix\n",
|
| 101 |
+
"Neural networks can often confuse similar digits (like 4 and 9). Plot the confusion matrix to see where your model is struggling."
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "code",
|
| 106 |
+
"execution_count": null,
|
| 107 |
+
"metadata": {},
|
| 108 |
+
"outputs": [],
|
| 109 |
+
"source": [
|
| 110 |
+
"import seaborn as sns\n",
|
| 111 |
+
"\n",
|
| 112 |
+
"# YOUR CODE HERE\n"
|
| 113 |
+
]
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"cell_type": "markdown",
|
| 117 |
+
"metadata": {},
|
| 118 |
+
"source": [
|
| 119 |
+
"<details>\n",
|
| 120 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"```python\n",
|
| 123 |
+
"y_pred = mlp.predict(X_test)\n",
|
| 124 |
+
"cm = confusion_matrix(y_test, y_pred)\n",
|
| 125 |
+
"plt.figure(figsize=(10,7))\n",
|
| 126 |
+
"sns.heatmap(cm, annot=True, fmt='d', cmap='Oranges')\n",
|
| 127 |
+
"plt.xlabel('Predicted')\n",
|
| 128 |
+
"plt.ylabel('Actual')\n",
|
| 129 |
+
"plt.show()\n",
|
| 130 |
+
"```\n",
|
| 131 |
+
"</details>"
|
| 132 |
+
]
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"cell_type": "markdown",
|
| 136 |
+
"metadata": {},
|
| 137 |
+
"source": [
|
| 138 |
+
"--- \n",
|
| 139 |
+
"### Congratulations! \n",
|
| 140 |
+
"You've trained your first Neural Network. This is the foundation for Computer Vision and NLP.\n",
|
| 141 |
+
"Next: **Reinforcement Learning**."
|
| 142 |
+
]
|
| 143 |
+
}
|
| 144 |
+
],
|
| 145 |
+
"metadata": {
|
| 146 |
+
"kernelspec": {
|
| 147 |
+
"display_name": "Python 3",
|
| 148 |
+
"language": "python",
|
| 149 |
+
"name": "python3"
|
| 150 |
+
},
|
| 151 |
+
"language_info": {
|
| 152 |
+
"codemirror_mode": {
|
| 153 |
+
"name": "ipython",
|
| 154 |
+
"version": 3
|
| 155 |
+
},
|
| 156 |
+
"file_extension": ".py",
|
| 157 |
+
"mimetype": "text/x-python",
|
| 158 |
+
"name": "python",
|
| 159 |
+
"nbconvert_exporter": "python",
|
| 160 |
+
"pygments_lexer": "ipython3",
|
| 161 |
+
"version": "3.12.7"
|
| 162 |
+
}
|
| 163 |
+
},
|
| 164 |
+
"nbformat": 4,
|
| 165 |
+
"nbformat_minor": 4
|
| 166 |
+
}
|
ML/18_Time_Series_Analysis.ipynb
ADDED
|
@@ -0,0 +1,159 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 18 - Time Series Analysis\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 18! **Time Series Analysis** is the study of data points collected or recorded at specific time intervals. This is crucial for finance, weather forecasting, and inventory management.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Datetime Handling**: Converting strings to date objects.\n",
|
| 13 |
+
"2. **Resampling & Rolling Windows**: Smoothing data trends.\n",
|
| 14 |
+
"3. **Stationarity**: Understanding the Mean and Variance over time.\n",
|
| 15 |
+
"4. **Forecasting**: A simple look at the Moving Average model.\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"---"
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## 1. Setup\n",
|
| 25 |
+
"We will use the **Air Passengers** dataset, which shows monthly totals of international airline passengers from 1949 to 1960."
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "code",
|
| 30 |
+
"execution_count": null,
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"import pandas as pd\n",
|
| 35 |
+
"import numpy as np\n",
|
| 36 |
+
"import matplotlib.pyplot as plt\n",
|
| 37 |
+
"import seaborn as sns\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"# Load dataset\n",
|
| 40 |
+
"url = \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv\"\n",
|
| 41 |
+
"df = pd.read_csv(url, parse_dates=['Month'], index_index=True)\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"print(\"Dataset head:\")\n",
|
| 44 |
+
"print(df.head())\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"plt.figure(figsize=(12, 6))\n",
|
| 47 |
+
"plt.plot(df)\n",
|
| 48 |
+
"plt.title('Monthly International Airline Passengers')\n",
|
| 49 |
+
"plt.show()"
|
| 50 |
+
]
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"cell_type": "markdown",
|
| 54 |
+
"metadata": {},
|
| 55 |
+
"source": [
|
| 56 |
+
"## 2. Feature Extraction from Time\n",
|
| 57 |
+
"\n",
|
| 58 |
+
"### Task 1: Component Extraction\n",
|
| 59 |
+
"Extract the `Year`, `Month`, and `Day of Week` from the index into new columns.\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"*Web Reference: [Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/) (Time features section).*"
|
| 62 |
+
]
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "code",
|
| 66 |
+
"execution_count": null,
|
| 67 |
+
"metadata": {},
|
| 68 |
+
"outputs": [],
|
| 69 |
+
"source": [
|
| 70 |
+
"# YOUR CODE HERE\n"
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "markdown",
|
| 75 |
+
"metadata": {},
|
| 76 |
+
"source": [
|
| 77 |
+
"<details>\n",
|
| 78 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 79 |
+
"\n",
|
| 80 |
+
"```python\n",
|
| 81 |
+
"df['year'] = df.index.year\n",
|
| 82 |
+
"df['month'] = df.index.month\n",
|
| 83 |
+
"df['day_of_week'] = df.index.dayofweek\n",
|
| 84 |
+
"df.head()\n",
|
| 85 |
+
"```\n",
|
| 86 |
+
"</details>"
|
| 87 |
+
]
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"cell_type": "markdown",
|
| 91 |
+
"metadata": {},
|
| 92 |
+
"source": [
|
| 93 |
+
"## 3. Smoothing Trends\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"### Task 2: Rolling Mean\n",
|
| 96 |
+
"Calculate and plot a 12-month rolling mean to see the yearly trend more clearly."
|
| 97 |
+
]
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"cell_type": "code",
|
| 101 |
+
"execution_count": null,
|
| 102 |
+
"metadata": {},
|
| 103 |
+
"outputs": [],
|
| 104 |
+
"source": [
|
| 105 |
+
"# YOUR CODE HERE\n"
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "markdown",
|
| 110 |
+
"metadata": {},
|
| 111 |
+
"source": [
|
| 112 |
+
"<details>\n",
|
| 113 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 114 |
+
"\n",
|
| 115 |
+
"```python\n",
|
| 116 |
+
"rolling_mean = df['Passengers'].rolling(window=12).mean()\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"plt.figure(figsize=(12, 6))\n",
|
| 119 |
+
"plt.plot(df['Passengers'], label='Original')\n",
|
| 120 |
+
"plt.plot(rolling_mean, color='red', label='12-Month Rolling Mean')\n",
|
| 121 |
+
"plt.legend()\n",
|
| 122 |
+
"plt.show()\n",
|
| 123 |
+
"```\n",
|
| 124 |
+
"</details>"
|
| 125 |
+
]
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"cell_type": "markdown",
|
| 129 |
+
"metadata": {},
|
| 130 |
+
"source": [
|
| 131 |
+
"--- \n",
|
| 132 |
+
"### Excellent Forecast! \n",
|
| 133 |
+
"Time Series is a deep field. You've now mastered the basics of handling temporal data.\n",
|
| 134 |
+
"Next: **Natural Language Processing (NLP)**."
|
| 135 |
+
]
|
| 136 |
+
}
|
| 137 |
+
],
|
| 138 |
+
"metadata": {
|
| 139 |
+
"kernelspec": {
|
| 140 |
+
"display_name": "Python 3",
|
| 141 |
+
"language": "python",
|
| 142 |
+
"name": "python3"
|
| 143 |
+
},
|
| 144 |
+
"language_info": {
|
| 145 |
+
"codemirror_mode": {
|
| 146 |
+
"name": "ipython",
|
| 147 |
+
"version": 3
|
| 148 |
+
},
|
| 149 |
+
"file_extension": ".py",
|
| 150 |
+
"mimetype": "text/x-python",
|
| 151 |
+
"name": "python",
|
| 152 |
+
"nbconvert_exporter": "python",
|
| 153 |
+
"pygments_lexer": "ipython3",
|
| 154 |
+
"version": "3.12.7"
|
| 155 |
+
}
|
| 156 |
+
},
|
| 157 |
+
"nbformat": 4,
|
| 158 |
+
"nbformat_minor": 4
|
| 159 |
+
}
|
ML/19_Natural_Language_Processing_NLP.ipynb
ADDED
|
@@ -0,0 +1,162 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 19 - Natural Language Processing (NLP)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 19! **Natural Language Processing** allows machines to understand, interpret, and generate human language. This is the tech behind Siri, Google Translate, and ChatGPT.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Text Cleaning**: Removing punctuation and stopwords.\n",
|
| 13 |
+
"2. **Tokenization & Lemmatization**: Breaking down words to their roots.\n",
|
| 14 |
+
"3. **TF-IDF**: Weighing word importance in a document.\n",
|
| 15 |
+
"4. **Sentiment Analysis**: Predicting if a text is positive or negative.\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"---"
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## 1. Setup\n",
|
| 25 |
+
"We will use a dataset of movie reviews to perform sentiment analysis."
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "code",
|
| 30 |
+
"execution_count": null,
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"import pandas as pd\n",
|
| 35 |
+
"import numpy as np\n",
|
| 36 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 37 |
+
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
| 38 |
+
"from sklearn.linear_model import LogisticRegression\n",
|
| 39 |
+
"from sklearn.metrics import accuracy_score\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"# Sample Dataset\n",
|
| 42 |
+
"reviews = [\n",
|
| 43 |
+
" (\"I loved this movie! The acting was great.\", 1),\n",
|
| 44 |
+
" (\"Terrible film, a complete waste of time.\", 0),\n",
|
| 45 |
+
" (\"The plot was boring but the music was okay.\", 0),\n",
|
| 46 |
+
" (\"Truly a masterpiece of cinema.\", 1),\n",
|
| 47 |
+
" (\"I would not recommend this to anybody.\", 0),\n",
|
| 48 |
+
" (\"Best experience I have had in a theater.\", 1)\n",
|
| 49 |
+
"]\n",
|
| 50 |
+
"df = pd.DataFrame(reviews, columns=['text', 'sentiment'])\n",
|
| 51 |
+
"df"
|
| 52 |
+
]
|
| 53 |
+
},
|
| 54 |
+
{
|
| 55 |
+
"cell_type": "markdown",
|
| 56 |
+
"metadata": {},
|
| 57 |
+
"source": [
|
| 58 |
+
"## 2. Text Transformation\n",
|
| 59 |
+
"\n",
|
| 60 |
+
"### Task 1: TF-IDF Vectorization\n",
|
| 61 |
+
"Convert the text reviews into a numerical matrix using `TfidfVectorizer` (Term Frequency-Inverse Document Frequency).\n",
|
| 62 |
+
"\n",
|
| 63 |
+
"*Web Reference: [ML Guide - NLP Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)*"
|
| 64 |
+
]
|
| 65 |
+
},
|
| 66 |
+
{
|
| 67 |
+
"cell_type": "code",
|
| 68 |
+
"execution_count": null,
|
| 69 |
+
"metadata": {},
|
| 70 |
+
"outputs": [],
|
| 71 |
+
"source": [
|
| 72 |
+
"# YOUR CODE HERE\n"
|
| 73 |
+
]
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"cell_type": "markdown",
|
| 77 |
+
"metadata": {},
|
| 78 |
+
"source": [
|
| 79 |
+
"<details>\n",
|
| 80 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 81 |
+
"\n",
|
| 82 |
+
"```python\n",
|
| 83 |
+
"tfidf = TfidfVectorizer(stop_words='english')\n",
|
| 84 |
+
"X = tfidf.fit_transform(df['text'])\n",
|
| 85 |
+
"y = df['sentiment']\n",
|
| 86 |
+
"print(\"Feature names:\", tfidf.get_feature_names_out()[:10])\n",
|
| 87 |
+
"```\n",
|
| 88 |
+
"</details>"
|
| 89 |
+
]
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"cell_type": "markdown",
|
| 93 |
+
"metadata": {},
|
| 94 |
+
"source": [
|
| 95 |
+
"## 3. Sentiment Classification\n",
|
| 96 |
+
"\n",
|
| 97 |
+
"### Task 2: Training the Classifier\n",
|
| 98 |
+
"Train a `LogisticRegression` model on the TF-IDF matrix and predict the sentiment of: \"This was a really fun movie!\""
|
| 99 |
+
]
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"cell_type": "code",
|
| 103 |
+
"execution_count": null,
|
| 104 |
+
"metadata": {},
|
| 105 |
+
"outputs": [],
|
| 106 |
+
"source": [
|
| 107 |
+
"# YOUR CODE HERE\n"
|
| 108 |
+
]
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"cell_type": "markdown",
|
| 112 |
+
"metadata": {},
|
| 113 |
+
"source": [
|
| 114 |
+
"<details>\n",
|
| 115 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 116 |
+
"\n",
|
| 117 |
+
"```python\n",
|
| 118 |
+
"model = LogisticRegression()\n",
|
| 119 |
+
"model.fit(X, y)\n",
|
| 120 |
+
"\n",
|
| 121 |
+
"new_review = [\"This was a really fun movie!\"]\n",
|
| 122 |
+
"new_vec = tfidf.transform(new_review)\n",
|
| 123 |
+
"pred = model.predict(new_vec)\n",
|
| 124 |
+
"\n",
|
| 125 |
+
"print(\"Positive\" if pred[0] == 1 else \"Negative\")\n",
|
| 126 |
+
"```\n",
|
| 127 |
+
"</details>"
|
| 128 |
+
]
|
| 129 |
+
},
|
| 130 |
+
{
|
| 131 |
+
"cell_type": "markdown",
|
| 132 |
+
"metadata": {},
|
| 133 |
+
"source": [
|
| 134 |
+
"--- \n",
|
| 135 |
+
"### NLP Mission Accomplished! \n",
|
| 136 |
+
"You've learned how to turn human language into math. \n",
|
| 137 |
+
"This is your final module in the core series!"
|
| 138 |
+
]
|
| 139 |
+
}
|
| 140 |
+
],
|
| 141 |
+
"metadata": {
|
| 142 |
+
"kernelspec": {
|
| 143 |
+
"display_name": "Python 3",
|
| 144 |
+
"language": "python",
|
| 145 |
+
"name": "python3"
|
| 146 |
+
},
|
| 147 |
+
"language_info": {
|
| 148 |
+
"codemirror_mode": {
|
| 149 |
+
"name": "ipython",
|
| 150 |
+
"version": 3
|
| 151 |
+
},
|
| 152 |
+
"file_extension": ".py",
|
| 153 |
+
"mimetype": "text/x-python",
|
| 154 |
+
"name": "python",
|
| 155 |
+
"nbconvert_exporter": "python",
|
| 156 |
+
"pygments_lexer": "ipython3",
|
| 157 |
+
"version": "3.12.7"
|
| 158 |
+
}
|
| 159 |
+
},
|
| 160 |
+
"nbformat": 4,
|
| 161 |
+
"nbformat_minor": 4
|
| 162 |
+
}
|
ML/20_Reinforcement_Learning_Basics.ipynb
ADDED
|
@@ -0,0 +1,194 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 17 - Reinforcement Learning (Q-Learning)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to Module 17! We are exploring **Reinforcement Learning** (RL). Unlike supervised learning, RL agents learn by interacting with an environment and receiving rewards or penalties.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Resources:\n",
|
| 12 |
+
"Check out the **[Q-Learning Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for a breakdown of the Bellman Equation ($Q(s,a)$) and how the Agent-Environment loop works.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Objectives:\n",
|
| 15 |
+
"1. **Agent-Environment Loop**: States, Actions, and Rewards.\n",
|
| 16 |
+
"2. **Exploration vs. Exploitation**: The Epsilon-Greedy strategy.\n",
|
| 17 |
+
"3. **Q-Table**: Learning the quality of actions.\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Environment Simulation\n",
|
| 27 |
+
"We will implement a simple \"Grid World\" where an agent has to find a treasure while avoiding traps."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import numpy as np\n",
|
| 37 |
+
"import matplotlib.pyplot as plt\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"class SimpleGridWorld:\n",
|
| 40 |
+
" def __init__(self, size=5):\n",
|
| 41 |
+
" self.size = size\n",
|
| 42 |
+
" self.state = (0, 0)\n",
|
| 43 |
+
" self.goal = (size-1, size-1)\n",
|
| 44 |
+
" self.trap = (size//2, size//2)\n",
|
| 45 |
+
" \n",
|
| 46 |
+
" def step(self, action):\n",
|
| 47 |
+
" # 0=Up, 1=Down, 2=Left, 3=Right\n",
|
| 48 |
+
" r, c = self.state\n",
|
| 49 |
+
" if action == 0: r = max(0, r-1)\n",
|
| 50 |
+
" elif action == 1: r = min(self.size-1, r+1)\n",
|
| 51 |
+
" elif action == 2: c = max(0, c-1)\n",
|
| 52 |
+
" elif action == 3: c = min(self.size-1, c+1)\n",
|
| 53 |
+
" \n",
|
| 54 |
+
" self.state = (r, c)\n",
|
| 55 |
+
" \n",
|
| 56 |
+
" if self.state == self.goal:\n",
|
| 57 |
+
" return self.state, 10, True\n",
|
| 58 |
+
" elif self.state == self.trap:\n",
|
| 59 |
+
" return self.state, -5, True\n",
|
| 60 |
+
" return self.state, -1, False\n",
|
| 61 |
+
"\n",
|
| 62 |
+
" def reset(self):\n",
|
| 63 |
+
" self.state = (0, 0)\n",
|
| 64 |
+
" return self.state\n",
|
| 65 |
+
"\n",
|
| 66 |
+
"env = SimpleGridWorld()\n",
|
| 67 |
+
"print(\"Environment initialized!\")"
|
| 68 |
+
]
|
| 69 |
+
},
|
| 70 |
+
{
|
| 71 |
+
"cell_type": "markdown",
|
| 72 |
+
"metadata": {},
|
| 73 |
+
"source": [
|
| 74 |
+
"## 2. Q-Learning Algorithm\n",
|
| 75 |
+
"\n",
|
| 76 |
+
"### Task 1: Training the Agent\n",
|
| 77 |
+
"Initialize a Q-Table (5x5x4) with zeros and train the agent for 1000 episodes using the update rule:\n",
|
| 78 |
+
"$Q(s, a) = Q(s, a) + \\alpha [R + \\gamma \\max Q(s', a') - Q(s, a)]$"
|
| 79 |
+
]
|
| 80 |
+
},
|
| 81 |
+
{
|
| 82 |
+
"cell_type": "code",
|
| 83 |
+
"execution_count": null,
|
| 84 |
+
"metadata": {},
|
| 85 |
+
"outputs": [],
|
| 86 |
+
"source": [
|
| 87 |
+
"alpha = 0.1 # Learning rate\n",
|
| 88 |
+
"gamma = 0.9 # Discount factor\n",
|
| 89 |
+
"epsilon = 0.2 # Exploration rate\n",
|
| 90 |
+
"q_table = np.zeros((5, 5, 4))\n",
|
| 91 |
+
"\n",
|
| 92 |
+
"# YOUR CODE HERE\n"
|
| 93 |
+
]
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"cell_type": "markdown",
|
| 97 |
+
"metadata": {},
|
| 98 |
+
"source": [
|
| 99 |
+
"<details>\n",
|
| 100 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 101 |
+
"\n",
|
| 102 |
+
"```python\n",
|
| 103 |
+
"for episode in range(1000):\n",
|
| 104 |
+
" state = env.reset()\n",
|
| 105 |
+
" done = False\n",
|
| 106 |
+
" \n",
|
| 107 |
+
" while not done:\n",
|
| 108 |
+
" # Choose action\n",
|
| 109 |
+
" if np.random.uniform(0, 1) < epsilon:\n",
|
| 110 |
+
" action = np.random.choice(4) # Explore\n",
|
| 111 |
+
" else:\n",
|
| 112 |
+
" action = np.argmax(q_table[state[0], state[1]]) # Exploit\n",
|
| 113 |
+
" \n",
|
| 114 |
+
" next_state, reward, done = env.step(action)\n",
|
| 115 |
+
" \n",
|
| 116 |
+
" # Update Q-table\n",
|
| 117 |
+
" old_value = q_table[state[0], state[1], action]\n",
|
| 118 |
+
" next_max = np.max(q_table[next_state[0], next_state[1]])\n",
|
| 119 |
+
" \n",
|
| 120 |
+
" new_value = old_value + alpha * (reward + gamma * next_max - old_value)\n",
|
| 121 |
+
" q_table[state[0], state[1], action] = new_value\n",
|
| 122 |
+
" \n",
|
| 123 |
+
" state = next_state\n",
|
| 124 |
+
"```\n",
|
| 125 |
+
"</details>"
|
| 126 |
+
]
|
| 127 |
+
},
|
| 128 |
+
{
|
| 129 |
+
"cell_type": "markdown",
|
| 130 |
+
"metadata": {},
|
| 131 |
+
"source": [
|
| 132 |
+
"## 3. Policy Visualization\n",
|
| 133 |
+
"\n",
|
| 134 |
+
"### Task 2: What did it learn?\n",
|
| 135 |
+
"Display the learned policy by showing the best action for each cell in the grid."
|
| 136 |
+
]
|
| 137 |
+
},
|
| 138 |
+
{
|
| 139 |
+
"cell_type": "code",
|
| 140 |
+
"execution_count": null,
|
| 141 |
+
"metadata": {},
|
| 142 |
+
"outputs": [],
|
| 143 |
+
"source": [
|
| 144 |
+
"# YOUR CODE HERE\n"
|
| 145 |
+
]
|
| 146 |
+
},
|
| 147 |
+
{
|
| 148 |
+
"cell_type": "markdown",
|
| 149 |
+
"metadata": {},
|
| 150 |
+
"source": [
|
| 151 |
+
"<details>\n",
|
| 152 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"```python\n",
|
| 155 |
+
"policy = np.argmax(q_table, axis=2)\n",
|
| 156 |
+
"print(\"Learned Policy (0=Up, 1=Down, 2=Left, 3=Right):\")\n",
|
| 157 |
+
"print(policy)\n",
|
| 158 |
+
"```\n",
|
| 159 |
+
"</details>"
|
| 160 |
+
]
|
| 161 |
+
},
|
| 162 |
+
{
|
| 163 |
+
"cell_type": "markdown",
|
| 164 |
+
"metadata": {},
|
| 165 |
+
"source": [
|
| 166 |
+
"--- \n",
|
| 167 |
+
"### Awesome Work! \n",
|
| 168 |
+
"You've implemented a classic RL agent from scratch. This is how robots and game AI learn!\n",
|
| 169 |
+
"You have now completed the entire practice series!"
|
| 170 |
+
]
|
| 171 |
+
}
|
| 172 |
+
],
|
| 173 |
+
"metadata": {
|
| 174 |
+
"kernelspec": {
|
| 175 |
+
"display_name": "Python 3",
|
| 176 |
+
"language": "python",
|
| 177 |
+
"name": "python3"
|
| 178 |
+
},
|
| 179 |
+
"language_info": {
|
| 180 |
+
"codemirror_mode": {
|
| 181 |
+
"name": "ipython",
|
| 182 |
+
"version": 3
|
| 183 |
+
},
|
| 184 |
+
"file_extension": ".py",
|
| 185 |
+
"mimetype": "text/x-python",
|
| 186 |
+
"name": "python",
|
| 187 |
+
"nbconvert_exporter": "python",
|
| 188 |
+
"pygments_lexer": "ipython3",
|
| 189 |
+
"version": "3.12.7"
|
| 190 |
+
}
|
| 191 |
+
},
|
| 192 |
+
"nbformat": 4,
|
| 193 |
+
"nbformat_minor": 4
|
| 194 |
+
}
|
ML/21_Kaggle_Project_Medical_Costs.ipynb
ADDED
|
@@ -0,0 +1,270 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 07 - Capstone Project (Real-World Pipeline)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"In this project, we will apply everything we've learnedβfrom Statistics and EDA to Model Evaluationβusing a real-world dataset often found on **Kaggle**: The **Medical Cost Personal Dataset**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Project Goal:\n",
|
| 12 |
+
"Predict the individual medical costs billed by health insurance based on various user attributes (Age, Sex, BMI, Children, Smoker, Region).\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"### Integrated Resources:\n",
|
| 15 |
+
"- **Web Ref**: [Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/) (for handling 'Smoker' and 'Region' encoding).\n",
|
| 16 |
+
"- **Web Ref**: [Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/) (for checking the distribution of charges).\n",
|
| 17 |
+
"- **Web Ref**: [ML Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) (for choosing the right regression algorithm).\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"---"
|
| 20 |
+
]
|
| 21 |
+
},
|
| 22 |
+
{
|
| 23 |
+
"cell_type": "markdown",
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"source": [
|
| 26 |
+
"## 1. Data Acquisition\n",
|
| 27 |
+
"We will pull the raw data directly from a public repository, similar to how you would download a CSV from Kaggle."
|
| 28 |
+
]
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"cell_type": "code",
|
| 32 |
+
"execution_count": null,
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"outputs": [],
|
| 35 |
+
"source": [
|
| 36 |
+
"import pandas as pd\n",
|
| 37 |
+
"import numpy as np\n",
|
| 38 |
+
"import matplotlib.pyplot as plt\n",
|
| 39 |
+
"import seaborn as sns\n",
|
| 40 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 41 |
+
"from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
|
| 42 |
+
"from sklearn.ensemble import RandomForestRegressor\n",
|
| 43 |
+
"from sklearn.metrics import mean_absolute_error, r2_score\n",
|
| 44 |
+
"\n",
|
| 45 |
+
"# Load the dataset\n",
|
| 46 |
+
"url = \"https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv\"\n",
|
| 47 |
+
"df = pd.read_csv(url)\n",
|
| 48 |
+
"\n",
|
| 49 |
+
"print(\"Dataset size:\", df.shape)\n",
|
| 50 |
+
"df.head()"
|
| 51 |
+
]
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"cell_type": "markdown",
|
| 55 |
+
"metadata": {},
|
| 56 |
+
"source": [
|
| 57 |
+
"## 2. Phase 1: Exploratory Data Analysis (EDA)\n",
|
| 58 |
+
"\n",
|
| 59 |
+
"### Task 1: Correlation Analysis\n",
|
| 60 |
+
"Since we want to predict `charges`, create a heatmap to see which features (after converting categories) correlate most with medical costs."
|
| 61 |
+
]
|
| 62 |
+
},
|
| 63 |
+
{
|
| 64 |
+
"cell_type": "code",
|
| 65 |
+
"execution_count": null,
|
| 66 |
+
"metadata": {},
|
| 67 |
+
"outputs": [],
|
| 68 |
+
"source": [
|
| 69 |
+
"# YOUR CODE HERE\n"
|
| 70 |
+
]
|
| 71 |
+
},
|
| 72 |
+
{
|
| 73 |
+
"cell_type": "markdown",
|
| 74 |
+
"metadata": {},
|
| 75 |
+
"source": [
|
| 76 |
+
"<details>\n",
|
| 77 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 78 |
+
"\n",
|
| 79 |
+
"```python\n",
|
| 80 |
+
"# Temporary encoding just to see correlations\n",
|
| 81 |
+
"df_temp = df.copy()\n",
|
| 82 |
+
"for col in ['sex', 'smoker', 'region']: \n",
|
| 83 |
+
" df_temp[col] = LabelEncoder().fit_transform(df_temp[col])\n",
|
| 84 |
+
"\n",
|
| 85 |
+
"plt.figure(figsize=(10, 8))\n",
|
| 86 |
+
"sns.heatmap(df_temp.corr(), annot=True, cmap='coolwarm')\n",
|
| 87 |
+
"plt.title('Feature Correlation Heatmap')\n",
|
| 88 |
+
"plt.show()\n",
|
| 89 |
+
"```\n",
|
| 90 |
+
"</details>"
|
| 91 |
+
]
|
| 92 |
+
},
|
| 93 |
+
{
|
| 94 |
+
"cell_type": "markdown",
|
| 95 |
+
"metadata": {},
|
| 96 |
+
"source": [
|
| 97 |
+
"### Task 2: The 'Smoker' Effect\n",
|
| 98 |
+
"Visualization is key on Kaggle. Create a boxplot or violin plot showing `charges` separated by `smoker` status."
|
| 99 |
+
]
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"cell_type": "code",
|
| 103 |
+
"execution_count": null,
|
| 104 |
+
"metadata": {},
|
| 105 |
+
"outputs": [],
|
| 106 |
+
"source": [
|
| 107 |
+
"# YOUR CODE HERE\n"
|
| 108 |
+
]
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"cell_type": "markdown",
|
| 112 |
+
"metadata": {},
|
| 113 |
+
"source": [
|
| 114 |
+
"<details>\n",
|
| 115 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 116 |
+
"\n",
|
| 117 |
+
"```python\n",
|
| 118 |
+
"sns.boxplot(x='smoker', y='charges', data=df)\n",
|
| 119 |
+
"plt.title('Effect of Smoking on Insurance Charges')\n",
|
| 120 |
+
"plt.show()\n",
|
| 121 |
+
"```\n",
|
| 122 |
+
"</details>"
|
| 123 |
+
]
|
| 124 |
+
},
|
| 125 |
+
{
|
| 126 |
+
"cell_type": "markdown",
|
| 127 |
+
"metadata": {},
|
| 128 |
+
"source": [
|
| 129 |
+
"## 3. Phase 2: Feature Engineering\n",
|
| 130 |
+
"\n",
|
| 131 |
+
"### Task 3: categorical Transformation\n",
|
| 132 |
+
"1. Binary encode `sex` and `smoker`.\n",
|
| 133 |
+
"2. One-hot encode the `region` column."
|
| 134 |
+
]
|
| 135 |
+
},
|
| 136 |
+
{
|
| 137 |
+
"cell_type": "code",
|
| 138 |
+
"execution_count": null,
|
| 139 |
+
"metadata": {},
|
| 140 |
+
"outputs": [],
|
| 141 |
+
"source": [
|
| 142 |
+
"# YOUR CODE HERE\n"
|
| 143 |
+
]
|
| 144 |
+
},
|
| 145 |
+
{
|
| 146 |
+
"cell_type": "markdown",
|
| 147 |
+
"metadata": {},
|
| 148 |
+
"source": [
|
| 149 |
+
"<details>\n",
|
| 150 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 151 |
+
"\n",
|
| 152 |
+
"```python\n",
|
| 153 |
+
"df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)\n",
|
| 154 |
+
"print(\"New Columns:\", df.columns.tolist())\n",
|
| 155 |
+
"```\n",
|
| 156 |
+
"</details>"
|
| 157 |
+
]
|
| 158 |
+
},
|
| 159 |
+
{
|
| 160 |
+
"cell_type": "markdown",
|
| 161 |
+
"metadata": {},
|
| 162 |
+
"source": [
|
| 163 |
+
"## 4. Phase 3: Modeling & Optimization\n",
|
| 164 |
+
"\n",
|
| 165 |
+
"### Task 4: Training & Evaluation\n",
|
| 166 |
+
"Divide the data. Train a `RandomForestRegressor` and evaluate using $R^2$ and Mean Absolute Error (MAE).\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"*Hint: Use the [Ensemble Methods Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) on your site to learn why Random Forest is great for this data.*"
|
| 169 |
+
]
|
| 170 |
+
},
|
| 171 |
+
{
|
| 172 |
+
"cell_type": "code",
|
| 173 |
+
"execution_count": null,
|
| 174 |
+
"metadata": {},
|
| 175 |
+
"outputs": [],
|
| 176 |
+
"source": [
|
| 177 |
+
"# YOUR CODE HERE\n"
|
| 178 |
+
]
|
| 179 |
+
},
|
| 180 |
+
{
|
| 181 |
+
"cell_type": "markdown",
|
| 182 |
+
"metadata": {},
|
| 183 |
+
"source": [
|
| 184 |
+
"<details>\n",
|
| 185 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 186 |
+
"\n",
|
| 187 |
+
"```python\n",
|
| 188 |
+
"X = df.drop('charges', axis=1)\n",
|
| 189 |
+
"y = df['charges']\n",
|
| 190 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 191 |
+
"\n",
|
| 192 |
+
"model = RandomForestRegressor(n_estimators=100, random_state=42)\n",
|
| 193 |
+
"model.fit(X_train, y_train)\n",
|
| 194 |
+
"\n",
|
| 195 |
+
"y_pred = model.predict(X_test)\n",
|
| 196 |
+
"\n",
|
| 197 |
+
"print(f\"R2 Score: {r2_score(y_test, y_pred):.4f}\")\n",
|
| 198 |
+
"print(f\"MAE: ${mean_absolute_error(y_test, y_pred):.2f}\")\n",
|
| 199 |
+
"```\n",
|
| 200 |
+
"</details>"
|
| 201 |
+
]
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
"cell_type": "markdown",
|
| 205 |
+
"metadata": {},
|
| 206 |
+
"source": [
|
| 207 |
+
"## 5. Phase 4: Interpretation\n",
|
| 208 |
+
"\n",
|
| 209 |
+
"### Task 5: Feature Importances\n",
|
| 210 |
+
"Which factor drives insurance prices the most? Visualize the model's feature importances."
|
| 211 |
+
]
|
| 212 |
+
},
|
| 213 |
+
{
|
| 214 |
+
"cell_type": "code",
|
| 215 |
+
"execution_count": null,
|
| 216 |
+
"metadata": {},
|
| 217 |
+
"outputs": [],
|
| 218 |
+
"source": [
|
| 219 |
+
"# YOUR CODE HERE\n"
|
| 220 |
+
]
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"cell_type": "markdown",
|
| 224 |
+
"metadata": {},
|
| 225 |
+
"source": [
|
| 226 |
+
"<details>\n",
|
| 227 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 228 |
+
"\n",
|
| 229 |
+
"```python\n",
|
| 230 |
+
"importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)\n",
|
| 231 |
+
"sns.barplot(x=importances, y=importances.index)\n",
|
| 232 |
+
"plt.title('Key Drivers of Medical Costs')\n",
|
| 233 |
+
"plt.show()\n",
|
| 234 |
+
"```\n",
|
| 235 |
+
"</details>"
|
| 236 |
+
]
|
| 237 |
+
},
|
| 238 |
+
{
|
| 239 |
+
"cell_type": "markdown",
|
| 240 |
+
"metadata": {},
|
| 241 |
+
"source": [
|
| 242 |
+
"--- \n",
|
| 243 |
+
"### Project Complete! \n",
|
| 244 |
+
"You've just completed a full Machine Learning cycle on real-world insurance data. \n",
|
| 245 |
+
"By combining the theory from your **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)** with this hands-on project, you are now ready for real Kaggle competitions!"
|
| 246 |
+
]
|
| 247 |
+
}
|
| 248 |
+
],
|
| 249 |
+
"metadata": {
|
| 250 |
+
"kernelspec": {
|
| 251 |
+
"display_name": "Python 3",
|
| 252 |
+
"language": "python",
|
| 253 |
+
"name": "python3"
|
| 254 |
+
},
|
| 255 |
+
"language_info": {
|
| 256 |
+
"codemirror_mode": {
|
| 257 |
+
"name": "ipython",
|
| 258 |
+
"version": 3
|
| 259 |
+
},
|
| 260 |
+
"file_extension": ".py",
|
| 261 |
+
"mimetype": "text/x-python",
|
| 262 |
+
"name": "python",
|
| 263 |
+
"nbconvert_exporter": "python",
|
| 264 |
+
"pygments_lexer": "ipython3",
|
| 265 |
+
"version": "3.8.0"
|
| 266 |
+
}
|
| 267 |
+
},
|
| 268 |
+
"nbformat": 4,
|
| 269 |
+
"nbformat_minor": 4
|
| 270 |
+
}
|
ML/22_SQL_for_Data_Science.ipynb
ADDED
|
@@ -0,0 +1,165 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 22 - SQL & Databases for Data Science\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"In the real world, data lives in databases, not just CSVs. This module teaches you how to bridge the gap between **SQL (Structured Query Language)** and **Python/Pandas**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Connecting to Databases**: Using `sqlite3` (built into Python).\n",
|
| 13 |
+
"2. **Basic Queries**: SELECT, WHERE, and JOIN in Python.\n",
|
| 14 |
+
"3. **SQL to Pandas**: Loading query results directly into a DataFrame.\n",
|
| 15 |
+
"4. **Database Design**: Understanding primary keys and foreign keys.\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"---"
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## 1. Setting up a Virtual Database\n",
|
| 25 |
+
"We will create an in-memory database and populate it with some sample Data Science job data."
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "code",
|
| 30 |
+
"execution_count": null,
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"import sqlite3\n",
|
| 35 |
+
"import pandas as pd\n",
|
| 36 |
+
"\n",
|
| 37 |
+
"# Create a connection to an in-memory database\n",
|
| 38 |
+
"conn = sqlite3.connect(':memory:')\n",
|
| 39 |
+
"cursor = conn.cursor()\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"# Create a sample table\n",
|
| 42 |
+
"cursor.execute('''\n",
|
| 43 |
+
" CREATE TABLE jobs (\n",
|
| 44 |
+
" id INTEGER PRIMARY KEY,\n",
|
| 45 |
+
" title TEXT,\n",
|
| 46 |
+
" company TEXT,\n",
|
| 47 |
+
" salary INTEGER\n",
|
| 48 |
+
" )\n",
|
| 49 |
+
"''')\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"# Insert sample records\n",
|
| 52 |
+
"jobs = [\n",
|
| 53 |
+
" (1, 'Data Scientist', 'Google', 150000),\n",
|
| 54 |
+
" (2, 'ML Engineer', 'Tesla', 160000),\n",
|
| 55 |
+
" (3, 'Data Analyst', 'Netflix', 120000),\n",
|
| 56 |
+
" (4, 'AI Research', 'OpenAI', 200000)\n",
|
| 57 |
+
"]\n",
|
| 58 |
+
"cursor.executemany('INSERT INTO jobs VALUES (?,?,?,?)', jobs)\n",
|
| 59 |
+
"conn.commit()\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"print(\"Database created and table populated!\")"
|
| 62 |
+
]
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "markdown",
|
| 66 |
+
"metadata": {},
|
| 67 |
+
"source": [
|
| 68 |
+
"## 2. Basic SQL Queries in Python\n",
|
| 69 |
+
"\n",
|
| 70 |
+
"### Task 1: Fetching Data\n",
|
| 71 |
+
"Use standard SQL to fetch all jobs where the salary is greater than 140,000."
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "code",
|
| 76 |
+
"execution_count": null,
|
| 77 |
+
"metadata": {},
|
| 78 |
+
"outputs": [],
|
| 79 |
+
"source": [
|
| 80 |
+
"# YOUR CODE HERE\n"
|
| 81 |
+
]
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"cell_type": "markdown",
|
| 85 |
+
"metadata": {},
|
| 86 |
+
"source": [
|
| 87 |
+
"<details>\n",
|
| 88 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"```python\n",
|
| 91 |
+
"query = \"SELECT * FROM jobs WHERE salary > 140000\"\n",
|
| 92 |
+
"cursor.execute(query)\n",
|
| 93 |
+
"results = cursor.fetchall()\n",
|
| 94 |
+
"for row in results:\n",
|
| 95 |
+
" print(row)\n",
|
| 96 |
+
"```\n",
|
| 97 |
+
"</details>"
|
| 98 |
+
]
|
| 99 |
+
},
|
| 100 |
+
{
|
| 101 |
+
"cell_type": "markdown",
|
| 102 |
+
"metadata": {},
|
| 103 |
+
"source": [
|
| 104 |
+
"## 3. SQL to Pandas: The Professional Way\n",
|
| 105 |
+
"\n",
|
| 106 |
+
"### Task 2: pd.read_sql_query\n",
|
| 107 |
+
"Professionals use `pd.read_sql_query()` to pull data directly into a DataFrame. Try it now."
|
| 108 |
+
]
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"cell_type": "code",
|
| 112 |
+
"execution_count": null,
|
| 113 |
+
"metadata": {},
|
| 114 |
+
"outputs": [],
|
| 115 |
+
"source": [
|
| 116 |
+
"# YOUR CODE HERE\n"
|
| 117 |
+
]
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"cell_type": "markdown",
|
| 121 |
+
"metadata": {},
|
| 122 |
+
"source": [
|
| 123 |
+
"<details>\n",
|
| 124 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"```python\n",
|
| 127 |
+
"df_sql = pd.read_sql_query(\"SELECT * FROM jobs\", conn)\n",
|
| 128 |
+
"print(df_sql.head())\n",
|
| 129 |
+
"```\n",
|
| 130 |
+
"</details>"
|
| 131 |
+
]
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"cell_type": "markdown",
|
| 135 |
+
"metadata": {},
|
| 136 |
+
"source": [
|
| 137 |
+
"--- \n",
|
| 138 |
+
"### Bridge Completed! \n",
|
| 139 |
+
"You now know how to pull data from any standard relational database.\n",
|
| 140 |
+
"Next: **Model Explainability (SHAP)**."
|
| 141 |
+
]
|
| 142 |
+
}
|
| 143 |
+
],
|
| 144 |
+
"metadata": {
|
| 145 |
+
"kernelspec": {
|
| 146 |
+
"display_name": "Python 3",
|
| 147 |
+
"language": "python",
|
| 148 |
+
"name": "python3"
|
| 149 |
+
},
|
| 150 |
+
"language_info": {
|
| 151 |
+
"codemirror_mode": {
|
| 152 |
+
"name": "ipython",
|
| 153 |
+
"version": 3
|
| 154 |
+
},
|
| 155 |
+
"file_extension": ".py",
|
| 156 |
+
"mimetype": "text/x-python",
|
| 157 |
+
"name": "python",
|
| 158 |
+
"nbconvert_exporter": "python",
|
| 159 |
+
"pygments_lexer": "ipython3",
|
| 160 |
+
"version": "3.12.7"
|
| 161 |
+
}
|
| 162 |
+
},
|
| 163 |
+
"nbformat": 4,
|
| 164 |
+
"nbformat_minor": 4
|
| 165 |
+
}
|
ML/23_Model_Explainability_SHAP.ipynb
ADDED
|
@@ -0,0 +1,158 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 23 - Model Explainability (SHAP)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to the final \"Industry-Grade\" module! **Model Explainability** is about knowing *why* your model made a decision. This is critical for building trust, especially in sensitive areas like finance or medicine.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Global Interpretability**: Which features matter most across the whole dataset?\n",
|
| 13 |
+
"2. **Local Interpretability**: Why was *this specific person* denied a loan?\n",
|
| 14 |
+
"3. **SHAP values**: Game-theoretic approach to feature contribution.\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"---"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"metadata": {},
|
| 22 |
+
"source": [
|
| 23 |
+
"## 1. Setup\n",
|
| 24 |
+
"We will use a small Random Forest classifier on the **Breast Cancer** dataset."
|
| 25 |
+
]
|
| 26 |
+
},
|
| 27 |
+
{
|
| 28 |
+
"cell_type": "code",
|
| 29 |
+
"execution_count": null,
|
| 30 |
+
"metadata": {},
|
| 31 |
+
"outputs": [],
|
| 32 |
+
"source": [
|
| 33 |
+
"import pandas as pd\n",
|
| 34 |
+
"import numpy as np\n",
|
| 35 |
+
"from sklearn.datasets import load_breast_cancer\n",
|
| 36 |
+
"from sklearn.ensemble import RandomForestClassifier\n",
|
| 37 |
+
"from sklearn.model_selection import train_test_split\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"# Note: You will need to install shap: pip install shap\n",
|
| 40 |
+
"import shap\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# Load data\n",
|
| 43 |
+
"data = load_breast_cancer()\n",
|
| 44 |
+
"X = pd.DataFrame(data.data, columns=data.feature_names)\n",
|
| 45 |
+
"y = data.target\n",
|
| 46 |
+
"\n",
|
| 47 |
+
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
|
| 48 |
+
"\n",
|
| 49 |
+
"# Train a model\n",
|
| 50 |
+
"model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
|
| 51 |
+
"model.fit(X_train, y_train)\n",
|
| 52 |
+
"\n",
|
| 53 |
+
"print(\"Model trained!\")"
|
| 54 |
+
]
|
| 55 |
+
},
|
| 56 |
+
{
|
| 57 |
+
"cell_type": "markdown",
|
| 58 |
+
"metadata": {},
|
| 59 |
+
"source": [
|
| 60 |
+
"## 2. Using SHAP (Global)\n",
|
| 61 |
+
"\n",
|
| 62 |
+
"### Task 1: Summary Plot\n",
|
| 63 |
+
"Create a SHAP Tree Explainer and plot a summary of the feature importances. This is more detailed than standard feature importance as it shows the direction (positive/negative) of the impact."
|
| 64 |
+
]
|
| 65 |
+
},
|
| 66 |
+
{
|
| 67 |
+
"cell_type": "code",
|
| 68 |
+
"execution_count": null,
|
| 69 |
+
"metadata": {},
|
| 70 |
+
"outputs": [],
|
| 71 |
+
"source": [
|
| 72 |
+
"# YOUR CODE HERE\n"
|
| 73 |
+
]
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"cell_type": "markdown",
|
| 77 |
+
"metadata": {},
|
| 78 |
+
"source": [
|
| 79 |
+
"<details>\n",
|
| 80 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 81 |
+
"\n",
|
| 82 |
+
"```python\n",
|
| 83 |
+
"explainer = shap.TreeExplainer(model)\n",
|
| 84 |
+
"shap_values = explainer.shap_values(X_test)\n",
|
| 85 |
+
"\n",
|
| 86 |
+
"# For binary classification, use [1] for the positive class\n",
|
| 87 |
+
"shap.summary_plot(shap_values[1], X_test)\n",
|
| 88 |
+
"```\n",
|
| 89 |
+
"</details>"
|
| 90 |
+
]
|
| 91 |
+
},
|
| 92 |
+
{
|
| 93 |
+
"cell_type": "markdown",
|
| 94 |
+
"metadata": {},
|
| 95 |
+
"source": [
|
| 96 |
+
"## 3. Local Performance\n",
|
| 97 |
+
"\n",
|
| 98 |
+
"### Task 2: Force Plot\n",
|
| 99 |
+
"Pick the first person in the test set and explain the model's prediction for them specifically using a force plot."
|
| 100 |
+
]
|
| 101 |
+
},
|
| 102 |
+
{
|
| 103 |
+
"cell_type": "code",
|
| 104 |
+
"execution_count": null,
|
| 105 |
+
"metadata": {},
|
| 106 |
+
"outputs": [],
|
| 107 |
+
"source": [
|
| 108 |
+
"# YOUR CODE HERE\n"
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"cell_type": "markdown",
|
| 113 |
+
"metadata": {},
|
| 114 |
+
"source": [
|
| 115 |
+
"<details>\n",
|
| 116 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"```python\n",
|
| 119 |
+
"# Plot for the first record in the test set\n",
|
| 120 |
+
"shap.initjs()\n",
|
| 121 |
+
"shap.force_plot(explainer.expected_value[1], shap_values[1][0,:], X_test.iloc[0,:])\n",
|
| 122 |
+
"```\n",
|
| 123 |
+
"</details>"
|
| 124 |
+
]
|
| 125 |
+
},
|
| 126 |
+
{
|
| 127 |
+
"cell_type": "markdown",
|
| 128 |
+
"metadata": {},
|
| 129 |
+
"source": [
|
| 130 |
+
"--- \n",
|
| 131 |
+
"### The Ultimate Skill Unlocked! \n",
|
| 132 |
+
"You can now explain black-box models to humans. This is the mark of a top-tier Data Scientist.\n",
|
| 133 |
+
"You have completed all 23 modules of the master series!"
|
| 134 |
+
]
|
| 135 |
+
}
|
| 136 |
+
],
|
| 137 |
+
"metadata": {
|
| 138 |
+
"kernelspec": {
|
| 139 |
+
"display_name": "Python 3",
|
| 140 |
+
"language": "python",
|
| 141 |
+
"name": "python3"
|
| 142 |
+
},
|
| 143 |
+
"language_info": {
|
| 144 |
+
"codemirror_mode": {
|
| 145 |
+
"name": "ipython",
|
| 146 |
+
"version": 3
|
| 147 |
+
},
|
| 148 |
+
"file_extension": ".py",
|
| 149 |
+
"mimetype": "text/x-python",
|
| 150 |
+
"name": "python",
|
| 151 |
+
"nbconvert_exporter": "python",
|
| 152 |
+
"pygments_lexer": "ipython3",
|
| 153 |
+
"version": "3.12.7"
|
| 154 |
+
}
|
| 155 |
+
},
|
| 156 |
+
"nbformat": 4,
|
| 157 |
+
"nbformat_minor": 4
|
| 158 |
+
}
|
ML/24_Deep_Learning_TensorFlow.ipynb
ADDED
|
@@ -0,0 +1,231 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 24 - Deep Learning with TensorFlow/Keras\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"Welcome to the world of modern **Deep Learning**! While we covered basic Neural Networks with Scikit-Learn, TensorFlow/Keras is the industry standard for building production-grade deep learning models.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Sequential API**: Building neural networks layer by layer.\n",
|
| 13 |
+
"2. **Activations**: ReLU, Sigmoid, Softmax for different layers.\n",
|
| 14 |
+
"3. **Optimization**: Adam, SGD, Learning rate scheduling.\n",
|
| 15 |
+
"4. **Callbacks**: Early stopping and Model checkpointing.\n",
|
| 16 |
+
"5. **Computer Vision**: Building a CNN for image classification.\n",
|
| 17 |
+
"\n",
|
| 18 |
+
"---"
|
| 19 |
+
]
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"cell_type": "markdown",
|
| 23 |
+
"metadata": {},
|
| 24 |
+
"source": [
|
| 25 |
+
"## 1. Setup\n",
|
| 26 |
+
"We will use the **MNIST** dataset for handwritten digit classification."
|
| 27 |
+
]
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"cell_type": "code",
|
| 31 |
+
"execution_count": null,
|
| 32 |
+
"metadata": {},
|
| 33 |
+
"outputs": [],
|
| 34 |
+
"source": [
|
| 35 |
+
"import numpy as np\n",
|
| 36 |
+
"import matplotlib.pyplot as plt\n",
|
| 37 |
+
"import tensorflow as tf\n",
|
| 38 |
+
"from tensorflow import keras\n",
|
| 39 |
+
"from tensorflow.keras import layers\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"# Load MNIST\n",
|
| 42 |
+
"(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"# Normalize to 0-1\n",
|
| 45 |
+
"X_train = X_train.astype('float32') / 255.0\n",
|
| 46 |
+
"X_test = X_test.astype('float32') / 255.0\n",
|
| 47 |
+
"\n",
|
| 48 |
+
"print(f\"Training shape: {X_train.shape}\")\n",
|
| 49 |
+
"print(f\"Test shape: {X_test.shape}\")"
|
| 50 |
+
]
|
| 51 |
+
},
|
| 52 |
+
{
|
| 53 |
+
"cell_type": "markdown",
|
| 54 |
+
"metadata": {},
|
| 55 |
+
"source": [
|
| 56 |
+
"## 2. Building a Simple Neural Network\n",
|
| 57 |
+
"\n",
|
| 58 |
+
"### Task 1: Sequential Model\n",
|
| 59 |
+
"Create a Sequential model with:\n",
|
| 60 |
+
"1. Flatten layer (to convert 28x28 to 784)\n",
|
| 61 |
+
"2. Dense layer with 128 units and ReLU activation\n",
|
| 62 |
+
"3. Output Dense layer with 10 units and Softmax activation"
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": null,
|
| 68 |
+
"metadata": {},
|
| 69 |
+
"outputs": [],
|
| 70 |
+
"source": [
|
| 71 |
+
"# YOUR CODE HERE"
|
| 72 |
+
]
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"cell_type": "markdown",
|
| 76 |
+
"metadata": {},
|
| 77 |
+
"source": [
|
| 78 |
+
"<details>\n",
|
| 79 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"```python\n",
|
| 82 |
+
"model = keras.Sequential([\n",
|
| 83 |
+
" layers.Flatten(input_shape=(28, 28)),\n",
|
| 84 |
+
" layers.Dense(128, activation='relu'),\n",
|
| 85 |
+
" layers.Dense(10, activation='softmax')\n",
|
| 86 |
+
"])\n",
|
| 87 |
+
"\n",
|
| 88 |
+
"model.summary()\n",
|
| 89 |
+
"```\n",
|
| 90 |
+
"</details>"
|
| 91 |
+
]
|
| 92 |
+
},
|
| 93 |
+
{
|
| 94 |
+
"cell_type": "markdown",
|
| 95 |
+
"metadata": {},
|
| 96 |
+
"source": [
|
| 97 |
+
"## 3. Compiling & Training\n",
|
| 98 |
+
"\n",
|
| 99 |
+
"### Task 2: Compile and Fit\n",
|
| 100 |
+
"Compile the model with:\n",
|
| 101 |
+
"- Optimizer: 'adam'\n",
|
| 102 |
+
"- Loss: 'sparse_categorical_crossentropy'\n",
|
| 103 |
+
"- Metrics: 'accuracy'\n",
|
| 104 |
+
"\n",
|
| 105 |
+
"Train for 5 epochs with a validation split of 0.2."
|
| 106 |
+
]
|
| 107 |
+
},
|
| 108 |
+
{
|
| 109 |
+
"cell_type": "code",
|
| 110 |
+
"execution_count": null,
|
| 111 |
+
"metadata": {},
|
| 112 |
+
"outputs": [],
|
| 113 |
+
"source": [
|
| 114 |
+
"# YOUR CODE HERE"
|
| 115 |
+
]
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"cell_type": "markdown",
|
| 119 |
+
"metadata": {},
|
| 120 |
+
"source": [
|
| 121 |
+
"<details>\n",
|
| 122 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"```python\n",
|
| 125 |
+
"model.compile(\n",
|
| 126 |
+
" optimizer='adam',\n",
|
| 127 |
+
" loss='sparse_categorical_crossentropy',\n",
|
| 128 |
+
" metrics=['accuracy']\n",
|
| 129 |
+
")\n",
|
| 130 |
+
"\n",
|
| 131 |
+
"history = model.fit(\n",
|
| 132 |
+
" X_train, y_train,\n",
|
| 133 |
+
" epochs=5,\n",
|
| 134 |
+
" validation_split=0.2\n",
|
| 135 |
+
")\n",
|
| 136 |
+
"```\n",
|
| 137 |
+
"</details>"
|
| 138 |
+
]
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"cell_type": "markdown",
|
| 142 |
+
"metadata": {},
|
| 143 |
+
"source": [
|
| 144 |
+
"## 4. Convolutional Neural Networks (CNN)\n",
|
| 145 |
+
"\n",
|
| 146 |
+
"### Task 3: Building a CNN\n",
|
| 147 |
+
"Create a CNN with:\n",
|
| 148 |
+
"1. Conv2D layer (32 filters, 3x3 kernel, ReLU)\n",
|
| 149 |
+
"2. MaxPooling2D (2x2)\n",
|
| 150 |
+
"3. Conv2D layer (64 filters, 3x3 kernel, ReLU)\n",
|
| 151 |
+
"4. MaxPooling2D (2x2)\n",
|
| 152 |
+
"5. Flatten\n",
|
| 153 |
+
"6. Dense (128, ReLU)\n",
|
| 154 |
+
"7. Dense (10, Softmax)"
|
| 155 |
+
]
|
| 156 |
+
},
|
| 157 |
+
{
|
| 158 |
+
"cell_type": "code",
|
| 159 |
+
"execution_count": null,
|
| 160 |
+
"metadata": {},
|
| 161 |
+
"outputs": [],
|
| 162 |
+
"source": [
|
| 163 |
+
"# Reshape for CNN (add channel dimension)\n",
|
| 164 |
+
"X_train_cnn = X_train.reshape(-1, 28, 28, 1)\n",
|
| 165 |
+
"X_test_cnn = X_test.reshape(-1, 28, 28, 1)\n",
|
| 166 |
+
"\n",
|
| 167 |
+
"# YOUR CODE HERE"
|
| 168 |
+
]
|
| 169 |
+
},
|
| 170 |
+
{
|
| 171 |
+
"cell_type": "markdown",
|
| 172 |
+
"metadata": {},
|
| 173 |
+
"source": [
|
| 174 |
+
"<details>\n",
|
| 175 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 176 |
+
"\n",
|
| 177 |
+
"```python\n",
|
| 178 |
+
"cnn_model = keras.Sequential([\n",
|
| 179 |
+
" layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),\n",
|
| 180 |
+
" layers.MaxPooling2D((2, 2)),\n",
|
| 181 |
+
" layers.Conv2D(64, (3, 3), activation='relu'),\n",
|
| 182 |
+
" layers.MaxPooling2D((2, 2)),\n",
|
| 183 |
+
" layers.Flatten(),\n",
|
| 184 |
+
" layers.Dense(128, activation='relu'),\n",
|
| 185 |
+
" layers.Dense(10, activation='softmax')\n",
|
| 186 |
+
"])\n",
|
| 187 |
+
"\n",
|
| 188 |
+
"cnn_model.compile(\n",
|
| 189 |
+
" optimizer='adam',\n",
|
| 190 |
+
" loss='sparse_categorical_crossentropy',\n",
|
| 191 |
+
" metrics=['accuracy']\n",
|
| 192 |
+
")\n",
|
| 193 |
+
"\n",
|
| 194 |
+
"cnn_model.fit(X_train_cnn, y_train, epochs=3, validation_split=0.2)\n",
|
| 195 |
+
"```\n",
|
| 196 |
+
"</details>"
|
| 197 |
+
]
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"cell_type": "markdown",
|
| 201 |
+
"metadata": {},
|
| 202 |
+
"source": [
|
| 203 |
+
"--- \n",
|
| 204 |
+
"### Deep Learning Unlocked! \n",
|
| 205 |
+
"You've now mastered TensorFlow/Keras, the most popular deep learning framework.\n",
|
| 206 |
+
"Next: **Model Deployment with Streamlit**."
|
| 207 |
+
]
|
| 208 |
+
}
|
| 209 |
+
],
|
| 210 |
+
"metadata": {
|
| 211 |
+
"kernelspec": {
|
| 212 |
+
"display_name": "Python 3",
|
| 213 |
+
"language": "python",
|
| 214 |
+
"name": "python3"
|
| 215 |
+
},
|
| 216 |
+
"language_info": {
|
| 217 |
+
"codemirror_mode": {
|
| 218 |
+
"name": "ipython",
|
| 219 |
+
"version": 3
|
| 220 |
+
},
|
| 221 |
+
"file_extension": ".py",
|
| 222 |
+
"mimetype": "text/x-python",
|
| 223 |
+
"name": "python",
|
| 224 |
+
"nbconvert_exporter": "python",
|
| 225 |
+
"pygments_lexer": "ipython3",
|
| 226 |
+
"version": "3.12.7"
|
| 227 |
+
}
|
| 228 |
+
},
|
| 229 |
+
"nbformat": 4,
|
| 230 |
+
"nbformat_minor": 4
|
| 231 |
+
}
|
ML/25_Model_Deployment_Streamlit.ipynb
ADDED
|
@@ -0,0 +1,176 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 25 - Model Deployment with Streamlit\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"A model in a notebook is just an experiment. A **deployed model** is a product! In this module, you'll learn to turn your ML models into interactive web applications using **Streamlit**.\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Objectives:\n",
|
| 12 |
+
"1. **Streamlit Basics**: Creating interactive UIs with pure Python.\n",
|
| 13 |
+
"2. **Model Persistence**: Saving and loading models with `joblib`.\n",
|
| 14 |
+
"3. **User Input**: Sliders, text boxes, and file uploads.\n",
|
| 15 |
+
"4. **Real-Time Prediction**: Deploying your Iris classifier as a web app.\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"---"
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## 1. Training and Saving a Model\n",
|
| 25 |
+
"\n",
|
| 26 |
+
"First, let's train a simple classifier and save it to disk."
|
| 27 |
+
]
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"cell_type": "code",
|
| 31 |
+
"execution_count": null,
|
| 32 |
+
"metadata": {},
|
| 33 |
+
"outputs": [],
|
| 34 |
+
"source": [
|
| 35 |
+
"from sklearn.datasets import load_iris\n",
|
| 36 |
+
"from sklearn.ensemble import RandomForestClassifier\n",
|
| 37 |
+
"import joblib\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"# Load and train\n",
|
| 40 |
+
"iris = load_iris()\n",
|
| 41 |
+
"X, y = iris.data, iris.target\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
|
| 44 |
+
"model.fit(X, y)\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"# Save the model\n",
|
| 47 |
+
"joblib.dump(model, 'iris_model.pkl')\n",
|
| 48 |
+
"print(\"Model saved as iris_model.pkl\")"
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "markdown",
|
| 53 |
+
"metadata": {},
|
| 54 |
+
"source": [
|
| 55 |
+
"## 2. Creating a Streamlit App\n",
|
| 56 |
+
"\n",
|
| 57 |
+
"### Task 1: Build the App\n",
|
| 58 |
+
"Create a file called `app.py` with the following Streamlit code. This app will:\n",
|
| 59 |
+
"1. Load the saved model\n",
|
| 60 |
+
"2. Accept user inputs (sepal/petal measurements)\n",
|
| 61 |
+
"3. Make predictions in real-time"
|
| 62 |
+
]
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "code",
|
| 66 |
+
"execution_count": null,
|
| 67 |
+
"metadata": {},
|
| 68 |
+
"outputs": [],
|
| 69 |
+
"source": [
|
| 70 |
+
"%%writefile app.py\n",
|
| 71 |
+
"import streamlit as st\n",
|
| 72 |
+
"import joblib\n",
|
| 73 |
+
"import numpy as np\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"# Load the model\n",
|
| 76 |
+
"model = joblib.load('iris_model.pkl')\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"st.title('πΈ Iris Species Predictor')\n",
|
| 79 |
+
"st.write('Enter the flower measurements to predict the species!')\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"# User inputs\n",
|
| 82 |
+
"sepal_length = st.slider('Sepal Length (cm)', 4.0, 8.0, 5.8)\n",
|
| 83 |
+
"sepal_width = st.slider('Sepal Width (cm)', 2.0, 4.5, 3.0)\n",
|
| 84 |
+
"petal_length = st.slider('Petal Length (cm)', 1.0, 7.0, 4.0)\n",
|
| 85 |
+
"petal_width = st.slider('Petal Width (cm)', 0.1, 2.5, 1.2)\n",
|
| 86 |
+
"\n",
|
| 87 |
+
"# Make prediction\n",
|
| 88 |
+
"if st.button('Predict Species'):\n",
|
| 89 |
+
" features = np.array([[sepal_length, sepal_width, petal_length, petal_width]])\n",
|
| 90 |
+
" prediction = model.predict(features)\n",
|
| 91 |
+
" species = ['Setosa', 'Versicolor', 'Virginica']\n",
|
| 92 |
+
" \n",
|
| 93 |
+
" st.success(f'Predicted Species: **{species[prediction[0]]}**')\n",
|
| 94 |
+
" st.balloons()"
|
| 95 |
+
]
|
| 96 |
+
},
|
| 97 |
+
{
|
| 98 |
+
"cell_type": "markdown",
|
| 99 |
+
"metadata": {},
|
| 100 |
+
"source": [
|
| 101 |
+
"## 3. Running the App\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"### Task 2: Launch Streamlit\n",
|
| 104 |
+
"Open your terminal and run:\n",
|
| 105 |
+
"```bash\n",
|
| 106 |
+
"streamlit run app.py\n",
|
| 107 |
+
"```\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"Your browser will open with an interactive web app!"
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "markdown",
|
| 114 |
+
"metadata": {},
|
| 115 |
+
"source": [
|
| 116 |
+
"## 4. Advanced Features\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"### Task 3: File Upload\n",
|
| 119 |
+
"Modify `app.py` to allow users to upload a CSV file and make batch predictions."
|
| 120 |
+
]
|
| 121 |
+
},
|
| 122 |
+
{
|
| 123 |
+
"cell_type": "markdown",
|
| 124 |
+
"metadata": {},
|
| 125 |
+
"source": [
|
| 126 |
+
"<details>\n",
|
| 127 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 128 |
+
"\n",
|
| 129 |
+
"```python\n",
|
| 130 |
+
"# Add this to your app.py\n",
|
| 131 |
+
"import pandas as pd\n",
|
| 132 |
+
"\n",
|
| 133 |
+
"uploaded_file = st.file_uploader(\"Upload CSV for batch predictions\", type=\"csv\")\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"if uploaded_file is not None:\n",
|
| 136 |
+
" df = pd.read_csv(uploaded_file)\n",
|
| 137 |
+
" predictions = model.predict(df)\n",
|
| 138 |
+
" df['Predicted Species'] = [species[p] for p in predictions]\n",
|
| 139 |
+
" st.write(df)\n",
|
| 140 |
+
"```\n",
|
| 141 |
+
"</details>"
|
| 142 |
+
]
|
| 143 |
+
},
|
| 144 |
+
{
|
| 145 |
+
"cell_type": "markdown",
|
| 146 |
+
"metadata": {},
|
| 147 |
+
"source": [
|
| 148 |
+
"--- \n",
|
| 149 |
+
"### Deployment Mastered! \n",
|
| 150 |
+
"You now know how to turn any ML model into a shareable web app.\n",
|
| 151 |
+
"Next: **End-to-End ML Project Workflow**."
|
| 152 |
+
]
|
| 153 |
+
}
|
| 154 |
+
],
|
| 155 |
+
"metadata": {
|
| 156 |
+
"kernelspec": {
|
| 157 |
+
"display_name": "Python 3",
|
| 158 |
+
"language": "python",
|
| 159 |
+
"name": "python3"
|
| 160 |
+
},
|
| 161 |
+
"language_info": {
|
| 162 |
+
"codemirror_mode": {
|
| 163 |
+
"name": "ipython",
|
| 164 |
+
"version": 3
|
| 165 |
+
},
|
| 166 |
+
"file_extension": ".py",
|
| 167 |
+
"mimetype": "text/x-python",
|
| 168 |
+
"name": "python",
|
| 169 |
+
"nbconvert_exporter": "python",
|
| 170 |
+
"pygments_lexer": "ipython3",
|
| 171 |
+
"version": "3.12.7"
|
| 172 |
+
}
|
| 173 |
+
},
|
| 174 |
+
"nbformat": 4,
|
| 175 |
+
"nbformat_minor": 4
|
| 176 |
+
}
|
ML/26_End_to_End_ML_Project.ipynb
ADDED
|
@@ -0,0 +1,298 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# ML Practice Series: Module 26 - End-to-End ML Project (Production Pipeline)\n",
|
| 8 |
+
"\n",
|
| 9 |
+
"This is the **FINAL MODULE** and the ultimate test of everything you've learned. You will build a complete, production-ready ML system from scratch that includes:\n",
|
| 10 |
+
"\n",
|
| 11 |
+
"### Full Production Workflow:\n",
|
| 12 |
+
"1. **Problem Definition & Data Collection**\n",
|
| 13 |
+
"2. **EDA & Statistical Analysis**\n",
|
| 14 |
+
"3. **Feature Engineering & Selection**\n",
|
| 15 |
+
"4. **Model Selection & Hyperparameter Tuning**\n",
|
| 16 |
+
"5. **Model Evaluation & Explainability (SHAP)**\n",
|
| 17 |
+
"6. **Model Persistence & Deployment**\n",
|
| 18 |
+
"7. **Monitoring & Documentation**\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"### Dataset:\n",
|
| 21 |
+
"We will use the **Credit Card Fraud Detection** dataset (highly imbalanced, real-world complexity).\n",
|
| 22 |
+
"\n",
|
| 23 |
+
"---"
|
| 24 |
+
]
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"cell_type": "markdown",
|
| 28 |
+
"metadata": {},
|
| 29 |
+
"source": [
|
| 30 |
+
"## Phase 1: Problem Understanding & Data Loading\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"### Business Goal:\n",
|
| 33 |
+
"Build a model to detect fraudulent credit card transactions to minimize financial losses.\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"**Success Metrics**: Precision, Recall, F1-Score (since data is imbalanced)"
|
| 36 |
+
]
|
| 37 |
+
},
|
| 38 |
+
{
|
| 39 |
+
"cell_type": "code",
|
| 40 |
+
"execution_count": null,
|
| 41 |
+
"metadata": {},
|
| 42 |
+
"outputs": [],
|
| 43 |
+
"source": [
|
| 44 |
+
"import pandas as pd\n",
|
| 45 |
+
"import numpy as np\n",
|
| 46 |
+
"import matplotlib.pyplot as plt\n",
|
| 47 |
+
"import seaborn as sns\n",
|
| 48 |
+
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
|
| 49 |
+
"from sklearn.preprocessing import StandardScaler\n",
|
| 50 |
+
"from sklearn.ensemble import RandomForestClassifier\n",
|
| 51 |
+
"from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score\n",
|
| 52 |
+
"import joblib\n",
|
| 53 |
+
"\n",
|
| 54 |
+
"# For this demo, we'll use a simulated dataset\n",
|
| 55 |
+
"# In production, replace with: pd.read_csv('creditcard.csv')\n",
|
| 56 |
+
"np.random.seed(42)\n",
|
| 57 |
+
"df = pd.DataFrame({\n",
|
| 58 |
+
" 'Amount': np.random.uniform(1, 5000, 1000),\n",
|
| 59 |
+
" 'Time': np.random.uniform(0, 172800, 1000),\n",
|
| 60 |
+
" 'V1': np.random.randn(1000),\n",
|
| 61 |
+
" 'V2': np.random.randn(1000),\n",
|
| 62 |
+
" 'Class': np.random.choice([0, 1], 1000, p=[0.95, 0.05])\n",
|
| 63 |
+
"})\n",
|
| 64 |
+
"\n",
|
| 65 |
+
"print(\"Dataset loaded!\")\n",
|
| 66 |
+
"df.head()"
|
| 67 |
+
]
|
| 68 |
+
},
|
| 69 |
+
{
|
| 70 |
+
"cell_type": "markdown",
|
| 71 |
+
"metadata": {},
|
| 72 |
+
"source": [
|
| 73 |
+
"## Phase 2: Exploratory Data Analysis (EDA)\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"### Task 1: Check Class Imbalance\n",
|
| 76 |
+
"Plot the distribution of fraud vs non-fraud transactions."
|
| 77 |
+
]
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"cell_type": "code",
|
| 81 |
+
"execution_count": null,
|
| 82 |
+
"metadata": {},
|
| 83 |
+
"outputs": [],
|
| 84 |
+
"source": [
|
| 85 |
+
"# YOUR CODE HERE"
|
| 86 |
+
]
|
| 87 |
+
},
|
| 88 |
+
{
|
| 89 |
+
"cell_type": "markdown",
|
| 90 |
+
"metadata": {},
|
| 91 |
+
"source": [
|
| 92 |
+
"<details>\n",
|
| 93 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"```python\n",
|
| 96 |
+
"sns.countplot(x='Class', data=df)\n",
|
| 97 |
+
"plt.title('Fraud vs Normal Transactions')\n",
|
| 98 |
+
"plt.show()\n",
|
| 99 |
+
"print(df['Class'].value_counts())\n",
|
| 100 |
+
"```\n",
|
| 101 |
+
"</details>"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "markdown",
|
| 106 |
+
"metadata": {},
|
| 107 |
+
"source": [
|
| 108 |
+
"## Phase 3: Feature Engineering\n",
|
| 109 |
+
"\n",
|
| 110 |
+
"### Task 2: Scaling & Train-Test Split\n",
|
| 111 |
+
"1. Scale the `Amount` and `Time` columns\n",
|
| 112 |
+
"2. Split data (80/20) with stratification"
|
| 113 |
+
]
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"cell_type": "code",
|
| 117 |
+
"execution_count": null,
|
| 118 |
+
"metadata": {},
|
| 119 |
+
"outputs": [],
|
| 120 |
+
"source": [
|
| 121 |
+
"# YOUR CODE HERE"
|
| 122 |
+
]
|
| 123 |
+
},
|
| 124 |
+
{
|
| 125 |
+
"cell_type": "markdown",
|
| 126 |
+
"metadata": {},
|
| 127 |
+
"source": [
|
| 128 |
+
"<details>\n",
|
| 129 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 130 |
+
"\n",
|
| 131 |
+
"```python\n",
|
| 132 |
+
"scaler = StandardScaler()\n",
|
| 133 |
+
"df[['Amount', 'Time']] = scaler.fit_transform(df[['Amount', 'Time']])\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"X = df.drop('Class', axis=1)\n",
|
| 136 |
+
"y = df['Class']\n",
|
| 137 |
+
"\n",
|
| 138 |
+
"X_train, X_test, y_train, y_test = train_test_split(\n",
|
| 139 |
+
" X, y, test_size=0.2, stratify=y, random_state=42\n",
|
| 140 |
+
")\n",
|
| 141 |
+
"```\n",
|
| 142 |
+
"</details>"
|
| 143 |
+
]
|
| 144 |
+
},
|
| 145 |
+
{
|
| 146 |
+
"cell_type": "markdown",
|
| 147 |
+
"metadata": {},
|
| 148 |
+
"source": [
|
| 149 |
+
"## Phase 4: Model Training & Hyperparameter Tuning\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"### Task 3: GridSearchCV\n",
|
| 152 |
+
"Use GridSearch to find the best `max_depth` and `n_estimators` for a Random Forest."
|
| 153 |
+
]
|
| 154 |
+
},
|
| 155 |
+
{
|
| 156 |
+
"cell_type": "code",
|
| 157 |
+
"execution_count": null,
|
| 158 |
+
"metadata": {},
|
| 159 |
+
"outputs": [],
|
| 160 |
+
"source": [
|
| 161 |
+
"# YOUR CODE HERE"
|
| 162 |
+
]
|
| 163 |
+
},
|
| 164 |
+
{
|
| 165 |
+
"cell_type": "markdown",
|
| 166 |
+
"metadata": {},
|
| 167 |
+
"source": [
|
| 168 |
+
"<details>\n",
|
| 169 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 170 |
+
"\n",
|
| 171 |
+
"```python\n",
|
| 172 |
+
"param_grid = {\n",
|
| 173 |
+
" 'n_estimators': [50, 100],\n",
|
| 174 |
+
" 'max_depth': [10, 20, None]\n",
|
| 175 |
+
"}\n",
|
| 176 |
+
"\n",
|
| 177 |
+
"rf = RandomForestClassifier(random_state=42)\n",
|
| 178 |
+
"grid = GridSearchCV(rf, param_grid, cv=3, scoring='f1')\n",
|
| 179 |
+
"grid.fit(X_train, y_train)\n",
|
| 180 |
+
"\n",
|
| 181 |
+
"print(\"Best params:\", grid.best_params_)\n",
|
| 182 |
+
"best_model = grid.best_estimator_\n",
|
| 183 |
+
"```\n",
|
| 184 |
+
"</details>"
|
| 185 |
+
]
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"cell_type": "markdown",
|
| 189 |
+
"metadata": {},
|
| 190 |
+
"source": [
|
| 191 |
+
"## Phase 5: Model Evaluation\n",
|
| 192 |
+
"\n",
|
| 193 |
+
"### Task 4: Comprehensive Metrics\n",
|
| 194 |
+
"Evaluate with Confusion Matrix, Classification Report, and ROC-AUC."
|
| 195 |
+
]
|
| 196 |
+
},
|
| 197 |
+
{
|
| 198 |
+
"cell_type": "code",
|
| 199 |
+
"execution_count": null,
|
| 200 |
+
"metadata": {},
|
| 201 |
+
"outputs": [],
|
| 202 |
+
"source": [
|
| 203 |
+
"# YOUR CODE HERE"
|
| 204 |
+
]
|
| 205 |
+
},
|
| 206 |
+
{
|
| 207 |
+
"cell_type": "markdown",
|
| 208 |
+
"metadata": {},
|
| 209 |
+
"source": [
|
| 210 |
+
"<details>\n",
|
| 211 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 212 |
+
"\n",
|
| 213 |
+
"```python\n",
|
| 214 |
+
"y_pred = best_model.predict(X_test)\n",
|
| 215 |
+
"\n",
|
| 216 |
+
"print(classification_report(y_test, y_pred))\n",
|
| 217 |
+
"print(\"ROC-AUC:\", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))\n",
|
| 218 |
+
"\n",
|
| 219 |
+
"sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')\n",
|
| 220 |
+
"plt.show()\n",
|
| 221 |
+
"```\n",
|
| 222 |
+
"</details>"
|
| 223 |
+
]
|
| 224 |
+
},
|
| 225 |
+
{
|
| 226 |
+
"cell_type": "markdown",
|
| 227 |
+
"metadata": {},
|
| 228 |
+
"source": [
|
| 229 |
+
"## Phase 6: Model Persistence\n",
|
| 230 |
+
"\n",
|
| 231 |
+
"### Task 5: Save the Pipeline\n",
|
| 232 |
+
"Save the scaler and model for production deployment."
|
| 233 |
+
]
|
| 234 |
+
},
|
| 235 |
+
{
|
| 236 |
+
"cell_type": "code",
|
| 237 |
+
"execution_count": null,
|
| 238 |
+
"metadata": {},
|
| 239 |
+
"outputs": [],
|
| 240 |
+
"source": [
|
| 241 |
+
"# YOUR CODE HERE"
|
| 242 |
+
]
|
| 243 |
+
},
|
| 244 |
+
{
|
| 245 |
+
"cell_type": "markdown",
|
| 246 |
+
"metadata": {},
|
| 247 |
+
"source": [
|
| 248 |
+
"<details>\n",
|
| 249 |
+
"<summary><b>Click to see Solution</b></summary>\n",
|
| 250 |
+
"\n",
|
| 251 |
+
"```python\n",
|
| 252 |
+
"joblib.dump(best_model, 'fraud_model.pkl')\n",
|
| 253 |
+
"joblib.dump(scaler, 'scaler.pkl')\n",
|
| 254 |
+
"print(\"Production artifacts saved!\")\n",
|
| 255 |
+
"```\n",
|
| 256 |
+
"</details>"
|
| 257 |
+
]
|
| 258 |
+
},
|
| 259 |
+
{
|
| 260 |
+
"cell_type": "markdown",
|
| 261 |
+
"metadata": {},
|
| 262 |
+
"source": [
|
| 263 |
+
"--- \n",
|
| 264 |
+
"### π CONGRATULATIONS! \n",
|
| 265 |
+
"You have completed the **ENTIRE 26-MODULE CURRICULUM**. \n",
|
| 266 |
+
"\n",
|
| 267 |
+
"You are now ready to:\n",
|
| 268 |
+
"- Build production ML systems\n",
|
| 269 |
+
"- Compete in Kaggle competitions\n",
|
| 270 |
+
"- Interview for Data Scientist roles\n",
|
| 271 |
+
"- Deploy models to the real world\n",
|
| 272 |
+
"\n",
|
| 273 |
+
"**Your journey has just begun!** π"
|
| 274 |
+
]
|
| 275 |
+
}
|
| 276 |
+
],
|
| 277 |
+
"metadata": {
|
| 278 |
+
"kernelspec": {
|
| 279 |
+
"display_name": "Python 3",
|
| 280 |
+
"language": "python",
|
| 281 |
+
"name": "python3"
|
| 282 |
+
},
|
| 283 |
+
"language_info": {
|
| 284 |
+
"codemirror_mode": {
|
| 285 |
+
"name": "ipython",
|
| 286 |
+
"version": 3
|
| 287 |
+
},
|
| 288 |
+
"file_extension": ".py",
|
| 289 |
+
"mimetype": "text/x-python",
|
| 290 |
+
"name": "python",
|
| 291 |
+
"nbconvert_exporter": "python",
|
| 292 |
+
"pygments_lexer": "ipython3",
|
| 293 |
+
"version": "3.12.7"
|
| 294 |
+
}
|
| 295 |
+
},
|
| 296 |
+
"nbformat": 4,
|
| 297 |
+
"nbformat_minor": 4
|
| 298 |
+
}
|
ML/CURRICULUM_REVIEW.md
ADDED
|
@@ -0,0 +1,229 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# π Complete Curriculum Review: All 23 Modules
|
| 2 |
+
|
| 3 |
+
This document provides a comprehensive review of your entire Data Science & Machine Learning practice curriculum.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## π Module Overview & Quality Assessment
|
| 8 |
+
|
| 9 |
+
### **Phase 1: Foundations (Modules 01-02)** β
|
| 10 |
+
|
| 11 |
+
#### **Module 01: Python Core Mastery**
|
| 12 |
+
- **Status**: β
COMPLETE (World-Class)
|
| 13 |
+
- **Concepts Covered**:
|
| 14 |
+
- Basic: Strings, F-Strings, Slicing, Data Structures
|
| 15 |
+
- Intermediate: Comprehensions, Generators, Decorators
|
| 16 |
+
- Advanced: OOP (Dunder Methods, Static Methods), Async/Await
|
| 17 |
+
- Expert: Multithreading vs Multiprocessing (GIL), Singleton Pattern
|
| 18 |
+
- **Strengths**: Covers beginner to architectural patterns. Industry-ready.
|
| 19 |
+
- **Website Integration**: N/A (Core Python)
|
| 20 |
+
- **Recommendation**: **Perfect foundation. No changes needed.**
|
| 21 |
+
|
| 22 |
+
#### **Module 02: Statistics Foundations**
|
| 23 |
+
- **Status**: β
COMPLETE (Enhanced)
|
| 24 |
+
- **Concepts Covered**:
|
| 25 |
+
- Central Tendency (Mean, Median, Mode)
|
| 26 |
+
- Dispersion (Std Dev, IQR)
|
| 27 |
+
- Z-Scores & Outlier Detection
|
| 28 |
+
- Correlation & Hypothesis Testing (p-values)
|
| 29 |
+
- **Strengths**: Includes advanced stats (hypothesis testing, correlation).
|
| 30 |
+
- **Website Integration**: β
Links to [Complete Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)
|
| 31 |
+
- **Recommendation**: **Excellent. Ready for use.**
|
| 32 |
+
|
| 33 |
+
---
|
| 34 |
+
|
| 35 |
+
### **Phase 2: Data Science Toolbox (Modules 03-07)** β
|
| 36 |
+
|
| 37 |
+
#### **Module 03: NumPy Practice**
|
| 38 |
+
- **Status**: β
COMPLETE
|
| 39 |
+
- **Concepts**: Arrays, Broadcasting, Matrix Operations, Statistics
|
| 40 |
+
- **Website Integration**: β
Links to Math for Data Science
|
| 41 |
+
- **Recommendation**: **Good coverage of NumPy essentials.**
|
| 42 |
+
|
| 43 |
+
#### **Module 04: Pandas Practice**
|
| 44 |
+
- **Status**: β
COMPLETE
|
| 45 |
+
- **Concepts**: DataFrames, Filtering, GroupBy, Merging
|
| 46 |
+
- **Website Integration**: β
Links to Feature Engineering Guide
|
| 47 |
+
- **Recommendation**: **Solid foundation for data manipulation.**
|
| 48 |
+
|
| 49 |
+
#### **Module 05: Matplotlib & Seaborn Practice**
|
| 50 |
+
- **Status**: β
COMPLETE
|
| 51 |
+
- **Concepts**: Line/Scatter plots, Distributions, Categorical plots, Pair plots
|
| 52 |
+
- **Website Integration**: β
Links to Visualization section
|
| 53 |
+
- **Recommendation**: **Great visual exploration coverage.**
|
| 54 |
+
|
| 55 |
+
#### **Module 06: EDA & Feature Engineering**
|
| 56 |
+
- **Status**: β
COMPLETE (Titanic Dataset)
|
| 57 |
+
- **Concepts**: Missing values, Distributions, Encoding, Feature creation
|
| 58 |
+
- **Website Integration**: β
Links to Feature Engineering Guide
|
| 59 |
+
- **Recommendation**: **Excellent hands-on with real data.**
|
| 60 |
+
|
| 61 |
+
#### **Module 07: Scikit-Learn Practice**
|
| 62 |
+
- **Status**: β
COMPLETE
|
| 63 |
+
- **Concepts**: Train-test split, Pipelines, Cross-validation, GridSearch
|
| 64 |
+
- **Website Integration**: β
Links to ML Guide
|
| 65 |
+
- **Recommendation**: **Essential utilities well covered.**
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
### **Phase 3: Supervised Learning (Modules 08-14)** β
|
| 70 |
+
|
| 71 |
+
#### **Module 08: Linear Regression**
|
| 72 |
+
- **Status**: β
COMPLETE (Diamonds Dataset)
|
| 73 |
+
- **Concepts**: Encoding, Model training, R2 Score, RMSE
|
| 74 |
+
- **Website Integration**: β
Links to Math for DS (Optimization)
|
| 75 |
+
- **Recommendation**: **Good regression intro.**
|
| 76 |
+
|
| 77 |
+
#### **Module 09: Logistic Regression**
|
| 78 |
+
- **Status**: β
COMPLETE (Breast Cancer Dataset)
|
| 79 |
+
- **Concepts**: Scaling, Binary classification, Confusion Matrix, ROC
|
| 80 |
+
- **Website Integration**: β
Links to ML Guide
|
| 81 |
+
- **Recommendation**: **Strong classification foundation.**
|
| 82 |
+
|
| 83 |
+
#### **Module 10: Support Vector Machines (SVM)**
|
| 84 |
+
- **Status**: β
COMPLETE (Moons Dataset)
|
| 85 |
+
- **Concepts**: Linear vs kernel SVMs, RBF kernel, C parameter tuning
|
| 86 |
+
- **Website Integration**: β
Links to ML Guide
|
| 87 |
+
- **Recommendation**: **Good kernel trick demonstration.**
|
| 88 |
+
|
| 89 |
+
#### **Module 11: K-Nearest Neighbors (KNN)**
|
| 90 |
+
- **Status**: β
COMPLETE (Iris Dataset)
|
| 91 |
+
- **Concepts**: Distance metrics, Elbow method for K, Scaling importance
|
| 92 |
+
- **Website Integration**: β
Links to ML Guide
|
| 93 |
+
- **Recommendation**: **Clear instance-based learning example.**
|
| 94 |
+
|
| 95 |
+
#### **Module 12: Naive Bayes**
|
| 96 |
+
- **Status**: β
COMPLETE (Text/Spam Dataset)
|
| 97 |
+
- **Concepts**: Bayes Theorem, Text vectorization, Multinomial NB
|
| 98 |
+
- **Website Integration**: β
Links to ML Guide
|
| 99 |
+
- **Recommendation**: **Good intro to probabilistic models.**
|
| 100 |
+
|
| 101 |
+
#### **Module 13: Decision Trees & Random Forests**
|
| 102 |
+
- **Status**: β
COMPLETE (Penguins Dataset)
|
| 103 |
+
- **Concepts**: Tree visualization, Feature importance, Ensemble methods
|
| 104 |
+
- **Website Integration**: β
Links to ML Guide
|
| 105 |
+
- **Recommendation**: **Strong tree-based model coverage.**
|
| 106 |
+
|
| 107 |
+
#### **Module 14: Gradient Boosting & XGBoost**
|
| 108 |
+
- **Status**: β
COMPLETE (Wine Dataset)
|
| 109 |
+
- **Concepts**: Boosting principle, GradientBoosting, XGBoost
|
| 110 |
+
- **Website Integration**: β
Links to ML Guide
|
| 111 |
+
- **Note**: Requires `pip install xgboost`
|
| 112 |
+
- **Recommendation**: **Critical Kaggle-level skill included.**
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
### **Phase 4: Unsupervised Learning (Modules 15-16)** β
|
| 117 |
+
|
| 118 |
+
#### **Module 15: K-Means Clustering**
|
| 119 |
+
- **Status**: β
COMPLETE (Synthetic Data)
|
| 120 |
+
- **Concepts**: Elbow method, Cluster visualization
|
| 121 |
+
- **Website Integration**: β
Links to ML Guide
|
| 122 |
+
- **Recommendation**: **Good clustering intro.**
|
| 123 |
+
|
| 124 |
+
#### **Module 16: Dimensionality Reduction (PCA)**
|
| 125 |
+
- **Status**: β
COMPLETE (Digits Dataset)
|
| 126 |
+
- **Concepts**: 2D projection, Scree plot, Explained variance
|
| 127 |
+
- **Website Integration**: β
Links to Math for DS (Linear Algebra)
|
| 128 |
+
- **Recommendation**: **Excellent PCA explanation.**
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
+
### **Phase 5: Advanced ML (Modules 17-20)** β
|
| 133 |
+
|
| 134 |
+
#### **Module 17: Neural Networks & Deep Learning**
|
| 135 |
+
- **Status**: β
COMPLETE (MNIST)
|
| 136 |
+
- **Concepts**: MLPClassifier, Hidden layers, Activation functions
|
| 137 |
+
- **Website Integration**: β
Links to Math for DS (Calculus)
|
| 138 |
+
- **Recommendation**: **Good foundation for DL.**
|
| 139 |
+
|
| 140 |
+
#### **Module 18: Time Series Analysis**
|
| 141 |
+
- **Status**: β
COMPLETE (Air Passengers Dataset)
|
| 142 |
+
- **Concepts**: Datetime handling, Rolling windows, Trend smoothing
|
| 143 |
+
- **Website Integration**: β
Links to Feature Engineering
|
| 144 |
+
- **Recommendation**: **Good temporal data intro.**
|
| 145 |
+
|
| 146 |
+
#### **Module 19: Natural Language Processing (NLP)**
|
| 147 |
+
- **Status**: β
COMPLETE (Movie Reviews)
|
| 148 |
+
- **Concepts**: TF-IDF, Sentiment analysis, Text classification
|
| 149 |
+
- **Website Integration**: β
Links to ML Guide
|
| 150 |
+
- **Recommendation**: **Solid NLP foundation.**
|
| 151 |
+
|
| 152 |
+
#### **Module 20: Reinforcement Learning Basics**
|
| 153 |
+
- **Status**: β
COMPLETE (Grid World)
|
| 154 |
+
- **Concepts**: Q-Learning, Agent-environment loop, Epsilon-greedy
|
| 155 |
+
- **Website Integration**: β
Links to ML Guide
|
| 156 |
+
- **Recommendation**: **Great RL introduction from scratch.**
|
| 157 |
+
|
| 158 |
+
---
|
| 159 |
+
|
| 160 |
+
### **Phase 6: Industry Skills (Modules 21-23)** β
|
| 161 |
+
|
| 162 |
+
#### **Module 21: Kaggle Project (Medical Costs)**
|
| 163 |
+
- **Status**: β
COMPLETE (External Dataset)
|
| 164 |
+
- **Concepts**: Full pipeline, EDA, Feature engineering, Random Forest
|
| 165 |
+
- **Website Integration**: β
Links to multiple sections
|
| 166 |
+
- **Recommendation**: **Excellent capstone project.**
|
| 167 |
+
|
| 168 |
+
#### **Module 22: SQL for Data Science**
|
| 169 |
+
- **Status**: β
COMPLETE (SQLite)
|
| 170 |
+
- **Concepts**: SQL queries, `pd.read_sql_query`, Database basics
|
| 171 |
+
- **Website Integration**: N/A (Core skill)
|
| 172 |
+
- **Recommendation**: **Critical industry gap filled.**
|
| 173 |
+
|
| 174 |
+
#### **Module 23: Model Explainability (SHAP)**
|
| 175 |
+
- **Status**: β
COMPLETE (Breast Cancer)
|
| 176 |
+
- **Concepts**: SHAP values, Global/local interpretability, Force plots
|
| 177 |
+
- **Website Integration**: N/A (Advanced library)
|
| 178 |
+
- **Note**: Requires `pip install shap`
|
| 179 |
+
- **Recommendation**: **Elite-level XAI skill. Excellent addition.**
|
| 180 |
+
|
| 181 |
+
---
|
| 182 |
+
|
| 183 |
+
## β
Overall Curriculum Assessment
|
| 184 |
+
|
| 185 |
+
### **Strengths**:
|
| 186 |
+
1. β
**Comprehensive Coverage**: From Python basics to Advanced XAI.
|
| 187 |
+
2. β
**Website Integration**: All modules link to [DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/).
|
| 188 |
+
3. β
**Hands-On**: Every module uses real datasets (Titanic, MNIST, Kaggle, etc.).
|
| 189 |
+
4. β
**Progressive Difficulty**: Perfect learning curve from beginner to expert.
|
| 190 |
+
5. β
**Industry-Ready**: Includes SQL, Explainability, and Design Patterns.
|
| 191 |
+
|
| 192 |
+
### **Missing/Optional Enhancements**:
|
| 193 |
+
1. β οΈ **Deep Learning Frameworks**: Consider adding separate TensorFlow/PyTorch modules (optional).
|
| 194 |
+
2. β οΈ **Model Deployment**: Add a Streamlit or FastAPI deployment module (optional).
|
| 195 |
+
3. β οΈ **Big Data**: Spark/Dask for large-scale processing (advanced, optional).
|
| 196 |
+
|
| 197 |
+
### **Dependencies Check**:
|
| 198 |
+
Update `requirements.txt` to ensure it includes:
|
| 199 |
+
```
|
| 200 |
+
xgboost
|
| 201 |
+
shap
|
| 202 |
+
scipy
|
| 203 |
+
```
|
| 204 |
+
|
| 205 |
+
---
|
| 206 |
+
|
| 207 |
+
## π― Final Verdict
|
| 208 |
+
|
| 209 |
+
**Grade**: **A+ (Exceptional)**
|
| 210 |
+
|
| 211 |
+
This is a **production-ready, professional-grade Data Science curriculum**. It covers:
|
| 212 |
+
- β
All fundamental concepts
|
| 213 |
+
- β
All major algorithms
|
| 214 |
+
- β
Industry best practices
|
| 215 |
+
- β
Advanced architectural patterns
|
| 216 |
+
- β
External data integration
|
| 217 |
+
|
| 218 |
+
**Recommendation**: This curriculum is ready for immediate use. You can start with Module 01 and work sequentially through Module 23.
|
| 219 |
+
|
| 220 |
+
**Next Steps**:
|
| 221 |
+
1. Update `requirements.txt` (I'll do this now)
|
| 222 |
+
2. Start practicing from Module 01
|
| 223 |
+
3. Optional: Add deployment module later if needed
|
| 224 |
+
|
| 225 |
+
---
|
| 226 |
+
|
| 227 |
+
*Review Date: 2025-12-20*
|
| 228 |
+
*Total Modules: 23*
|
| 229 |
+
*Status: β
PRODUCTION READY*
|
ML/README.md
ADDED
|
@@ -0,0 +1,163 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# π Complete Machine Learning & Data Science Curriculum
|
| 2 |
+
|
| 3 |
+
## **26 Modules β’ From Zero to Production-Ready ML Engineer**
|
| 4 |
+
|
| 5 |
+
Welcome to the most comprehensive, hands-on Data Science practice curriculum ever created. This series takes you from **Core Python** to deploying **production ML systems**.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## π **Curriculum Structure**
|
| 10 |
+
|
| 11 |
+
### **π Phase 1: Foundations (Modules 01-02)**
|
| 12 |
+
|
| 13 |
+
1. **[01_Python_Core_Mastery.ipynb](./01_Python_Core_Mastery.ipynb)**
|
| 14 |
+
- **Basics**: Strings, F-Strings, Slicing, Data Structures
|
| 15 |
+
- **Intermediate**: Comprehensions, Generators, Decorators
|
| 16 |
+
- **Advanced**: OOP (Dunder Methods, Static Methods), Async/Await
|
| 17 |
+
- **Expert**: Multithreading vs Multiprocessing (GIL), Singleton Pattern
|
| 18 |
+
|
| 19 |
+
2. **[02_Statistics_Foundations.ipynb](./02_Statistics_Foundations.ipynb)**
|
| 20 |
+
- Central Tendency, Dispersion, Z-Scores
|
| 21 |
+
- Correlation, Hypothesis Testing (p-values)
|
| 22 |
+
- Links: [Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
### **π§ Phase 2: Data Science Toolbox (Modules 03-07)**
|
| 27 |
+
|
| 28 |
+
3. **[03_NumPy_Practice.ipynb](./03_NumPy_Practice.ipynb)** - Numerical Computing
|
| 29 |
+
4. **[04_Pandas_Practice.ipynb](./04_Pandas_Practice.ipynb)** - Data Manipulation
|
| 30 |
+
5. **[05_Matplotlib_Seaborn_Practice.ipynb](./05_Matplotlib_Seaborn_Practice.ipynb)** - Visualization
|
| 31 |
+
6. **[06_EDA_and_Feature_Engineering.ipynb](./06_EDA_and_Feature_Engineering.ipynb)** - Real Titanic Dataset
|
| 32 |
+
7. **[07_Scikit_Learn_Practice.ipynb](./07_Scikit_Learn_Practice.ipynb)** - Pipelines & GridSearch
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
### **π€ Phase 3: Supervised Learning (Modules 08-14)**
|
| 37 |
+
|
| 38 |
+
8. **[08_Linear_Regression.ipynb](./08_Linear_Regression.ipynb)** - Diamonds Dataset
|
| 39 |
+
9. **[09_Logistic_Regression.ipynb](./09_Logistic_Regression.ipynb)** - Breast Cancer Dataset
|
| 40 |
+
10. **[10_Support_Vector_Machines.ipynb](./10_Support_Vector_Machines.ipynb)** - Kernel Trick
|
| 41 |
+
11. **[11_K_Nearest_Neighbors.ipynb](./11_K_Nearest_Neighbors.ipynb)** - Iris Dataset
|
| 42 |
+
12. **[12_Naive_Bayes.ipynb](./12_Naive_Bayes.ipynb)** - Text Classification
|
| 43 |
+
13. **[13_Decision_Trees_and_Random_Forests.ipynb](./13_Decision_Trees_and_Random_Forests.ipynb)** - Penguins Dataset
|
| 44 |
+
14. **[14_Gradient_Boosting_XGBoost.ipynb](./14_Gradient_Boosting_XGBoost.ipynb)** - Kaggle Champion
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
### **π Phase 4: Unsupervised Learning (Modules 15-16)**
|
| 49 |
+
|
| 50 |
+
15. **[15_KMeans_Clustering.ipynb](./15_KMeans_Clustering.ipynb)** - Elbow Method
|
| 51 |
+
16. **[16_Dimensionality_Reduction_PCA.ipynb](./16_Dimensionality_Reduction_PCA.ipynb)** - Digits Dataset
|
| 52 |
+
|
| 53 |
+
---
|
| 54 |
+
|
| 55 |
+
### **π§ Phase 5: Advanced ML (Modules 17-20)**
|
| 56 |
+
|
| 57 |
+
17. **[17_Neural_Networks_Deep_Learning.ipynb](./17_Neural_Networks_Deep_Learning.ipynb)** - MNIST with MLPClassifier
|
| 58 |
+
18. **[18_Time_Series_Analysis.ipynb](./18_Time_Series_Analysis.ipynb)** - Air Passengers Dataset
|
| 59 |
+
19. **[19_Natural_Language_Processing_NLP.ipynb](./19_Natural_Language_Processing_NLP.ipynb)** - Sentiment Analysis
|
| 60 |
+
20. **[20_Reinforcement_Learning_Basics.ipynb](./20_Reinforcement_Learning_Basics.ipynb)** - Q-Learning Grid World
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
### **πΌ Phase 6: Industry Skills (Modules 21-23)**
|
| 65 |
+
|
| 66 |
+
21. **[21_Kaggle_Project_Medical_Costs.ipynb](./21_Kaggle_Project_Medical_Costs.ipynb)** - Full Pipeline
|
| 67 |
+
22. **[22_SQL_for_Data_Science.ipynb](./22_SQL_for_Data_Science.ipynb)** - Database Integration
|
| 68 |
+
23. **[23_Model_Explainability_SHAP.ipynb](./23_Model_Explainability_SHAP.ipynb)** - XAI with SHAP
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
|
| 72 |
+
### **π Phase 7: Production & Deployment (Modules 24-26)** β NEW!
|
| 73 |
+
|
| 74 |
+
24. **[24_Deep_Learning_TensorFlow.ipynb](./24_Deep_Learning_TensorFlow.ipynb)** - TensorFlow/Keras & CNNs
|
| 75 |
+
25. **[25_Model_Deployment_Streamlit.ipynb](./25_Model_Deployment_Streamlit.ipynb)** - Web App Deployment
|
| 76 |
+
26. **[26_End_to_End_ML_Project.ipynb](./26_End_to_End_ML_Project.ipynb)** - Production Pipeline
|
| 77 |
+
|
| 78 |
+
---
|
| 79 |
+
|
| 80 |
+
## π οΈ **Setup Instructions**
|
| 81 |
+
|
| 82 |
+
### **1. Install Dependencies**
|
| 83 |
+
```bash
|
| 84 |
+
pip install -r requirements.txt
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
### **2. Launch Jupyter**
|
| 88 |
+
```bash
|
| 89 |
+
jupyter notebook
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
### **3. Start Learning!**
|
| 93 |
+
Open `01_Python_Core_Mastery.ipynb` and work sequentially through Module 26.
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
## π **Website Integration**
|
| 98 |
+
|
| 99 |
+
This curriculum is designed to work seamlessly with the **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)**. Each ML module links to interactive visualizations and theory.
|
| 100 |
+
|
| 101 |
+
---
|
| 102 |
+
|
| 103 |
+
## π **What Makes This Curriculum Unique?**
|
| 104 |
+
|
| 105 |
+
β
**26 Complete Modules** - From Python basics to production deployment
|
| 106 |
+
β
**Real Datasets** - Titanic, MNIST, Kaggle Insurance, and more
|
| 107 |
+
β
**Website Integration** - Links to visual demos for every concept
|
| 108 |
+
β
**Industry-Ready** - Includes SQL, SHAP, Design Patterns, Async programming
|
| 109 |
+
β
**Production Skills** - TensorFlow, Streamlit, Model Deployment
|
| 110 |
+
β
**Git-Ready** - Initialized with version control
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## π **Key Files**
|
| 115 |
+
|
| 116 |
+
- **[CURRICULUM_REVIEW.md](./CURRICULUM_REVIEW.md)** - Quality assessment of all modules
|
| 117 |
+
- **[README_Resources.md](./README_Resources.md)** - External learning resources
|
| 118 |
+
- **[requirements.txt](./requirements.txt)** - All dependencies
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## π― **Who Is This For?**
|
| 123 |
+
|
| 124 |
+
- π **Students** learning Data Science from scratch
|
| 125 |
+
- πΌ **Professionals** preparing for DS/ML interviews
|
| 126 |
+
- π§βπ» **Developers** transitioning to ML engineering
|
| 127 |
+
- π **Kagglers** wanting structured practice
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## π **Learning Path**
|
| 132 |
+
|
| 133 |
+
**Beginner** (Weeks 1-4): Modules 01-07
|
| 134 |
+
**Intermediate** (Weeks 5-8): Modules 08-16
|
| 135 |
+
**Advanced** (Weeks 9-12): Modules 17-23
|
| 136 |
+
**Expert** (Weeks 13-14): Modules 24-26
|
| 137 |
+
|
| 138 |
+
---
|
| 139 |
+
|
| 140 |
+
## π **After Completion**
|
| 141 |
+
|
| 142 |
+
You will be able to:
|
| 143 |
+
- β
Build end-to-end ML systems
|
| 144 |
+
- β
Deploy models as web applications
|
| 145 |
+
- β
Compete in Kaggle competitions
|
| 146 |
+
- β
Pass ML engineering interviews
|
| 147 |
+
- β
Explain model decisions with SHAP
|
| 148 |
+
|
| 149 |
+
---
|
| 150 |
+
|
| 151 |
+
## π€ **Contributing**
|
| 152 |
+
|
| 153 |
+
This curriculum is part of a personal learning journey integrated with [aashishgarg13.github.io/DataScience/](https://aashishgarg13.github.io/DataScience/).
|
| 154 |
+
|
| 155 |
+
---
|
| 156 |
+
|
| 157 |
+
## π **License**
|
| 158 |
+
|
| 159 |
+
For educational purposes. Feel free to learn and adapt!
|
| 160 |
+
|
| 161 |
+
---
|
| 162 |
+
|
| 163 |
+
**Ready to become a Machine Learning Engineer?** Start with `01_Python_Core_Mastery.ipynb`! π
|
ML/README_Resources.md
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# π Professional Data Science Resource Masterlist
|
| 2 |
+
|
| 3 |
+
This document provides a curated list of high-quality resources to supplement your practice notebooks and your **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)**.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## ποΈ Core Tool Cheatsheets (PDFs & Docs)
|
| 8 |
+
* **NumPy**: [Official Cheatsheet](https://numpy.org/doc/stable/user/basics.creations.html) β Arrays, Slicing, Math.
|
| 9 |
+
* **Pandas**: [Pandas Comparison to SQL](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html) β Essential for SQL users.
|
| 10 |
+
* **Matplotlib**: [Usage Guide](https://matplotlib.org/stable/tutorials/introductory/usage.html) β Anatomy of a figure.
|
| 11 |
+
* **Scikit-Learn**: [Choosing the Right Estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) β **Legendary Flowchart**.
|
| 12 |
+
|
| 13 |
+
## π§ Theory & Concept Deep-Dives
|
| 14 |
+
* **Stats**: [Seeing Theory](https://seeing-theory.brown.edu/) β Beautiful visual statistics.
|
| 15 |
+
* **Calculus/Linear Algebra**: [3Blue1Brown (YouTube)](https://www.youtube.com/@3blue1brown) β The best visual explanations for ML math.
|
| 16 |
+
* **XGBoost/Boosting**: [The XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) β Understanding the math of boosting.
|
| 17 |
+
|
| 18 |
+
## π Practice & Challenges (Beyond this Series)
|
| 19 |
+
* **Kaggle**: [Kaggle Learn](https://www.kaggle.com/learn) β Micro-courses for specific skills.
|
| 20 |
+
* **UCI ML Repository**: [Dataset Finder](https://archive.ics.uci.edu/ml/datasets.php) β The best place for "classic" datasets.
|
| 21 |
+
* **Machine Learning Mastery**: [Jason Brownlee's Blog](https://machinelearningmastery.com/) β Practical, code-heavy tutorials.
|
| 22 |
+
|
| 23 |
+
## π οΈ Deployment & MLOps
|
| 24 |
+
* **FastAPI**: [Official Tutorial](https://fastapi.tiangolo.com/tutorial/) β Deploy your models as APIs.
|
| 25 |
+
* **Streamlit**: [Build ML Web Apps](https://streamlit.io/) β Turn your notebooks into beautiful data apps.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
> **Note**: Always keep your **[Learning Hub](https://aashishgarg13.github.io/DataScience/)** open while you work. It is specifically designed to be your primary companion for these 20 practice modules!
|
ML/requirements.txt
ADDED
|
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
pandas
|
| 2 |
+
numpy
|
| 3 |
+
matplotlib
|
| 4 |
+
seaborn
|
| 5 |
+
scikit-learn
|
| 6 |
+
scipy
|
| 7 |
+
xgboost
|
| 8 |
+
shap
|
| 9 |
+
tensorflow
|
| 10 |
+
streamlit
|
| 11 |
+
joblib
|
| 12 |
+
notebook
|
| 13 |
+
ipykernel
|