{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Library Practice: Scikit-Learn (Utilities)\n",
"\n",
"While we've covered many algorithms, Scikit-Learn also provides vital utilities for data splitting, pipelines, and hyperparameter tuning.\n",
"\n",
"### Resources:\n",
"Refer to the **[Machine Learning Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for conceptual workflows of cross-validation and preprocessing.\n",
"\n",
"### Objectives:\n",
"1. **Train-Test Split**: Dividing data into training and test sets.\n",
"2. **Pipelines**: Chaining preprocessing and modeling.\n",
"3. **Cross-Validation**: Robust model evaluation.\n",
"4. **Grid Search**: Automated hyperparameter tuning.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Splitting\n",
"\n",
"### Task 1: Reproducible Split\n",
"Split the provided data into 70% train and 30% test sets, fixing `random_state` so the split is reproducible."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.datasets import make_classification\n",
"\n",
"X, y = make_classification(n_samples=1000, n_features=10, random_state=42)\n",
"\n",
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
"print(f\"Train size: {len(X_train)}, Test size: {len(X_test)}\")\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Model Pipelines\n",
"\n",
"### Task 2: Create a Pipeline\n",
"Build a pipeline that combines `StandardScaler` and `LogisticRegression`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"pipeline = Pipeline([\n",
"    ('scaler', StandardScaler()),\n",
"    ('model', LogisticRegression())\n",
"])\n",
"pipeline.fit(X_train, y_train)\n",
"print(\"Model Score:\", pipeline.score(X_test, y_test))\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Cross-Validation\n",
"\n",
"### Task 3: 5-Fold Evaluation\n",
"Evaluate a `RandomForestClassifier` using 5-fold cross-validation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rf = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"\n",
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"scores = cross_val_score(rf, X, y, cv=5)\n",
"print(\"Cross-validation scores:\", scores)\n",
"print(\"Mean accuracy:\", scores.mean())\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Hyperparameter Tuning\n",
"\n",
"### Task 4: Grid Search\n",
"Use `GridSearchCV` to find the best `max_depth` (3, 5, 10, None) for a Decision Tree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"dt = DecisionTreeClassifier(random_state=42)\n",
"params = {'max_depth': [3, 5, 10, None]}\n",
"\n",
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"grid = GridSearchCV(dt, params, cv=5)\n",
"grid.fit(X, y)\n",
"print(\"Best parameters:\", grid.best_params_)\n",
"print(\"Best score:\", grid.best_score_)\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Excellent Utility Practice!\n",
"These tools help keep your ML experiments robust and organized.\n",
"You have now covered all the core libraries!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}