{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 02 - Linear Regression\n",
"\n",
"In this module, we explore **Linear Regression**, a fundamental machine learning algorithm for predicting continuous values.\n",
"\n",
"### Resources:\n",
"See the [Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/) section of the hub for the linear algebra and optimization (gradient descent) behind Linear Regression.\n",
"\n",
"### Objectives:\n",
"1. **Preprocessing**: Prepare numeric and categorical features.\n",
"2. **Splitting**: Divide data into training and testing sets.\n",
"3. **Training**: Fit a Linear Regression model.\n",
"4. **Evaluation**: Use metrics like R-squared and Root Mean Squared Error (RMSE).\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use the `diamonds` dataset to predict the `price` of diamonds based on their features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Load dataset\n",
"df = sns.load_dataset('diamonds')\n",
"print(\"Dataset Shape:\", df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Preprocessing\n",
"\n",
"### Task 1: Encode Categorical Variables\n",
"The columns `cut`, `color`, and `clarity` are categorical. Use one-hot encoding to convert them into numeric indicator columns."
]
},
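{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief aside on how one-hot encoding interacts with a linear model, here is a sketch in notation that is not part of the original module (the symbols $c$, $K$ and $d_k$ are assumptions introduced only for illustration). A categorical feature $c$ with $K$ levels is expanded into $K$ indicator columns:\n",
"\n",
"$$d_k = \\mathbb{1}[c = \\text{level}_k], \\qquad \\sum_{k=1}^{K} d_k = 1$$\n",
"\n",
"Because the $K$ indicators always sum to 1, they are perfectly collinear with the model's intercept column, so the individual coefficients are not uniquely determined. Keeping only $K - 1$ indicators (for example with `drop_first=True` in `pd.get_dummies`) removes that redundancy; the omitted level becomes the baseline against which the remaining coefficients are read."
]
},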
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
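{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# One-hot encode the categorical columns; drop_first=True drops one level per feature\n",
"df_encoded = pd.get_dummies(df, columns=['cut', 'color', 'clarity'], drop_first=True)\n",
"df_encoded.head()\n",
"```\n",
"</details>"
]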
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2: Features and Target Selection\n",
"Define `X` (features) and `y` (target: `price`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Features are every column except the target\n",
"X = df_encoded.drop('price', axis=1)\n",
"y = df_encoded['price']\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 3: Train-Test Split\n",
"Split the data into 80% training and 20% testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"print(f\"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}\")\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Modeling\n",
"\n",
"### Task 4: Training the Model\n",
"Create a `LinearRegression` object and fit it on the training data."
]
},
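{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the model being fitted can be written out as follows. This is the standard ordinary least squares formulation, sketched with symbols ($n$, $p$, $\\beta_j$, $x_{ij}$) that are notation assumed here rather than taken from the module:\n",
"\n",
"$$\\hat{y}_i = \\beta_0 + \\sum_{j=1}^{p} \\beta_j x_{ij}$$\n",
"\n",
"$$\\hat{\\beta} = \\arg\\min_{\\beta} \\sum_{i=1}^{n} \\left(y_i - \\hat{y}_i\\right)^2$$\n",
"\n",
"Here $x_{ij}$ is feature $j$ of training row $i$ and $y_i$ is that row's `price`. Fitting means choosing the intercept $\\beta_0$ and the coefficients $\\beta_j$ that minimize the sum of squared residuals over the $n$ training rows. After fitting, scikit-learn exposes these values as `model.intercept_` and `model.coef_`."
]
},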
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Fit ordinary least squares on the training data\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 5: Making Predictions\n",
"Use the fitted model to predict `price` for the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Predict prices for the held-out test features\n",
"y_pred = model.predict(X_test)\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Evaluation\n",
"\n",
"### Task 6: Error Metrics\n",
"Calculate the R-squared score and the RMSE on the test set."
]
},
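{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of what the two metrics measure, their standard definitions are sketched below. The notation is assumed here for illustration: $n$ is the number of test rows, $y_i$ the true price, $\\hat{y}_i$ the predicted price, and $\\bar{y}$ the mean of the true test prices.\n",
"\n",
"$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} \\left(y_i - \\hat{y}_i\\right)^2}$$\n",
"\n",
"$$R^2 = 1 - \\frac{\\sum_{i=1}^{n} \\left(y_i - \\hat{y}_i\\right)^2}{\\sum_{i=1}^{n} \\left(y_i - \\bar{y}\\right)^2}$$\n",
"\n",
"RMSE is expressed in the same units as `price`, so lower is better. $R^2$ equals 1 for perfect predictions, 0 for a model that does no better than always predicting $\\bar{y}$, and can be negative for a model that does worse than that."
]
},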
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# R2 is the share of variance explained; RMSE is the root of the mean squared error, in units of price\n",
"r2 = r2_score(y_test, y_pred)\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"\n",
"print(f\"R2 Score: {r2:.4f}\")\n",
"print(f\"RMSE: {rmse:.2f}\")\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### Well Done! \n",
"You have successfully built and evaluated a Linear Regression model. \n",
"Next module: **Logistic Regression** for classification!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}