{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 02 - Statistical Foundations\n", "\n", "Before diving into Machine Learning, it's essential to understand the data through **Statistics**. This module covers the foundational concepts you'll need for data analysis.\n", "\n", "### Resources:\n", "Refer to the **[Complete Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)** on your hub for interactive demos on Population vs. Sample, Central Tendency, and Dispersion.\n", "\n", "### Objectives:\n", "1. **Central Tendency**: Mean, Median, and Mode.\n", "2. **Dispersion**: Standard Deviation, Variance, and IQR.\n", "3. **Probability Distributions**: Normal Distribution and Z-Scores.\n", "4. **Hypothesis Testing**: Understanding p-values.\n", "5. **Correlation**: Relationship between variables.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from scipy import stats\n", "\n", "np.random.seed(42)\n", "data = np.random.normal(loc=100, scale=15, size=1000)\n", "df = pd.DataFrame(data, columns=['Score'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Central Tendency & Dispersion\n", "\n", "### Task 1: Basic Stats\n", "Calculate the Mean, Median, and Standard Deviation of the `Score` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
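<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# Summary statistics for the Score column, rounded to two decimals\n", "print(f\"Mean: {df['Score'].mean():.2f}\")\n", "print(f\"Median: {df['Score'].median():.2f}\")\n", "# pandas .std() returns the sample standard deviation (ddof=1)\n", "print(f\"Std Dev: {df['Score'].std():.2f}\")\n", "```\n", "</details>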
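The objectives also list Mode, Variance, and IQR, which Task 1 does not exercise. The block below is a minimal sketch of one way to compute them, assuming the same synthetic `Score` data as the Setup cell. The variable names (`variance`, `iqr`, `mode_rounded`) are illustrative, and rounding before taking the mode is an assumption made here because continuous values rarely repeat exactly.

```python
import numpy as np
import pandas as pd

# Recreate the notebook's synthetic data so this block runs on its own.
np.random.seed(42)
df = pd.DataFrame(np.random.normal(loc=100, scale=15, size=1000), columns=['Score'])

# Sample variance (pandas defaults to ddof=1); its square root is the std dev.
variance = df['Score'].var()
std_dev = df['Score'].std()

# Interquartile range: the spread of the middle 50% of the values.
q1 = df['Score'].quantile(0.25)
q3 = df['Score'].quantile(0.75)
iqr = q3 - q1

# Round to whole points first (an assumption), then take the most frequent value.
mode_rounded = df['Score'].round().mode().iloc[0]

print(f'Variance: {variance:.2f}')
print(f'IQR: {iqr:.2f} (Q1={q1:.2f}, Q3={q3:.2f})')
print(f'Mode of rounded scores: {mode_rounded:.0f}')
```

For roughly normal data the IQR should come out near 1.35 standard deviations, which makes a quick sanity check on the result.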
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Z-Scores & Outliers\n", "\n", "### Task 2: Finding Outliers\n", "A point is often considered an outlier if its Z-score is greater than 3 or less than -3. Help identify any outliers in the dataset.\n", "\n", "*Web Reference: [Outlier Detection Demo](https://aashishgarg13.github.io/DataScience/feature-engineering/)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
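<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# stats.zscore standardizes with the population standard deviation (ddof=0)\n", "df['z_score'] = stats.zscore(df['Score'])\n", "# Keep rows more than 3 standard deviations from the mean, in either direction\n", "outliers = df[df['z_score'].abs() > 3]\n", "print(f\"Number of outliers: {len(outliers)}\")\n", "print(outliers)\n", "```\n", "</details>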
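To see what `stats.zscore` does under the hood, the sketch below computes the Z-scores by hand with z = (x - mean) / std and compares them with SciPy's result. It assumes the same synthetic data as the Setup cell; the names `z_manual`, `z_scipy`, and `n_outliers` are illustrative.

```python
import numpy as np
from scipy import stats

# Recreate the notebook's synthetic scores so this block runs on its own.
np.random.seed(42)
scores = np.random.normal(loc=100, scale=15, size=1000)

# Manual z-score. NumPy's .std() defaults to ddof=0 (population std),
# which matches the default used by stats.zscore.
z_manual = (scores - scores.mean()) / scores.std()
z_scipy = stats.zscore(scores)

print('Manual and SciPy z-scores agree:', np.allclose(z_manual, z_scipy))

# Apply the |z| > 3 rule from the task.
n_outliers = int((np.abs(z_manual) > 3).sum())
print(f'Number of outliers: {n_outliers}')
```

Note that pandas `.std()` defaults to `ddof=1`, so a hand-rolled version built on `df['Score'].std()` will differ very slightly from `stats.zscore` unless `ddof=0` is passed.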
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Correlation\n", "\n", "### Task 3: Correlation Matrix\n", "Generate a second column `StudyTime` that is correlated with `Score` and calculate the Pearson correlation coefficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['StudyTime'] = df['Score'] * 0.5 + np.random.normal(0, 5, 1000)\n", "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
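<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# Restrict to the two columns of interest; df may also hold z_score from Task 2\n", "correlation = df[['Score', 'StudyTime']].corr()  # Pearson by default\n", "print(correlation)\n", "sns.heatmap(correlation, annot=True, cmap='coolwarm')\n", "plt.show()\n", "```\n", "</details>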
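The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below works that out by hand and checks it against `scipy.stats.pearsonr`, assuming the same synthetic `Score` and `StudyTime` data as above. The names `r_manual` and `r_scipy` are illustrative.

```python
import numpy as np
from scipy import stats

# Recreate the notebook's synthetic data so this block runs on its own.
np.random.seed(42)
score = np.random.normal(loc=100, scale=15, size=1000)
study_time = score * 0.5 + np.random.normal(0, 5, 1000)

# Pearson r = cov(x, y) / (std(x) * std(y)), using ddof=1 throughout
# (np.cov defaults to ddof=1, so the std calls must match it).
cov_xy = np.cov(score, study_time)[0, 1]
r_manual = cov_xy / (score.std(ddof=1) * study_time.std(ddof=1))

# SciPy returns the coefficient together with a two-sided p-value.
r_scipy, p_value = stats.pearsonr(score, study_time)

print(f'Manual r: {r_manual:.4f}')
print(f'SciPy r:  {r_scipy:.4f} (p-value: {p_value:.3g})')
```

A value close to 1 indicates a strong positive linear relationship; here the added noise keeps the coefficient below 1.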
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Hypothesis Testing (p-values)\n", "\n", "### Task 4: T-Test\n", "Test if the mean of our `Score` is significantly different from 100 using a 1-sample T-test. What is the p-value?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
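<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# One-sample t-test; null hypothesis: the population mean of Score is 100\n", "t_stat, p_val = stats.ttest_1samp(df['Score'], 100)\n", "print(f\"T-statistic: {t_stat:.3f}\")\n", "print(f\"P-value: {p_val:.3f}\")\n", "if p_val < 0.05:\n", "    print(\"Statistically significant difference: reject the null hypothesis at the 5% level.\")\n", "else:\n", "    print(\"No significant difference: fail to reject the null hypothesis at the 5% level.\")\n", "```\n", "</details>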
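The t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (mean - mu0) / (s / sqrt(n)). The sketch below reproduces the result of `stats.ttest_1samp` by hand, assuming the same synthetic data as the Setup cell. The names `mu0`, `t_manual`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

# Recreate the notebook's synthetic scores so this block runs on its own.
np.random.seed(42)
scores = np.random.normal(loc=100, scale=15, size=1000)

mu0 = 100                      # hypothesized population mean
n = scores.size
sample_std = scores.std(ddof=1)  # sample standard deviation
std_error = sample_std / np.sqrt(n)

# t-statistic: how many standard errors the sample mean lies from mu0.
t_manual = (scores.mean() - mu0) / std_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(scores, mu0)
print(f'Manual: t={t_manual:.4f}, p={p_manual:.4f}')
print(f'SciPy:  t={t_scipy:.4f}, p={p_scipy:.4f}')
```

The p-value is the probability of seeing a t-statistic at least this extreme if the true mean really were 100; it is not the probability that the null hypothesis is true.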
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "### Foundational Knowledge Unlocked! \n", "You have now mastered the mathematical core of data analysis.\n", "Next: **NumPy Mastery**." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }