{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 02 - Statistical Foundations\n",
                "\n",
                "Before diving into Machine Learning, it's essential to understand the data through **Statistics**. This module covers the foundational concepts you'll need for data analysis.\n",
                "\n",
                "### Resources:\n",
                "Refer to the **[Complete Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)** on your hub for interactive demos on Population vs. Sample, Central Tendency, and Dispersion.\n",
                "\n",
                "### Objectives:\n",
                "1. **Central Tendency**: Mean, Median, and Mode.\n",
                "2. **Dispersion**: Standard Deviation, Variance, and IQR.\n",
                "3. **Probability Distributions**: Normal Distribution and Z-Scores.\n",
                "4. **Correlation**: Relationship between variables.\n",
                "5. **Hypothesis Testing**: Understanding p-values.\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Setup"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "from scipy import stats\n",
                "\n",
                "np.random.seed(42)\n",
                "data = np.random.normal(loc=100, scale=15, size=1000)\n",
                "df = pd.DataFrame(data, columns=['Score'])\n",
                "df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Central Tendency & Dispersion\n",
                "\n",
                "### Task 1: Basic Stats\n",
                "Calculate the Mean, Median, and Standard Deviation of the `Score` column."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "print(f\"Mean: {df['Score'].mean()}\")\n",
                "print(f\"Median: {df['Score'].median()}\")\n",
                "print(f\"Std Dev: {df['Score'].std()}\")\n",
                "```\n",
                "</details>"
            ]
        },
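        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The objectives also list Mode, Variance, and IQR, which Task 1 does not ask for. Below is a minimal sketch of how they could be computed, assuming the same synthetic data as the Setup cell. The names `scores`, `iqr`, and `mode_rounded` are illustrative and not part of the notebook.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "\n",
                "# Rebuild the Setup data so this block runs on its own (assumption: same seed and parameters).\n",
                "np.random.seed(42)\n",
                "scores = pd.Series(np.random.normal(loc=100, scale=15, size=1000), name='Score')\n",
                "\n",
                "variance = scores.var()  # sample variance (pandas default: ddof=1)\n",
                "q1, q3 = scores.quantile([0.25, 0.75])\n",
                "iqr = q3 - q1  # spread of the middle 50% of values\n",
                "\n",
                "# The mode is only informative when values repeat, so round to whole points first.\n",
                "mode_rounded = scores.round().mode().iloc[0]\n",
                "\n",
                "print(f'Variance: {variance:.2f}')\n",
                "print(f'IQR: {iqr:.2f}')\n",
                "print(f'Mode (rounded scores): {mode_rounded:.0f}')\n",
                "```\n",
                "\n",
                "Variance is the square of the standard deviation, so `scores.var()` should equal `scores.std() ** 2` up to floating-point error."
            ]
        },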
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Z-Scores & Outliers\n",
                "\n",
                "### Task 2: Finding Outliers\n",
                "A point is often considered an outlier if its Z-score is greater than 3 or less than -3, that is, if it lies more than three standard deviations from the mean. Identify any outliers in the `Score` column.\n",
                "\n",
                "*Web Reference: [Outlier Detection Demo](https://aashishgarg13.github.io/DataScience/feature-engineering/)*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "df['z_score'] = stats.zscore(df['Score'])\n",
                "outliers = df[df['z_score'].abs() > 3]\n",
                "print(f\"Number of outliers: {len(outliers)}\")\n",
                "print(outliers)\n",
                "```\n",
                "</details>"
            ]
        },
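        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To see what `stats.zscore` is doing, the Z-score can also be computed by hand as `(x - mean) / std`. This is a sketch under the assumption that the data matches the Setup cell. `z_manual` and `z_scipy` are illustrative names.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "from scipy import stats\n",
                "\n",
                "np.random.seed(42)\n",
                "scores = np.random.normal(loc=100, scale=15, size=1000)\n",
                "\n",
                "# scipy.stats.zscore defaults to the population standard deviation (ddof=0),\n",
                "# so the manual version uses ddof=0 as well.\n",
                "z_manual = (scores - scores.mean()) / scores.std(ddof=0)\n",
                "z_scipy = stats.zscore(scores)\n",
                "\n",
                "print(np.allclose(z_manual, z_scipy))\n",
                "print(f'Points with |z| > 3: {(np.abs(z_manual) > 3).sum()}')\n",
                "```\n",
                "\n",
                "Note that pandas `Series.std()` defaults to `ddof=1`, so Z-scores built from it differ very slightly from `stats.zscore`. With 1000 points the difference is negligible."
            ]
        },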
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 4. Correlation\n",
                "\n",
                "### Task 3: Correlation Matrix\n",
                "The cell below generates a second column, `StudyTime`, that is correlated with `Score`. Calculate the Pearson correlation coefficient between the two columns."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "df['StudyTime'] = df['Score'] * 0.5 + np.random.normal(0, 5, 1000)\n",
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
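                "# Select the two columns explicitly so the helper column `z_score` from Task 2 is excluded.\n",
                "correlation = df[['Score', 'StudyTime']].corr()\n",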
                "print(correlation)\n",
                "sns.heatmap(correlation, annot=True, cmap='coolwarm')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
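        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that definition against `scipy.stats.pearsonr`, assuming the same data-generating code as the cells above. `score`, `study_time`, and `r_manual` are illustrative names.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "from scipy import stats\n",
                "\n",
                "np.random.seed(42)\n",
                "score = np.random.normal(loc=100, scale=15, size=1000)\n",
                "study_time = score * 0.5 + np.random.normal(0, 5, 1000)\n",
                "\n",
                "# r = cov(x, y) / (std(x) * std(y)), using the same ddof throughout.\n",
                "cov_xy = np.cov(score, study_time, ddof=1)[0, 1]\n",
                "r_manual = cov_xy / (score.std(ddof=1) * study_time.std(ddof=1))\n",
                "\n",
                "# pearsonr also returns a p-value for the null hypothesis of zero correlation.\n",
                "r_scipy, p_value = stats.pearsonr(score, study_time)\n",
                "\n",
                "print(f'Manual r: {r_manual:.4f}')\n",
                "print(f'SciPy r:  {r_scipy:.4f}')\n",
                "```\n",
                "\n",
                "The two values should agree up to floating-point error. A value close to 1 indicates a strong positive linear relationship."
            ]
        },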
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 5. Hypothesis Testing (p-values)\n",
                "\n",
                "### Task 4: T-Test\n",
                "Test if the mean of our `Score` is significantly different from 100 using a 1-sample T-test. What is the p-value?"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "t_stat, p_val = stats.ttest_1samp(df['Score'], 100)\n",
                "print(f\"T-statistic: {t_stat}\")\n",
                "print(f\"P-value: {p_val}\")\n",
                "if p_val < 0.05:\n",
                "    print(\"Statistically significant difference!\")\n",
                "else:\n",
                "    print(\"No significant difference.\")\n",
                "```\n",
                "</details>"
            ]
        },
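        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "For intuition about where the p-value comes from, the one-sample t-statistic can be computed by hand and compared with `stats.ttest_1samp`. This is a sketch, assuming the Setup data and a two-sided test. `t_manual` and `p_manual` are illustrative names.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "from scipy import stats\n",
                "\n",
                "np.random.seed(42)\n",
                "scores = np.random.normal(loc=100, scale=15, size=1000)\n",
                "\n",
                "n = scores.size\n",
                "# t = (sample mean - hypothesized mean) / (sample std / sqrt(n))\n",
                "t_manual = (scores.mean() - 100) / (scores.std(ddof=1) / np.sqrt(n))\n",
                "# Two-sided p-value from the t distribution with n - 1 degrees of freedom.\n",
                "p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)\n",
                "\n",
                "t_scipy, p_scipy = stats.ttest_1samp(scores, 100)\n",
                "print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))\n",
                "```\n",
                "\n",
                "The p-value is the probability of seeing a t-statistic at least this extreme if the true mean really were 100. It is not the probability that the null hypothesis is true."
            ]
        },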
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "---\n",
                "### Foundational Knowledge Unlocked!\n",
                "You have now practiced the statistical core of data analysis: central tendency, dispersion, Z-scores, correlation, and hypothesis testing.\n",
                "Next: **NumPy Mastery**."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}