{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 02 - Statistical Foundations\n",
"\n",
"Before diving into machine learning, you need to understand your data, and that starts with **statistics**. This module covers the foundational concepts used throughout data analysis.\n",
"\n",
"### Resources:\n",
"Refer to the **[Complete Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/)** on your hub for interactive demos on Population vs. Sample, Central Tendency, and Dispersion.\n",
"\n",
"### Objectives:\n",
"1. **Central Tendency**: Mean, Median, and Mode.\n",
"2. **Dispersion**: Standard Deviation, Variance, and IQR.\n",
"3. **Probability Distributions**: Normal Distribution and Z-Scores.\n",
"4. **Hypothesis Testing**: Understanding p-values.\n",
"5. **Correlation**: Relationship between variables.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from scipy import stats\n",
"\n",
"np.random.seed(42)\n",
"data = np.random.normal(loc=100, scale=15, size=1000)\n",
"df = pd.DataFrame(data, columns=['Score'])\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Central Tendency & Dispersion\n",
"\n",
"### Task 1: Basic Stats\n",
"Calculate the Mean, Median, and Standard Deviation of the `Score` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
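"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# pandas' .std() returns the sample standard deviation (ddof=1) by default\n",
"print(f\"Mean: {df['Score'].mean():.2f}\")\n",
"print(f\"Median: {df['Score'].median():.2f}\")\n",
"print(f\"Std Dev: {df['Score'].std():.2f}\")\n",
"```\n",
"</details>"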
]
},
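{
"cell_type": "markdown",
"metadata": {},
"source": [
"Task 1 covers the mean, median, and standard deviation, while the objectives also list mode, variance, and IQR. The block below is a minimal sketch of those three on a small made-up series; the name `scores` and its values are assumptions for illustration and are not part of the notebook's `df`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, chosen so the results are easy to check by hand\n",
"scores = pd.Series([70, 85, 85, 90, 100, 110, 130])\n",
"\n",
"# Mode: the most frequent value (mode() returns a Series, since ties are possible)\n",
"mode = scores.mode().iloc[0]\n",
"\n",
"# Variance: pandas uses the sample variance (ddof=1) by default,\n",
"# so it equals the square of scores.std()\n",
"variance = scores.var()\n",
"\n",
"# IQR: spread of the middle 50% of the data (Q3 - Q1)\n",
"q1, q3 = scores.quantile([0.25, 0.75])\n",
"iqr = q3 - q1\n",
"\n",
"print(f'Mode: {mode}')\n",
"print(f'Variance: {variance:.2f}')\n",
"print(f'IQR: {iqr}')\n",
"```\n",
"\n",
"The same three calls should work on `df['Score']`, although the mode of continuous data is rarely informative because almost every value is unique."
]
},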
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Z-Scores & Outliers\n",
"\n",
"### Task 2: Finding Outliers\n",
"A point is often considered an outlier if its Z-score (its distance from the mean, measured in standard deviations) is greater than 3 or less than -3. Identify any outliers in the `Score` column.\n",
"\n",
"*Web Reference: [Outlier Detection Demo](https://aashishgarg13.github.io/DataScience/feature-engineering/)*"
]
},
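{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Z-score rule above concrete, here is a small sketch that computes Z-scores by hand on made-up data. The array `values` is a hypothetical example, not the notebook's dataset, and the threshold of 3 is the rule of thumb stated in the task.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical toy data: 19 identical scores plus one extreme value\n",
"values = np.array([50.0] * 19 + [95.0])\n",
"\n",
"# z = (x - mean) / std, using the population standard deviation (ddof=0),\n",
"# which is also the default used by scipy.stats.zscore\n",
"z = (values - values.mean()) / values.std()\n",
"\n",
"# Keep the points that lie more than 3 standard deviations from the mean\n",
"outliers = values[np.abs(z) > 3]\n",
"print(outliers)\n",
"```\n",
"\n",
"Note that a single extreme value inflates the standard deviation itself, so in very small samples no point can reach a Z-score of 3 at all."
]
},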
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
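"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Z-score: distance from the mean in units of standard deviation\n",
"df['z_score'] = stats.zscore(df['Score'])\n",
"# Keep rows that lie more than 3 standard deviations from the mean\n",
"outliers = df[df['z_score'].abs() > 3]\n",
"print(f\"Number of outliers: {len(outliers)}\")\n",
"print(outliers)\n",
"```\n",
"</details>"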
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Correlation\n",
"\n",
"### Task 3: Correlation Matrix\n",
"The cell below generates a second column, `StudyTime`, that is correlated with `Score`. Calculate the Pearson correlation coefficient between the two columns."
]
},
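{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it from that definition on made-up arrays `x` and `y` (hypothetical values, unrelated to `Score` and `StudyTime`) and cross-checks the result against `np.corrcoef`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical toy data\n",
"x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])\n",
"\n",
"# Deviations of each point from its variable's mean\n",
"dx, dy = x - x.mean(), y - y.mean()\n",
"\n",
"# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))\n",
"r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())\n",
"\n",
"# Cross-check: the off-diagonal entry of NumPy's correlation matrix\n",
"r_numpy = np.corrcoef(x, y)[0, 1]\n",
"\n",
"print(f'Manual: {r_manual:.4f}')\n",
"print(f'NumPy:  {r_numpy:.4f}')\n",
"```\n",
"\n",
"The coefficient always lies between -1 and 1, and it only measures linear association."
]
},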
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['StudyTime'] = df['Score'] * 0.5 + np.random.normal(0, 5, 1000)\n",
"# YOUR CODE HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
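"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Restrict to the two columns of interest; df may also hold z_score from Task 2\n",
"correlation = df[['Score', 'StudyTime']].corr()  # Pearson by default\n",
"print(correlation)\n",
"sns.heatmap(correlation, annot=True, cmap='coolwarm')\n",
"plt.show()\n",
"```\n",
"</details>"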
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Hypothesis Testing (p-values)\n",
"\n",
"### Task 4: T-Test\n",
"Use a one-sample t-test to check whether the mean of `Score` differs significantly from 100. What is the p-value, and is it below the 0.05 significance level?"
]
},
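{
"cell_type": "markdown",
"metadata": {},
"source": [
"As background for the task above, the sketch below computes a one-sample t-statistic and its two-sided p-value by hand, then compares them with `stats.ttest_1samp`. The array `sample` and the reference mean `mu0` are hypothetical values chosen for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# Hypothetical toy sample and the mean stated by the null hypothesis\n",
"sample = np.array([98.0, 102.0, 105.0, 97.0, 101.0, 103.0])\n",
"mu0 = 100.0\n",
"\n",
"# t = (sample mean - mu0) / (sample std / sqrt(n)), with ddof=1\n",
"n = sample.size\n",
"t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))\n",
"\n",
"# Two-sided p-value from the t distribution with n - 1 degrees of freedom\n",
"p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)\n",
"\n",
"# Cross-check against SciPy's built-in test\n",
"t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)\n",
"\n",
"print(f'Manual: t={t_manual:.4f}, p={p_manual:.4f}')\n",
"print(f'SciPy:  t={t_scipy:.4f}, p={p_scipy:.4f}')\n",
"```\n",
"\n",
"A p-value above the chosen threshold means the data are consistent with the null hypothesis; it does not prove that the true mean equals `mu0`."
]
},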
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Null hypothesis: the population mean of Score is 100\n",
"t_stat, p_val = stats.ttest_1samp(df['Score'], 100)\n",
"print(f\"T-statistic: {t_stat}\")\n",
"print(f\"P-value: {p_val}\")\n",
"# Compare against the conventional 0.05 significance level\n",
"if p_val < 0.05:\n",
"    print(\"Statistically significant difference!\")\n",
"else:\n",
"    print(\"No significant difference.\")\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### Foundational Knowledge Unlocked! \n",
"You have now practiced the statistical core of data analysis: central tendency, dispersion, Z-scores, correlation, and hypothesis testing.\n",
"Next: **NumPy Mastery**."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}