{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 17 - Reinforcement Learning (Q-Learning)\n",
                "\n",
                "Welcome to Module 17! We are exploring **Reinforcement Learning** (RL). Unlike supervised learning, RL agents learn by interacting with an environment and receiving rewards or penalties.\n",
                "\n",
                "### Resources:\n",
                "Check out the **[Q-Learning Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for a breakdown of the Bellman Equation ($Q(s,a)$) and how the Agent-Environment loop works.\n",
                "\n",
                "### Objectives:\n",
                "1. **Agent-Environment Loop**: States, Actions, and Rewards.\n",
                "2. **Exploration vs. Exploitation**: The Epsilon-Greedy strategy.\n",
                "3. **Q-Table**: Learning the quality of actions.\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Environment Simulation\n",
                "We will implement a simple \"Grid World\" where an agent has to find a treasure while avoiding traps."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "\n",
                "class SimpleGridWorld:\n",
                "    def __init__(self, size=5):\n",
                "        self.size = size\n",
                "        self.state = (0, 0)\n",
                "        self.goal = (size-1, size-1)\n",
                "        self.trap = (size//2, size//2)\n",
                "        \n",
                "    def step(self, action):\n",
                "        # 0=Up, 1=Down, 2=Left, 3=Right\n",
                "        r, c = self.state\n",
                "        if action == 0: r = max(0, r-1)\n",
                "        elif action == 1: r = min(self.size-1, r+1)\n",
                "        elif action == 2: c = max(0, c-1)\n",
                "        elif action == 3: c = min(self.size-1, c+1)\n",
                "        \n",
                "        self.state = (r, c)\n",
                "        \n",
                "        if self.state == self.goal:\n",
                "            return self.state, 10, True\n",
                "        elif self.state == self.trap:\n",
                "            return self.state, -5, True\n",
                "        return self.state, -1, False\n",
                "\n",
                "    def reset(self):\n",
                "        self.state = (0, 0)\n",
                "        return self.state\n",
                "\n",
                "env = SimpleGridWorld()\n",
                "print(\"Environment initialized!\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Q-Learning Algorithm\n",
                "\n",
                "### Task 1: Training the Agent\n",
                "Initialize a Q-Table (5x5x4) with zeros and train the agent for 1000 episodes using the update rule:\n",
                "$Q(s, a) = Q(s, a) + \\alpha [R + \\gamma \\max Q(s', a') - Q(s, a)]$"
            ]
        },
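        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make the update rule concrete, here is one update worked by hand. It is only a sketch under stated assumptions: the Q-table is still all zeros, $\\alpha = 0.1$ and $\\gamma = 0.9$ (the values set in the next cell), and the agent's first move is Right from $(0, 0)$ to $(0, 1)$, which earns the step reward $R = -1$.\n",
                "\n",
                "$$\n",
                "\\begin{aligned}\n",
                "Q((0,0), \\text{Right}) &\\leftarrow 0 + 0.1 \\left[ -1 + 0.9 \\cdot 0 - 0 \\right] \\\\\n",
                "&= -0.1\n",
                "\\end{aligned}\n",
                "$$\n",
                "\n",
                "Under the same all-zero assumption, stepping Down from $(3, 4)$ into the goal at $(4, 4)$ earns $R = 10$:\n",
                "\n",
                "$$\n",
                "\\begin{aligned}\n",
                "Q((3,4), \\text{Down}) &\\leftarrow 0 + 0.1 \\left[ 10 + 0.9 \\cdot 0 - 0 \\right] \\\\\n",
                "&= 1.0\n",
                "\\end{aligned}\n",
                "$$\n",
                "\n",
                "At first only moves that enter the goal receive a positive value. Later episodes propagate that value backward through the $\\gamma \\max_{a'} Q(s', a')$ term."
            ]
        },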
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "alpha = 0.1    # Learning rate\n",
                "gamma = 0.9    # Discount factor\n",
                "epsilon = 0.2  # Exploration rate\n",
                "q_table = np.zeros((5, 5, 4))\n",
                "\n",
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "for episode in range(1000):\n",
                "    state = env.reset()\n",
                "    done = False\n",
                "    \n",
                "    while not done:\n",
                "        # Choose action\n",
                "        if np.random.uniform(0, 1) < epsilon:\n",
                "            action = np.random.choice(4) # Explore\n",
                "        else:\n",
                "            action = np.argmax(q_table[state[0], state[1]]) # Exploit\n",
                "            \n",
                "        next_state, reward, done = env.step(action)\n",
                "        \n",
                "        # Update Q-table\n",
                "        old_value = q_table[state[0], state[1], action]\n",
                "        next_max = np.max(q_table[next_state[0], next_state[1]])\n",
                "        \n",
                "        new_value = old_value + alpha * (reward + gamma * next_max - old_value)\n",
                "        q_table[state[0], state[1], action] = new_value\n",
                "        \n",
                "        state = next_state\n",
                "```\n",
                "</details>"
            ]
        },
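        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "A short worked calculation may help clarify what $\\epsilon = 0.2$ means in practice. It assumes exploration draws uniformly from all four actions, as `np.random.choice(4)` does in the solution above, so the greedy action can also be picked during an exploration step:\n",
                "\n",
                "$$\n",
                "\\begin{aligned}\n",
                "P(\\text{greedy action}) &= (1 - \\epsilon) + \\frac{\\epsilon}{4} = 0.8 + 0.05 = 0.85 \\\\\n",
                "P(\\text{each other action}) &= \\frac{\\epsilon}{4} = 0.05\n",
                "\\end{aligned}\n",
                "$$\n",
                "\n",
                "The four probabilities sum to $0.85 + 3 \\cdot 0.05 = 1$, so under this assumption the agent follows its current best guess 85% of the time."
            ]
        },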
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Policy Visualization\n",
                "\n",
                "### Task 2: What did it learn?\n",
                "Display the learned policy by showing the best action for each cell in the grid."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "policy = np.argmax(q_table, axis=2)\n",
                "print(\"Learned Policy (0=Up, 1=Down, 2=Left, 3=Right):\")\n",
                "print(policy)\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "--- \n",
                "### Awesome Work! \n",
                "You've implemented a classic RL agent from scratch. This is how robots and game AI learn!\n",
                "You have now completed the entire practice series!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}