{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "title-overview",
   "metadata": {},
   "source": [
    "# MathSutra 12: Full Jupyter Workflow\n",
    "\n",
    "`MathSutra 12` is a domain-specific retrieval-augmented generation (RAG) chatbot for Class 12 Mathematics.\n",
    "\n",
    "This notebook contains the complete project workflow:\n",
    "- environment setup\n",
    "- PDF discovery\n",
    "- ingestion into ChromaDB\n",
    "- retrieval and question answering with Ollama\n",
    "- result inspection for chapter and topic detection\n",
    "\n",
    "Project stack:\n",
    "- `Ollama` for LLM and embeddings\n",
    "- `ChromaDB` for vector storage\n",
    "- `PyPDF` for text extraction\n",
    "- `Streamlit` for the final showcase app"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pipeline-section",
   "metadata": {},
   "source": [
    "## Project Pipeline\n",
    "\n",
    "The end-to-end flow of this project is:\n",
    "\n",
    "`PDFs -> Text Extraction -> Cleaning -> Chunking -> Embeddings -> ChromaDB -> Retrieval -> Prompting -> Final Answer`\n",
    "\n",
    "Detailed steps:\n",
    "1. Collect 13 chapter PDFs of Class 12 Mathematics.\n",
    "2. Extract page-wise text from each PDF.\n",
    "3. Clean noisy formatting and spacing.\n",
    "4. Split text into smaller chunks.\n",
    "5. Convert chunks into embeddings.\n",
    "6. Store embeddings and metadata in `ChromaDB`.\n",
    "7. Convert the user question into an embedding.\n",
    "8. Retrieve the most relevant chunks.\n",
    "9. Build a prompt using retrieved context.\n",
    "10. Ask the Ollama LLM to generate a detailed answer.\n",
    "11. Return the answer along with predicted chapter and topic."
   ]
  },
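  {
   "cell_type": "markdown",
   "id": "chunking-sketch-note",
   "metadata": {},
   "source": [
    "Step 4 of the pipeline splits the cleaned text into smaller chunks. The project's own splitter is not shown in this notebook, so the next cell is only a minimal sketch of one common approach: fixed-size character windows that overlap, so a sentence cut at a boundary still appears whole in a neighbouring chunk. The function name `chunk_text` and the `chunk_size` and `overlap` values are illustrative assumptions, not the project's actual settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "chunking-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper for illustration only; not part of the project package.\n",
    "def chunk_text(text, chunk_size=800, overlap=100):\n",
    "    # Split text into character windows of chunk_size that share overlap characters.\n",
    "    if chunk_size <= 0 or not 0 <= overlap < chunk_size:\n",
    "        raise ValueError('need chunk_size > 0 and 0 <= overlap < chunk_size')\n",
    "    step = chunk_size - overlap  # how far each new window advances\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append(text[start:start + chunk_size])\n",
    "        if start + chunk_size >= len(text):\n",
    "            break  # this window already reaches the end of the text\n",
    "    return chunks\n",
    "\n",
    "# Small values make the overlap easy to see.\n",
    "sample = 'abcdefghijklmnopqrstuvwxyz'\n",
    "for piece in chunk_text(sample, chunk_size=10, overlap=3):\n",
    "    print(piece)"
   ]
  },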
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "install-dependencies",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "prepare-environment",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import shutil\n",
    "\n",
    "root = Path.cwd()\n",
    "env_example = root / '.env.example'\n",
    "env_file = root / '.env'\n",
    "\n",
    "# Create .env from the template only if it does not exist yet\n",
    "if env_file.exists():\n",
    "    print('.env already exists; leaving it unchanged')\n",
    "elif env_example.exists():\n",
    "    shutil.copy(env_example, env_file)\n",
    "    print('Created .env from .env.example')\n",
    "else:\n",
    "    print('.env.example is missing; create .env manually')\n",
    "\n",
    "print('Working directory:', root)"
   ]
  },
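  {
   "cell_type": "code",
   "execution_count": null,
   "id": "check-pdfs",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "EXPECTED_PDFS = 13  # one PDF per chapter, as stated in the pipeline overview\n",
    "\n",
    "pdfs = sorted(Path.cwd().glob('*.pdf'))\n",
    "print(f'Found {len(pdfs)} PDF files')\n",
    "for pdf in pdfs:\n",
    "    print('-', pdf.name)\n",
    "if len(pdfs) != EXPECTED_PDFS:\n",
    "    print(f'Warning: expected {EXPECTED_PDFS} chapter PDFs')"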
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "load-settings",
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.edurag_math_bot.config import get_settings\n",
    "\n",
    "settings = get_settings()\n",
    "settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "run-ingestion",
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.edurag_math_bot.rag_chain import run_ingestion\n",
    "\n",
    "# reset=True presumably rebuilds the vector store from scratch, so re-running this cell re-ingests every PDF\n",
    "summary = run_ingestion(settings=settings, reset=True)\n",
    "summary"
   ]
  },
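  {
   "cell_type": "code",
   "execution_count": null,
   "id": "read-summary",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "summary_path = Path('data/processed/ingestion_summary.json')\n",
    "if summary_path.exists():\n",
    "    # Parse and pretty-print the summary saved by the ingestion step\n",
    "    saved_summary = json.loads(summary_path.read_text(encoding='utf-8'))\n",
    "    print(json.dumps(saved_summary, indent=2, ensure_ascii=False))\n",
    "else:\n",
    "    print(f'{summary_path} not found; run the ingestion cell first')"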
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rag-test-section",
   "metadata": {},
   "source": [
    "## Test the Chatbot\n",
    "\n",
    "The next cells initialize the RAG assistant and run a sample Class 12 Maths question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initialize-assistant",
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.edurag_math_bot.rag_chain import MathRAGAssistant\n",
    "\n",
    "assistant = MathRAGAssistant(settings=settings)\n",
    "print('Assistant initialized successfully')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sample-question",
   "metadata": {},
   "outputs": [],
   "source": [
    "question = 'Explain continuity and differentiability and state how they are related.'\n",
    "result = assistant.answer(question)\n",
    "print(result['answer'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sample-sources",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Sources used:')\n",
    "for source in result['sources']:\n",
    "    print(source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "custom-question",
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_question = 'What is Bayes theorem in probability? Explain with formula.'\n",
    "custom_result = assistant.answer(custom_question)\n",
    "print(custom_result['answer'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "custom-sources",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Sources used for custom question:')\n",
    "for source in custom_result['sources']:\n",
    "    print(source)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mentor-summary",
   "metadata": {},
   "source": [
    "## Short Explanation for Mentor\n",
    "\n",
    "You can explain the project like this:\n",
    "\n",
    "> This project is a subject-specific RAG chatbot for Class 12 Mathematics. I used 13 chapter PDFs as the knowledge base. First, I extracted and cleaned the text from the PDFs, then split it into chunks. These chunks were converted into embeddings and stored in ChromaDB. When a user asks a question, the system embeds the question, retrieves the most relevant chunks, and passes them as context, together with the question, to an Ollama LLM. The model generates a detailed answer, and the system also reports the likely chapter and topic of the question.\n",
    "\n",
    "Current local models used in this project:\n",
    "- LLM: `mistral:latest`\n",
    "- Embedding model: `embeddinggemma:latest`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "streamlit-run",
   "metadata": {},
   "source": [
    "## Run the Streamlit App\n",
    "\n",
    "After the notebook workflow is complete, launch the final project UI from the terminal:\n",
    "\n",
    "```bash\n",
    "streamlit run app.py\n",
    "```\n",
    "\n",
    "Important project files:\n",
    "- `app.py`\n",
    "- `ingest.py`\n",
    "- `chat.py`\n",
    "- `src/edurag_math_bot/rag_chain.py`\n",
    "- `src/edurag_math_bot/pdf_processing.py`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "EduRag Math Bot (.venv)",
   "language": "python",
   "name": "edurag-math-bot"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}