{ "cells": [ { "cell_type": "markdown", "id": "title-overview", "metadata": {}, "source": [ "# MathSutra 12: Full Jupyter Workflow\n", "\n", "`MathSutra 12` is a domain-specific RAG chatbot for Class 12 Mathematics.\n", "\n", "This notebook contains the complete project workflow:\n", "- environment setup\n", "- PDF discovery\n", "- ingestion into ChromaDB\n", "- retrieval and question answering with Ollama\n", "- result inspection for chapter and topic detection\n", "\n", "Project stack:\n", "- `Ollama` for LLM and embeddings\n", "- `ChromaDB` for vector storage\n", "- `PyPDF` for text extraction\n", "- `Streamlit` for the final showcase app" ] }, { "cell_type": "markdown", "id": "pipeline-section", "metadata": {}, "source": [ "## Project Pipeline\n", "\n", "The end-to-end flow of this project is:\n", "\n", "`PDFs -> Text Extraction -> Cleaning -> Chunking -> Embeddings -> ChromaDB -> Retrieval -> Prompting -> Final Answer`\n", "\n", "Detailed steps:\n", "1. Collect 13 chapter PDFs of Class 12 Mathematics.\n", "2. Extract page-wise text from each PDF.\n", "3. Clean noisy formatting and spacing.\n", "4. Split text into smaller chunks.\n", "5. Convert chunks into embeddings.\n", "6. Store embeddings and metadata in `ChromaDB`.\n", "7. Convert the user question into an embedding.\n", "8. Retrieve the most relevant chunks.\n", "9. Build a prompt using retrieved context.\n", "10. Ask the Ollama LLM to generate a detailed answer.\n", "11. Return the answer along with predicted chapter and topic." 
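, "\n",
"\n",
"Steps 3 and 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/edurag_math_bot/pdf_processing.py`: the helper names `clean_text` and `chunk_text`, the character-based windows, and the `chunk_size` and `overlap` defaults are all hypothetical.\n",
"\n",
"```python\n",
"def clean_text(text: str) -> str:\n",
"    # Collapse runs of spaces, tabs, and newlines into single spaces.\n",
"    return ' '.join(text.split())\n",
"\n",
"\n",
"def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:\n",
"    # Hypothetical helper: overlapping character windows, not the project's splitter.\n",
"    if chunk_size <= 0 or not 0 <= overlap < chunk_size:\n",
"        raise ValueError('need chunk_size > 0 and 0 <= overlap < chunk_size')\n",
"    step = chunk_size - overlap\n",
"    chunks = []\n",
"    for start in range(0, len(text), step):\n",
"        piece = text[start:start + chunk_size].strip()\n",
"        if piece:\n",
"            chunks.append(piece)\n",
"        if start + chunk_size >= len(text):\n",
"            break  # the last window already reached the end of the text\n",
"    return chunks\n",
"\n",
"\n",
"sample = clean_text('A function is   continuous   if its graph has no breaks.')\n",
"print(chunk_text(sample, chunk_size=30, overlap=10))\n",
"```\n",
"\n",
"The overlap repeats the tail of each chunk at the start of the next one, so a sentence cut at a boundary is more likely to appear intact in one of the two chunks."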
] }, { "cell_type": "code", "execution_count": null, "id": "install-dependencies", "metadata": {}, "outputs": [], "source": [ "%pip install -r requirements.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "prepare-environment", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import shutil\n", "\n", "root = Path.cwd()\n", "env_example = root / '.env.example'\n", "env_file = root / '.env'\n", "\n", "# Create .env from the template only when it does not exist yet.\n", "if env_file.exists():\n", "    print('.env already exists')\n", "elif env_example.exists():\n", "    shutil.copy(env_example, env_file)\n", "    print('Created .env from .env.example')\n", "else:\n", "    print('.env.example is missing; create .env manually')\n", "\n", "print('Working directory:', root)" ] }, { "cell_type": "code", "execution_count": null, "id": "check-pdfs", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# List the chapter PDFs found in the working directory (the project expects 13).\n", "pdfs = sorted(Path.cwd().glob('*.pdf'))\n", "print(f'Found {len(pdfs)} PDF files')\n", "for pdf in pdfs:\n", "    print('-', pdf.name)" ] }, { "cell_type": "code", "execution_count": null, "id": "load-settings", "metadata": {}, "outputs": [], "source": [ "from src.edurag_math_bot.config import get_settings\n", "\n", "settings = get_settings()\n", "settings" ] }, { "cell_type": "code", "execution_count": null, "id": "run-ingestion", "metadata": {}, "outputs": [], "source": [ "from src.edurag_math_bot.rag_chain import run_ingestion\n", "\n", "summary = run_ingestion(settings=settings, reset=True)\n", "summary" ] }, { "cell_type": "code", "execution_count": null, "id": "read-summary", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "# Pretty-print the ingestion summary file.\n", "summary_path = Path('data/processed/ingestion_summary.json')\n", "summary_data = json.loads(summary_path.read_text(encoding='utf-8'))\n", "print(json.dumps(summary_data, indent=2))" ] }, { "cell_type": "markdown", "id": "rag-test-section", "metadata": {}, "source": [ "## Test the Chatbot\n", "\n", "The next cells initialize the RAG assistant and run a sample Class 12 Maths question."
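, "\n",
"\n",
"For intuition about steps 7 and 8 of the pipeline, the sketch below ranks stored chunks by cosine similarity to a question vector. It is only an illustration: ChromaDB performs the real search in this project and may use a different distance metric, and `cosine_similarity`, `top_k_chunks`, and the three-dimensional toy vectors are hypothetical stand-ins for real embeddings.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Dot product divided by the product of the two vector lengths.\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    if norm_a == 0 or norm_b == 0:\n",
"        return 0.0  # treat a zero vector as similar to nothing\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"\n",
"def top_k_chunks(query_vec, chunk_vecs, k=2):\n",
"    # Score every stored chunk, then keep the k highest-scoring ids.\n",
"    scored = [(cosine_similarity(query_vec, vec), chunk_id)\n",
"              for chunk_id, vec in chunk_vecs.items()]\n",
"    scored.sort(reverse=True)\n",
"    return [chunk_id for _, chunk_id in scored[:k]]\n",
"\n",
"\n",
"# Toy vectors invented for this example; real embeddings have many more dimensions.\n",
"toy_store = {\n",
"    'continuity': [0.9, 0.1, 0.0],\n",
"    'probability': [0.0, 0.2, 0.9],\n",
"    'derivatives': [0.8, 0.3, 0.1],\n",
"}\n",
"print(top_k_chunks([1.0, 0.2, 0.0], toy_store, k=2))\n",
"```\n",
"\n",
"The ids returned here play the role of the retrieved chunks whose text is placed in the prompt."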
] }, { "cell_type": "code", "execution_count": null, "id": "initialize-assistant", "metadata": {}, "outputs": [], "source": [ "from src.edurag_math_bot.rag_chain import MathRAGAssistant\n", "\n", "assistant = MathRAGAssistant(settings=settings)\n", "print('Assistant initialized successfully')" ] }, { "cell_type": "code", "execution_count": null, "id": "sample-question", "metadata": {}, "outputs": [], "source": [ "question = 'Explain continuity and differentiability and state how they are related.'\n", "result = assistant.answer(question)\n", "print(result['answer'])" ] }, { "cell_type": "code", "execution_count": null, "id": "sample-sources", "metadata": {}, "outputs": [], "source": [ "print('Sources used:')\n", "for source in result['sources']:\n", " print(source)" ] }, { "cell_type": "code", "execution_count": null, "id": "custom-question", "metadata": {}, "outputs": [], "source": [ "custom_question = 'What is Bayes theorem in probability? Explain with formula.'\n", "custom_result = assistant.answer(custom_question)\n", "print(custom_result['answer'])" ] }, { "cell_type": "code", "execution_count": null, "id": "custom-sources", "metadata": {}, "outputs": [], "source": [ "print('Sources used for custom question:')\n", "for source in custom_result['sources']:\n", " print(source)" ] }, { "cell_type": "markdown", "id": "mentor-summary", "metadata": {}, "source": [ "## Short Explanation for Mentor\n", "\n", "You can explain the project like this:\n", "\n", "> This project is a subject-specific RAG chatbot for Class 12 Mathematics. I used 13 chapter PDFs as the knowledge base. First, I extracted and cleaned text from the PDFs, then split it into chunks. These chunks were converted into embeddings and stored in ChromaDB. When a user asks a question, the system retrieves the most relevant chunks and sends them to an Ollama LLM. 
The model generates a detailed answer and also identifies the likely chapter and topic of the question.\n", "\n", "Current local models used in this project:\n", "- LLM: `mistral:latest`\n", "- Embedding model: `embeddinggemma:latest`" ] }, { "cell_type": "markdown", "id": "streamlit-run", "metadata": {}, "source": [ "## Run the Streamlit App\n", "\n", "After the notebook workflow is complete, launch the final project UI from the terminal:\n", "\n", "```bash\n", "streamlit run app.py\n", "```\n", "\n", "Important project files:\n", "- `app.py`\n", "- `ingest.py`\n", "- `chat.py`\n", "- `src/edurag_math_bot/rag_chain.py`\n", "- `src/edurag_math_bot/pdf_processing.py`" ] } ], "metadata": { "kernelspec": { "display_name": "EduRag Math Bot (.venv)", "language": "python", "name": "edurag-math-bot" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.10" } }, "nbformat": 4, "nbformat_minor": 5 }