| title: Math Chatbot V2 | |
| emoji: 🌍 | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: docker | |
| pinned: false | |
| license: mit | |
| Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |
| # MathSutra 12 | |
| MathSutra 12 is a domain-specific RAG chatbot for Class 12 Mathematics. It uses your 13 chapter PDFs, stores embeddings in ChromaDB, runs the LLM locally with Ollama, and answers with: | |
| - a detailed explanation | |
| - the most likely chapter | |
| - the most likely topic | |
| - supporting source chunks | |
| ## Why this project setup | |
| This project is best built in `VS Code`, not Colab. | |
| - `Ollama` runs as a local server on your machine, so the editor, the app, and the model all share one environment. | |
| - `ChromaDB` persists data locally, so your vector database stays available between runs. | |
| - `Streamlit` is easier to demo from a local machine than from Colab. | |
| Use Colab only if you want a cloud notebook experiment. For this final project, local development in VS Code is the better choice. | |
| ## Recommended models | |
| - LLM: `qwen2.5:7b-instruct` | |
| - Embedding model: `nomic-embed-text` | |
| Why this pair: | |
| - `qwen2.5:7b-instruct` is an instruction-tuned 7B model that handles step-by-step reasoning well enough for academic question answering. | |
| - `nomic-embed-text` is lightweight and reliable for semantic retrieval in local RAG setups. | |
| If your machine has enough memory to run a larger model, you can also try `qwen2.5:14b`. | |
| ## Models available right now in this workspace | |
| Your local Ollama already has: | |
| - `mistral:latest` | |
| - `embeddinggemma:latest` | |
| The included `.env` is configured to use these installed models so you can run the project immediately. You can switch to the recommended pair later by updating `.env` after pulling those models. | |
| ## Project flow | |
| 1. Put the PDFs in a `data` folder in the project root. | |
| 2. Install Python dependencies. | |
| 3. Pull the Ollama models. | |
| 4. Run ingestion to chunk, embed, and store data in ChromaDB. | |
| 5. Launch Streamlit and ask questions. | |
| ## Setup | |
| Create and activate a virtual environment, then install dependencies: | |
| ```bash | |
| python3 -m venv .venv | |
| source .venv/bin/activate | |
| pip install -r requirements.txt | |
| ``` | |
| If you are not using the included `.env`, create your environment file from the example (this overwrites any existing `.env`): | |
| ```bash | |
| cp .env.example .env | |
| ``` | |
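| Install Ollama and make sure its server is running, then pull the models: | |
| ```bash | |
| # Start the server in a separate terminal (skip this if the Ollama app or service is already running) | |
| ollama serve | |
| # Pull the models once the server is up | |
| ollama pull qwen2.5:7b-instruct | |
| ollama pull nomic-embed-text | |
| ``` | |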
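Before running ingestion, it can help to confirm that the server is reachable and the models are installed. The sketch below is not part of the project files. It assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/tags` endpoint, which lists installed models. The helper names `missing_models` and `fetch_tags` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Return the required model names that are absent from an /api/tags response."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    missing = []
    for name in required:
        # Ollama lists models with an explicit tag, so "nomic-embed-text"
        # appears as "nomic-embed-text:latest".
        full_name = name if ":" in name else f"{name}:latest"
        if full_name not in installed:
            missing.append(name)
    return missing


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags(), ["qwen2.5:7b-instruct", "nomic-embed-text"])` returns an empty list when both models are available. A connection error from `fetch_tags` usually means the server is not running.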
| ## Ingest the PDFs | |
| ```bash | |
| python ingest.py --reset | |
| ``` | |
| What this does: | |
| - discovers all chapter PDFs | |
| - cleans extracted text | |
| - chunks the content | |
| - adds chapter and topic metadata | |
| - creates embeddings with Ollama | |
| - stores everything in ChromaDB | |
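The cleaning, chunking, and metadata steps above can be sketched as follows. This is a minimal illustration, not the code in `pdf_processing.py`: it assumes fixed-size character windows with overlap, and the `Chunk` class, `chunk_text` function, and default sizes are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    chapter: str  # metadata stored alongside the chunk
    index: int    # position of the chunk within its chapter


def chunk_text(text: str, chapter: str, size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Split text into overlapping character windows, tagging each with its chapter."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    # Collapse runs of whitespace left over from PDF extraction.
    cleaned = " ".join(text.split())
    chunks: list[Chunk] = []
    step = size - overlap
    for start in range(0, len(cleaned), step):
        piece = cleaned[start:start + size]
        chunks.append(Chunk(text=piece, chapter=chapter, index=len(chunks)))
        if start + size >= len(cleaned):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a definition or formula that straddles a boundary present in both neighbouring chunks, at the cost of storing some text twice.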
| ## Run the app | |
| ```bash | |
| streamlit run app.py | |
| ``` | |
| ## Optional CLI chat | |
| ```bash | |
| python chat.py | |
| ``` | |
| ## File overview | |
| - `app.py`: Streamlit showcase app | |
| - `ingest.py`: data ingestion pipeline | |
| - `chat.py`: terminal chat interface | |
| - `test.py`: quick smoke test for a single question | |
| - `src/edurag_math_bot/config.py`: settings and paths | |
| - `src/edurag_math_bot/catalog.py`: clean chapter metadata | |
| - `src/edurag_math_bot/pdf_processing.py`: PDF extraction and chunking | |
| - `src/edurag_math_bot/ollama_client.py`: Ollama API wrapper | |
| - `src/edurag_math_bot/vector_store.py`: ChromaDB wrapper | |
| - `src/edurag_math_bot/rag_chain.py`: retrieval + prompt + answer generation | |
| ## Suggested final project explanation | |
| You can explain your system like this: | |
| 1. Data collection: 13 NCERT Class 12 Mathematics PDFs | |
| 2. Data cleaning: text extraction and cleanup from PDFs | |
| 3. Chunking: splitting long text into meaningful chunks | |
| 4. Embedding: converting chunks into vectors using `nomic-embed-text` | |
| 5. Vector storage: storing vectors in `ChromaDB` | |
| 6. Retrieval: finding the most relevant chunks for a user question | |
| 7. LLM generation: sending retrieved context to `qwen2.5:7b-instruct` | |
| 8. Output formatting: answering in detail with chapter and topic prediction | |
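Steps 6 and 7 can be illustrated with a small, dependency-free sketch. In the project itself, ChromaDB performs the similarity search and Ollama produces the embeddings and the answer. Here the vectors are toy values, and `cosine`, `top_k`, `build_prompt`, and the prompt wording are all hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer the Class 12 Mathematics question using only the context below.\n"
        "State the most likely chapter and topic.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\n"
    )
```

Numbering the chunks in the prompt makes it easier to show them afterwards as supporting sources next to the answer.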
| ## Notes | |
| - The chapter name is detected from the PDF filename. | |
| - The topic is inferred from chunk headings and nearby text. | |
| - This works best on text-based PDFs. If your PDFs are image scans, add OCR first. | |
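As an illustration of the first note, a chapter name could be derived from a filename roughly like this. The naming patterns handled here (for example `03_matrices.pdf` or `Chapter-7 integrals.pdf`) are assumptions, and `chapter_from_filename` is a hypothetical helper, not the logic in `catalog.py`.

```python
import re
from pathlib import Path


def chapter_from_filename(path: str) -> str:
    """Turn a PDF filename into a readable chapter title."""
    stem = Path(path).stem
    # Drop a leading chapter number such as "03_", "7-" or "Chapter-7 ".
    stem = re.sub(r"^(?:ch(?:apter)?[\s_-]*)?\d+[\s_-]*", "", stem, flags=re.IGNORECASE)
    # Split on separators and capitalise each remaining word.
    words = re.split(r"[\s_-]+", stem)
    return " ".join(w.capitalize() for w in words if w)
```

If your files follow a different naming scheme, the regular expression is the only part that should need to change.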