math-chatbot-v2 / README.md

pranshu dhiman

Deploy MathSutra Space

7fab45b 25 days ago

	---
	title: Math Chatbot V2
	emoji: 🌍
	colorFrom: red
	colorTo: yellow
	sdk: docker
	pinned: false
	license: mit
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# MathSutra 12

	MathSutra 12 is a domain-specific RAG chatbot for Class 12 Mathematics. It uses your 13 chapter PDFs, stores embeddings in ChromaDB, runs the LLM locally with Ollama, and answers with:

	- a detailed explanation
	- the most likely chapter
	- the most likely topic
	- supporting source chunks
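
The four parts of a response listed above could be carried in a small structure like the one below. This is only an illustrative sketch: `BotAnswer`, `SourceChunk`, and `render` are hypothetical names, not the classes used in `rag_chain.py`.

```python
from dataclasses import dataclass, field


@dataclass
class SourceChunk:
    """One retrieved passage shown as supporting evidence."""
    chapter: str
    text: str


@dataclass
class BotAnswer:
    """Hypothetical container for the four parts of a response."""
    explanation: str
    chapter: str
    topic: str
    sources: list[SourceChunk] = field(default_factory=list)

    def render(self) -> str:
        # Explanation first, then the predicted chapter and topic, then sources.
        lines = [
            self.explanation,
            f"Chapter: {self.chapter}",
            f"Topic: {self.topic}",
            "Sources:",
        ]
        lines += [f"- [{s.chapter}] {s.text}" for s in self.sources]
        return "\n".join(lines)


# Illustrative values only, not real output from the bot.
answer = BotAnswer(
    explanation="The derivative of sin x is cos x.",
    chapter="Continuity and Differentiability",
    topic="Derivatives of trigonometric functions",
    sources=[SourceChunk("Continuity and Differentiability", "d/dx (sin x) = cos x")],
)
print(answer.render())
```

Keeping the chapter, topic, and sources as separate fields makes it easy for a UI to show them in their own panels instead of parsing them out of free text.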

	## Why this project setup

	This project is best built in `VS Code`, not Colab.

	- `Ollama` runs as a local server on your machine, so an editor on the same machine, such as VS Code, is the natural place to develop against it.
	- `ChromaDB` persists data locally, so your vector database stays available between runs.
	- `Streamlit` is easier to demo from a local machine than from Colab.

	Use Colab only if you want a cloud notebook experiment. For this final project, local development in VS Code is the better choice.

	## Recommended models

	- LLM: `qwen2.5:7b-instruct`
	- Embedding model: `nomic-embed-text`

	Why this pair:

	- `qwen2.5:7b-instruct` is strong for reasoning and generally performs well on academic question answering.
	- `nomic-embed-text` is lightweight and reliable for semantic retrieval in local RAG setups.

	If your machine has enough memory for a larger model, you can also try `qwen2.5:14b`.

	## Models available right now in this workspace

	Your local Ollama already has:

	- `mistral:latest`
	- `embeddinggemma:latest`

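	The included `.env` is configured to use these installed models so you can run the project immediately. To switch to the recommended pair later, pull those models and update `.env`. If you change the embedding model, rerun `python ingest.py --reset`, because vectors produced by different embedding models are not comparable.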
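
As a rough sketch of how model names might be read from the environment, the snippet below falls back to the installed pair when nothing is set. The variable names `LLM_MODEL` and `EMBEDDING_MODEL` are assumptions for illustration; check `.env.example` and `config.py` for the keys this project actually uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSettings:
    llm_model: str
    embedding_model: str


def load_model_settings(env=None) -> ModelSettings:
    """Read model names from the environment, falling back to the installed pair."""
    env = os.environ if env is None else env
    return ModelSettings(
        # LLM_MODEL and EMBEDDING_MODEL are assumed names, not confirmed keys.
        llm_model=env.get("LLM_MODEL", "mistral:latest"),
        embedding_model=env.get("EMBEDDING_MODEL", "embeddinggemma:latest"),
    )


# Passing a dict stands in for the real environment in this example.
settings = load_model_settings({"LLM_MODEL": "qwen2.5:7b-instruct"})
print(settings)
```

Here only the LLM is overridden, so the embedding model keeps its default.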

	## Project flow

	1. Put the PDFs in a `data` folder in the project root.
	2. Install Python dependencies.
	3. Pull the Ollama models.
	4. Run ingestion to chunk, embed, and store data in ChromaDB.
	5. Launch Streamlit and ask questions.

	## Setup

	Create and activate a virtual environment, then install dependencies:

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	```

	Create your environment file:

	```bash
	cp .env.example .env
	```

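	Install Ollama and make sure its server is running, then pull the models. `ollama pull` needs a running server, and `ollama serve` blocks its terminal:

	```bash
	# Terminal 1: start the server (skip if Ollama is already running)
	ollama serve
	# Terminal 2: pull the models
	ollama pull qwen2.5:7b-instruct
	ollama pull nomic-embed-text
	```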
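
Once the server is up, an embedding request can be sketched with the standard library alone. This assumes Ollama's default local address `http://localhost:11434` and a version that exposes the `/api/embed` endpoint. The helper names `build_embed_request` and `embed` are hypothetical and are not taken from `ollama_client.py`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default local address


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Send the request; needs the Ollama server running and the model pulled."""
    with urllib.request.urlopen(build_embed_request(text, model), timeout=60) as resp:
        body = json.load(resp)
    # /api/embed returns one vector per input under "embeddings".
    return body["embeddings"][0]


# Building the request does not touch the network, so this runs offline.
request = build_embed_request("What is a determinant?")
print(request.full_url, request.get_method())
```

Calling `embed("What is a determinant?")` is a quick way to confirm that the server is reachable and the embedding model has been pulled.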

	## Ingest the PDFs

	```bash
	python ingest.py --reset
	```

	What this does:

	- discovers all chapter PDFs
	- cleans extracted text
	- chunks the content
	- adds chapter and topic metadata
	- creates embeddings with Ollama
	- stores everything in ChromaDB
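
The cleaning and chunking steps can be pictured with a minimal sketch like the one below. It assumes word-based windows with a fixed overlap. The real `pdf_processing.py` may split differently, and `clean_text` and `chunk_words` are hypothetical names.

```python
import re


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", raw)
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = clean_text("A  matrix is a rect-\nangular array\nof numbers or functions.")
pieces = chunk_words(sample, size=5, overlap=2)
print(pieces)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.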

	## Run the app

	```bash
	streamlit run app.py
	```

	## Optional CLI chat

	```bash
	python chat.py
	```

	## File overview

	- `app.py`: Streamlit showcase app
	- `ingest.py`: data ingestion pipeline
	- `chat.py`: terminal chat interface
	- `test.py`: quick smoke test for a single question
	- `src/edurag_math_bot/config.py`: settings and paths
	- `src/edurag_math_bot/catalog.py`: clean chapter metadata
	- `src/edurag_math_bot/pdf_processing.py`: PDF extraction and chunking
	- `src/edurag_math_bot/ollama_client.py`: Ollama API wrapper
	- `src/edurag_math_bot/vector_store.py`: ChromaDB wrapper
	- `src/edurag_math_bot/rag_chain.py`: retrieval + prompt + answer generation

	## Suggested final project explanation

	You can explain your system like this:

	1. Data collection: 13 NCERT Class 12 Mathematics PDFs
	2. Data cleaning: text extraction and cleanup from PDFs
	3. Chunking: splitting long text into meaningful chunks
	4. Embedding: converting chunks into vectors using `nomic-embed-text`
	5. Vector storage: storing vectors in `ChromaDB`
	6. Retrieval: finding the most relevant chunks for a user question
	7. LLM generation: sending retrieved context to `qwen2.5:7b-instruct`
	8. Output formatting: answering in detail with chapter and topic prediction
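
To make steps 6 and 7 concrete, here is a toy sketch of retrieval and prompt building. It ranks hand-made 3-dimensional vectors by cosine similarity in plain Python, which stands in for the nearest-neighbour search that ChromaDB performs. The vectors, the prompt wording, and the function names are assumptions for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], store: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored chunks whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Place the retrieved context ahead of the question for the LLM."""
    context = "\n\n".join(f"[{c['chapter']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below. "
        "Also state the most likely chapter and topic.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy store: real vectors would come from the embedding model.
store = [
    {"chapter": "Matrices", "text": "A matrix is a rectangular array of numbers.", "vector": [1.0, 0.0, 0.0]},
    {"chapter": "Determinants", "text": "Every square matrix has an associated determinant.", "vector": [0.8, 0.6, 0.0]},
    {"chapter": "Probability", "text": "Conditional probability measures P(E|F).", "vector": [0.0, 0.0, 1.0]},
]
top = retrieve([0.9, 0.1, 0.0], store, k=2)
prompt = build_prompt("What is a matrix?", top)
print([c["chapter"] for c in top])
```

The chapter label attached to each retrieved chunk is what allows the final answer to name a likely chapter alongside the explanation.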

	## Notes

	- The chapter name is detected from the PDF filename.
	- The topic is inferred from chunk headings and nearby text.
	- This works best on text-based PDFs. If your PDFs are image scans, add OCR first.
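
The first two notes can be illustrated with a small sketch. It assumes filenames such as `03_matrices.pdf` and numbered section headings such as `3.2 Matrix`. The actual naming scheme and heading rules used by `catalog.py` and `pdf_processing.py` may differ, and both function names are hypothetical.

```python
import re
from pathlib import Path


def chapter_from_filename(path: str) -> str:
    """Turn a name like '03_matrices.pdf' into 'Matrices'."""
    stem = Path(path).stem
    # Drop an optional 'ch'/'chapter' prefix and the chapter number.
    stem = re.sub(r"^(?:ch(?:apter)?[\s_-]*)?\d+[\s_-]*", "", stem, flags=re.IGNORECASE)
    return re.sub(r"[\s_-]+", " ", stem).strip().title()


# Matches lines such as '3.2 Matrix' or '5.1.3 Some heading'.
HEADING = re.compile(r"^\d+(?:\.\d+)+\s+(.+)$")


def topic_from_chunk(chunk: str, default: str = "General") -> str:
    """Use the first numbered heading found in the chunk as its topic."""
    for line in chunk.splitlines():
        match = HEADING.match(line.strip())
        if match:
            return match.group(1).strip()
    return default


print(chapter_from_filename("data/03_matrices.pdf"))
print(topic_from_chunk("3.2 Matrix\nA matrix is an ordered rectangular array."))
```

A heuristic this simple will misread any body line that happens to start with a decimal number, so treat it as a starting point for experimentation.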