pranshu dhiman

Deploy MathSutra Space

7fab45b 24 days ago

metadata

title: Math Chatbot V2
emoji: 🌍
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
license: mit

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

MathSutra 12

MathSutra 12 is a domain-specific retrieval-augmented generation (RAG) chatbot for Class 12 Mathematics. It indexes your 13 chapter PDFs, stores their embeddings in ChromaDB, runs the LLM locally with Ollama, and returns with every answer:

a detailed explanation
the most likely chapter
the most likely topic
supporting source chunks
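The four answer parts listed above could be carried in one small structure. This is a minimal sketch, not the project's actual code: the names BotAnswer and format_answer are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BotAnswer:
    """Hypothetical container for the four parts of a reply."""
    explanation: str
    chapter: str
    topic: str
    sources: list[str] = field(default_factory=list)


def format_answer(answer: BotAnswer) -> str:
    """Render the answer as plain text for a terminal or a UI panel."""
    lines = [
        answer.explanation,
        f"Chapter: {answer.chapter}",
        f"Topic: {answer.topic}",
    ]
    # Number the supporting chunks so the reader can refer to them.
    for i, chunk in enumerate(answer.sources, start=1):
        lines.append(f"Source {i}: {chunk}")
    return "\n".join(lines)
```

Keeping the chapter and topic as separate fields, rather than embedding them in the explanation text, makes it easy to show them as labels in Streamlit or the CLI.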

Why this project setup

This project is best built locally in VS Code rather than in Colab:

Ollama runs on your own machine, so the model server sits next to the code you are editing.
ChromaDB persists data to local disk, so your vector database stays available between runs.
Streamlit is easier to demo from a local machine than from a Colab notebook.

Use Colab only for a quick cloud notebook experiment.

Recommended models

LLM: qwen2.5:7b-instruct
Embedding model: nomic-embed-text

Why this pair:

qwen2.5:7b-instruct is an instruction-tuned model that generally handles step-by-step reasoning and academic question answering well for its size.
nomic-embed-text is lightweight and reliable for semantic retrieval in local RAG setups.

If your machine has enough memory, you can also try qwen2.5:14b.

Models available right now in this workspace

Your local Ollama already has:

mistral:latest
embeddinggemma:latest

The included .env is configured to use these installed models so you can run the project immediately. You can switch to the recommended pair later by updating .env after pulling those models.
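Switching models through .env might look like the sketch below. The variable names LLM_MODEL and EMBEDDING_MODEL and the helper load_model_settings are assumptions for illustration; check .env.example and config.py for the names the project really uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSettings:
    llm_model: str
    embedding_model: str


def load_model_settings(env=None) -> ModelSettings:
    """Read model names from the environment, falling back to the installed models."""
    env = os.environ if env is None else env
    return ModelSettings(
        # LLM_MODEL and EMBEDDING_MODEL are assumed variable names.
        llm_model=env.get("LLM_MODEL", "mistral:latest"),
        embedding_model=env.get("EMBEDDING_MODEL", "embeddinggemma:latest"),
    )
```

If you change the embedding model, rerun ingestion with --reset: vectors produced by different embedding models are not comparable, so the stored chunks must be embedded again.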

Project flow

Put the PDFs in a data folder in the project root.
Install Python dependencies.
Pull the Ollama models.
Run ingestion to chunk, embed, and store data in ChromaDB.
Launch Streamlit and ask questions.

Setup

Create and activate a virtual environment, then install dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create your environment file if the project does not already have a .env (copying overwrites an existing one):

cp .env.example .env

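Install Ollama and start its server. Skip this step if the Ollama app or background service is already running:

ollama serve

Then, in a second terminal, pull the models (pulling needs a running server):

ollama pull qwen2.5:7b-instruct
ollama pull nomic-embed-text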
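To confirm that the server is up and the models are installed, you can query Ollama's /api/tags endpoint, which lists local models. The sketch below assumes the default address http://localhost:11434; the helper names are hypothetical and not part of the project.

```python
import json
import urllib.request


def installed_models(payload: dict) -> set[str]:
    """Extract model names from an /api/tags response body."""
    return {m["name"] for m in payload.get("models", [])}


def missing_models(payload: dict, wanted: list[str]) -> list[str]:
    """Return wanted models that are not installed.

    A name without a tag is also matched against 'name:latest'.
    """
    have = installed_models(payload)
    missing = []
    for name in wanted:
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        if not (candidates & have):
            missing.append(name)
    return missing


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Call the running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, missing_models(fetch_tags(), ["qwen2.5:7b-instruct", "nomic-embed-text"]) returns the models that still need an ollama pull.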

Ingest the PDFs

python ingest.py --reset

What this does:

discovers all chapter PDFs
cleans extracted text
chunks the content
adds chapter and topic metadata
creates embeddings with Ollama
stores everything in ChromaDB
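The cleaning, chunking, and metadata steps above can be sketched in plain Python. This is an illustration under assumptions, not the code in pdf_processing.py: the function names, the character-window strategy, and the default sizes are all hypothetical, and topic metadata is left out for brevity.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace left over from PDF extraction."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(chapter: str, raw_text: str,
                  chunk_size: int = 800, overlap: int = 100) -> list[dict]:
    """Attach an id and chapter metadata to every chunk of one chapter."""
    chunks = chunk_text(clean_text(raw_text), chunk_size, overlap)
    return [
        {
            "id": f"{chapter}-{i}",
            "text": chunk,
            "metadata": {"chapter": chapter, "chunk_index": i},
        }
        for i, chunk in enumerate(chunks)
    ]
```

The overlap keeps a definition or formula that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.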

Run the app

streamlit run app.py

Optional CLI chat

python chat.py

File overview

app.py: Streamlit showcase app
ingest.py: data ingestion pipeline
chat.py: terminal chat interface
test.py: quick smoke test for a single question
src/edurag_math_bot/config.py: settings and paths
src/edurag_math_bot/catalog.py: clean chapter metadata
src/edurag_math_bot/pdf_processing.py: PDF extraction and chunking
src/edurag_math_bot/ollama_client.py: Ollama API wrapper
src/edurag_math_bot/vector_store.py: ChromaDB wrapper
src/edurag_math_bot/rag_chain.py: retrieval + prompt + answer generation

Suggested final project explanation

You can explain your system like this:

Data collection: 13 NCERT Class 12 Mathematics PDFs
Data cleaning: text extraction and cleanup from PDFs
Chunking: splitting long text into meaningful chunks
Embedding: converting chunks into vectors using nomic-embed-text
Vector storage: storing vectors in ChromaDB
Retrieval: finding the most relevant chunks for a user question
LLM generation: sending retrieved context to qwen2.5:7b-instruct
Output formatting: answering in detail with chapter and topic prediction
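When explaining the retrieval and generation steps, a small stand-in can help. In the project, ChromaDB performs the similarity search; the sketch below only illustrates the idea with cosine similarity over in-memory vectors. The function names and the prompt wording are assumptions, not the contents of rag_chain.py.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], records: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k records most similar to the query vector."""
    scored = sorted(
        records,
        key=lambda r: cosine_similarity(query_vec, r[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks ahead of the question as numbered context."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

The query vector must come from the same embedding model that was used during ingestion, otherwise the similarity scores are meaningless.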

Notes

The chapter name is detected from the PDF filename.
The topic is inferred from chunk headings and nearby text.
This works best on text-based PDFs. If your PDFs are image scans, run OCR on them before ingestion, because scanned pages yield little or no extractable text.
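The first and last notes can be illustrated with two small helpers. Both are sketches under assumptions: the filename patterns (for example 03_matrices.pdf), the helper names, and the 50-character threshold are hypothetical, and the project's catalog.py may map filenames differently.

```python
import re
from pathlib import Path


def chapter_from_filename(pdf_path: str) -> str:
    """Derive a readable chapter name from a PDF filename."""
    stem = Path(pdf_path).stem
    # Drop a leading chapter number such as "03_", "ch5-", or "chapter_1_".
    stem = re.sub(r"^(?:ch(?:apter)?[\s_-]*)?\d+[\s_-]*", "", stem, flags=re.IGNORECASE)
    words = re.split(r"[\s_-]+", stem)
    return " ".join(w.capitalize() for w in words if w)


def looks_scanned(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is an image scan from how little text was extracted."""
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page
```

A check like looks_scanned can run right after text extraction and warn you to apply OCR before any embeddings are created.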