# 🔭 AstroBot — RAG-Powered Educational AI System
> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A. It demonstrates:
>
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - A modular architecture enabling easy LLM or vector-DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via `git push`)
---
## Table of Contents
1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)
---
## Project Overview
AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
## Tech Stack
| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Very fast open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |
### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.
### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.
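These behavioural boundaries are enforced at the prompt level in `llm.py` (see Limitations & Guardrails below). A minimal sketch of what such a system prompt and message assembly could look like — the exact wording and the `build_messages` helper are illustrative, not the deployed prompt:

```python
# Illustrative guardrail prompt; wording is an assumption, not the deployed text.
SYSTEM_PROMPT = """You are AstroBot, an educational assistant for astrology students.

Rules:
1. Answer ONLY from the provided course-material context below.
2. If the context does not contain the answer, say you don't know.
3. Never make personal predictions of any kind.
4. Never interpret a specific person's birth chart.
5. Politely decline questions unrelated to astrology concepts.

Context:
{context}
"""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble the chat messages sent to the LLM for one query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": question},
    ]
```

Keeping the rules in the system message (rather than appended to the user turn) makes them harder for a user prompt to override.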
---
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ OFFLINE (once) β”‚
β”‚ β”‚
β”‚ Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HF SPACE (at startup) β”‚
β”‚ β”‚
β”‚ data_loader.py β”‚
β”‚ └── load_dataset() from HF Hub ──► list[Document] β”‚
β”‚ β”‚
β”‚ vector_store.py β”‚
β”‚ β”œβ”€β”€ RecursiveCharacterTextSplitter ──► Chunks β”‚
β”‚ β”œβ”€β”€ HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors β”‚
β”‚ └── FAISS.from_documents() ──► Index β”‚
β”‚ β”‚
β”‚ llm.py β”‚
β”‚ └── Groq(api_key) ──► Groq Client β”‚
β”‚ β”‚
β”‚ rag_pipeline.py β”‚
β”‚ └── RAGPipeline(index, groq_client) ──► Ready β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HF SPACE (per query) β”‚
β”‚ β”‚
β”‚ Student Question β”‚
β”‚ β”‚ β”‚
β”‚ β–Ό β”‚
β”‚ rag_pipeline.query() β”‚
β”‚ β”œβ”€β”€ vector_store.retrieve() ──► Top-K Chunks β”‚
β”‚ └── llm.generate_answer() ──► Grounded Answer β”‚
β”‚ β”‚
β”‚ app.py ──► Gradio UI ──► Student sees answer β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
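The per-query path in the diagram can be sketched as follows. `RAGPipeline` here accepts any retriever/generator pair; the stub lambdas in the usage example stand in for `vector_store.retrieve()` and `llm.generate_answer()`, and the exact names and signatures are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGResponse:
    answer: str
    sources: list

class RAGPipeline:
    """Glue layer: validate -> retrieve -> generate (illustrative sketch)."""

    def __init__(self, retrieve: Callable, generate: Callable, top_k: int = 5):
        self.retrieve = retrieve   # e.g. vector_store.retrieve(query, top_k)
        self.generate = generate   # e.g. llm.generate_answer(context, question)
        self.top_k = top_k

    def query(self, question: str) -> RAGResponse:
        question = question.strip()
        if not question:
            return RAGResponse("Please ask a question.", [])
        chunks = self.retrieve(question, self.top_k)          # Top-K chunks
        context = "\n\n".join(c["text"] for c in chunks)      # grounding context
        answer = self.generate(context, question)             # grounded answer
        return RAGResponse(answer, [c["source"] for c in chunks])
```

Usage with stubs:

```python
pipe = RAGPipeline(
    retrieve=lambda q, k: [{"text": "Mars rules Aries.", "source": "book.pdf"}],
    generate=lambda ctx, q: f"Based on the course material: {ctx}",
)
print(pipe.query("What does Mars rule?").answer)
```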
---
## File Structure
```
astrobot/
β”‚
β”œβ”€β”€ app.py # Gradio UI β€” entry point for HF Spaces
β”œβ”€β”€ config.py # All configuration (env vars, hyperparameters)
β”œβ”€β”€ data_loader.py # HF dataset fetching + Document creation
β”œβ”€β”€ vector_store.py # Chunking, embedding, FAISS index
β”œβ”€β”€ llm.py # Groq client + prompt engineering
β”œβ”€β”€ rag_pipeline.py # Orchestrates retrieval β†’ generation
β”‚
β”œβ”€β”€ convert_pdfs.py # Offline helper: PDFs β†’ HF Parquet dataset
β”œβ”€β”€ requirements.txt # Python dependencies
└── PROJECT.md # This file
```
---
## Module Responsibilities
| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings; change any parameter in one place. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query β†’ retrieve β†’ generate β†’ return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages β†’ push Parquet to HF Hub. |
This separation means:
- You can swap **FAISS β†’ Pinecone** by editing only `vector_store.py`.
- You can swap **Groq β†’ OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.
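One way to make those swap points explicit is with structural interfaces. This is purely illustrative — the actual modules may rely on plain duck typing rather than `Protocol` classes, and `FaissRetriever` is a hypothetical stand-in:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Retriever(Protocol):
    """What rag_pipeline.py needs from vector_store.py (any backend)."""
    def retrieve(self, query: str, top_k: int) -> list: ...

@runtime_checkable
class Generator(Protocol):
    """What rag_pipeline.py needs from llm.py (any provider)."""
    def generate_answer(self, context: str, question: str) -> str: ...

class FaissRetriever:
    """Stand-in for the FAISS-backed store; a Pinecone-backed class
    satisfying the same Protocol could replace it without touching
    rag_pipeline.py."""
    def retrieve(self, query: str, top_k: int) -> list:
        return []  # the real version would query the FAISS index
```

As long as a replacement backend exposes the same two methods, the rest of the pipeline never notices the swap.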
---
## Data Pipeline
### Step 1 β€” Prepare your PDFs (run locally)
Place your astrology textbook PDFs in a folder and run:
```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
--pdf_dir ./astrology_books \
--repo_id YOUR_USERNAME/astrology-course-materials \
--private # optional
```
This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
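The core of `convert_pdfs.py` might look like the sketch below. The column names (`source`, `page`, `text`) match the description above; the function names and overall structure are assumptions, and `pypdf`/`datasets` are imported inside the functions since they are only needed offline:

```python
from pathlib import Path

def pdf_to_records(pdf_dir: str) -> list:
    """Extract text page-by-page into {source, page, text} records."""
    from pypdf import PdfReader  # deferred import: offline-only dependency
    records = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        for page_num, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()
            if text:  # skip blank or image-only pages
                records.append({"source": pdf_path.name, "page": page_num, "text": text})
    return records

def records_to_columns(records: list) -> dict:
    """Reshape row records into the column dict Dataset.from_dict expects."""
    return {key: [r[key] for r in records] for key in ("source", "page", "text")}

def push(records: list, repo_id: str, private: bool = False) -> None:
    """Build the dataset and push it to the HF Hub as Parquet."""
    from datasets import Dataset  # deferred import
    Dataset.from_dict(records_to_columns(records)).push_to_hub(repo_id, private=private)
```

Scanned (image-only) pages yield no extractable text and are silently skipped here, which is why OCR is recommended first (see Troubleshooting).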
### Step 2 β€” Connect to the Space
Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).
### Step 3 β€” What happens at startup
```
load_dataset() # ~30s for large datasets
RecursiveCharacterTextSplitter # chunk_size=512, overlap=64
HuggingFaceEmbeddings # ~60s to encode all chunks
FAISS.from_documents() # <5s
```
The index is built once per Space restart and held in memory.
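That startup sequence can be sketched as below. The LangChain import paths assume the current split-package layout and may differ by version; the `approx_chunk_count` helper is an added illustration for reasoning about index size, not part of the real code:

```python
def build_index(documents):
    """Chunk, embed, and index documents (runs once per Space restart)."""
    # Deferred imports: these pull in heavy dependencies at startup only.
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(documents)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    return FAISS.from_documents(chunks, embeddings)  # in-memory index

def approx_chunk_count(total_chars: int, chunk_size: int = 512, overlap: int = 64) -> int:
    """Rough upper bound on chunk count: after the first chunk, each new
    chunk advances about (chunk_size - overlap) characters."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + -(-(total_chars - chunk_size) // step)  # ceiling division
```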
---
## Setup & Deployment
### 1. Create a Hugging Face Space
- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)
### 2. Upload files
Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```
### 3. Set secrets
Go to **Space β†’ Settings β†’ Repository secrets β†’ New secret**
| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) β†’ API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |
### 4. Done
The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).
---
## Environment Variables
All variables are read in `config.py`. You can also set them locally for development:
```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN="" # leave blank for public datasets
python app.py
```
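A minimal sketch of how `config.py` might read these variables — the real module may use module-level constants instead of a function, and the `TOP_K`/`CHUNK_SIZE` env overrides are assumptions (the doc only states they live in `config.py`):

```python
import os

def load_config() -> dict:
    """Read runtime settings from the environment, with safe defaults."""
    return {
        "groq_api_key": os.environ.get("GROQ_API_KEY"),   # required at runtime
        "hf_dataset": os.environ.get("HF_DATASET", ""),
        "hf_token": os.environ.get("HF_TOKEN") or None,   # blank -> public dataset
        "top_k": int(os.environ.get("TOP_K", "5")),
        "chunk_size": int(os.environ.get("CHUNK_SIZE", "512")),
    }
```

Treating a blank `HF_TOKEN` as `None` matters: passing an empty string as a token to the Hub client can behave differently from passing no token at all.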
---
## How to Add New Course Materials
1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** β€” it will re-index on next startup.
No code changes required.
---
## Limitations & Guardrails
| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |
---
## Troubleshooting
### Space fails to start
- Check **Logs** tab in the Space for Python errors.
- Verify the required secrets are set: `GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` (private datasets only).
### "GROQ_API_KEY is not set"
- Add the secret in Space β†’ Settings β†’ Repository secrets.
### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
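The column-detection logic in `data_loader.py` likely amounts to something like the following sketch; the candidate names beyond `text` and `content` are assumptions, and the real list lives in `text_column_candidates` in `config.py`:

```python
def find_text_column(columns: list,
                     candidates: tuple = ("text", "content", "body", "page_text")) -> str:
    """Return the first candidate column present in the dataset, case-insensitively."""
    lowered = {c.lower(): c for c in columns}
    for name in candidates:
        if name in lowered:
            return lowered[name]
    raise ValueError(
        f"No usable text column found among {columns}; "
        f"add your column name to text_column_candidates in config.py"
    )
```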
### Answers seem unrelated to the question
- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.
### Groq rate limit errors
- The free Groq tier is limited to roughly 14,400 tokens/minute. For a class with many students, consider upgrading or rate-limiting the UI.
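One simple way to rate-limit the UI is a sliding-window token budget checked before each Groq call. This is an illustrative sketch, not part of the current codebase; the 14,400 default mirrors the free-tier figure above:

```python
import time
from collections import deque
from typing import Optional

class TokenBudget:
    """Sliding-window limiter: allow at most `limit` tokens per `window` seconds."""

    def __init__(self, limit: int = 14_400, window: float = 60.0):
        self.limit, self.window = limit, window
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend and return True, or return False if over budget."""
        now = time.monotonic() if now is None else now
        # Drop spends that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        if sum(t for _, t in self.events) + tokens > self.limit:
            return False  # caller should queue or reject the request
        self.events.append((now, tokens))
        return True
```

In `app.py` this could gate the Gradio handler: estimate the prompt-plus-context token count, call `try_spend()`, and show a "please wait" message when it returns `False`.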
---