# 🔭 AstroBot — RAG-Powered Educational AI System
> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A. It demonstrates:
>
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - A modular architecture enabling easy LLM or vector-DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via `git push`)
---
## Table of Contents
1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)
---
## Project Overview
AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
## Tech Stack
| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Very fast open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |
### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.
### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.
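These behavioural boundaries are enforced at the prompt level in `llm.py` (see Limitations & Guardrails below). A minimal sketch of what such a system prompt and message assembly could look like — the exact wording and the `build_messages` helper are illustrative, not the deployed prompt:

```python
# Illustrative guardrail prompt; wording is an assumption, not the deployed text.
SYSTEM_PROMPT = """You are AstroBot, an educational assistant for astrology students.

Rules:
1. Answer ONLY from the provided course-material context below.
2. If the context does not contain the answer, say you don't know.
3. Never make personal predictions of any kind.
4. Never interpret a specific person's birth chart.
5. Politely decline questions unrelated to astrology concepts.

Context:
{context}
"""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble the chat messages sent to the LLM for one query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": question},
    ]
```

Keeping the rules in the system message (rather than appended to the user turn) makes them harder for a user prompt to override.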
---
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ OFFLINE (once) β”‚
β”‚ β”‚
β”‚ Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HF SPACE (at startup) β”‚
β”‚ β”‚
β”‚ data_loader.py β”‚
β”‚ └── load_dataset() from HF Hub ──► list[Document] β”‚
β”‚ β”‚
β”‚ vector_store.py β”‚
β”‚ β”œβ”€β”€ RecursiveCharacterTextSplitter ──► Chunks β”‚
β”‚ β”œβ”€β”€ HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors β”‚
β”‚ └── FAISS.from_documents() ──► Index β”‚
β”‚ β”‚
β”‚ llm.py β”‚
β”‚ └── Groq(api_key) ──► Groq Client β”‚
β”‚ β”‚
β”‚ rag_pipeline.py β”‚
β”‚ └── RAGPipeline(index, groq_client) ──► Ready β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ HF SPACE (per query) β”‚
β”‚ β”‚
β”‚ Student Question β”‚
β”‚ β”‚ β”‚
β”‚ β–Ό β”‚
β”‚ rag_pipeline.query() β”‚
β”‚ β”œβ”€β”€ vector_store.retrieve() ──► Top-K Chunks β”‚
β”‚ └── llm.generate_answer() ──► Grounded Answer β”‚
β”‚ β”‚
β”‚ app.py ──► Gradio UI ──► Student sees answer β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
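The per-query path in the diagram can be sketched as follows. `RAGPipeline` here accepts any retriever/generator pair; the stub lambdas in the usage example stand in for `vector_store.retrieve()` and `llm.generate_answer()`, and the exact names and signatures are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGResponse:
    answer: str
    sources: list

class RAGPipeline:
    """Glue layer: validate -> retrieve -> generate (illustrative sketch)."""

    def __init__(self, retrieve: Callable, generate: Callable, top_k: int = 5):
        self.retrieve = retrieve   # e.g. vector_store.retrieve(query, top_k)
        self.generate = generate   # e.g. llm.generate_answer(context, question)
        self.top_k = top_k

    def query(self, question: str) -> RAGResponse:
        question = question.strip()
        if not question:
            return RAGResponse("Please ask a question.", [])
        chunks = self.retrieve(question, self.top_k)          # Top-K chunks
        context = "\n\n".join(c["text"] for c in chunks)      # grounding context
        answer = self.generate(context, question)             # grounded answer
        return RAGResponse(answer, [c["source"] for c in chunks])
```

Usage with stubs:

```python
pipe = RAGPipeline(
    retrieve=lambda q, k: [{"text": "Mars rules Aries.", "source": "book.pdf"}],
    generate=lambda ctx, q: f"Based on the course material: {ctx}",
)
print(pipe.query("What does Mars rule?").answer)
```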
---
## File Structure
```
astrobot/
β”‚
β”œβ”€β”€ app.py # Gradio UI β€” entry point for HF Spaces
β”œβ”€β”€ config.py # All configuration (env vars, hyperparameters)
β”œβ”€β”€ data_loader.py # HF dataset fetching + Document creation
β”œβ”€β”€ vector_store.py # Chunking, embedding, FAISS index
β”œβ”€β”€ llm.py # Groq client + prompt engineering
β”œβ”€β”€ rag_pipeline.py # Orchestrates retrieval β†’ generation
β”‚
β”œβ”€β”€ convert_pdfs.py # Offline helper: PDFs β†’ HF Parquet dataset
β”œβ”€β”€ requirements.txt # Python dependencies
└── PROJECT.md # This file
```
---
## Module Responsibilities
| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings; change any parameter in one place. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query β†’ retrieve β†’ generate β†’ return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages β†’ push Parquet to HF Hub. |
This separation means:
- You can swap **FAISS β†’ Pinecone** by editing only `vector_store.py`.
- You can swap **Groq β†’ OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.
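One way to make those swap points explicit is with structural interfaces. This is purely illustrative — the actual modules may rely on plain duck typing rather than `Protocol` classes, and `FaissRetriever` is a hypothetical stand-in:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Retriever(Protocol):
    """What rag_pipeline.py needs from vector_store.py (any backend)."""
    def retrieve(self, query: str, top_k: int) -> list: ...

@runtime_checkable
class Generator(Protocol):
    """What rag_pipeline.py needs from llm.py (any provider)."""
    def generate_answer(self, context: str, question: str) -> str: ...

class FaissRetriever:
    """Stand-in for the FAISS-backed store; a Pinecone-backed class
    satisfying the same Protocol could replace it without touching
    rag_pipeline.py."""
    def retrieve(self, query: str, top_k: int) -> list:
        return []  # the real version would query the FAISS index
```

As long as a replacement backend exposes the same two methods, the rest of the pipeline never notices the swap.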
---
## Data Pipeline
### Step 1 β€” Prepare your PDFs (run locally)
Place your astrology textbook PDFs in a folder and run:
```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
--pdf_dir ./astrology_books \
--repo_id YOUR_USERNAME/astrology-course-materials \
--private # optional
```
This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
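The core of `convert_pdfs.py` might look like the sketch below. The column names (`source`, `page`, `text`) match the description above; the function names and overall structure are assumptions, and `pypdf`/`datasets` are imported inside the functions since they are only needed offline:

```python
from pathlib import Path

def pdf_to_records(pdf_dir: str) -> list:
    """Extract text page-by-page into {source, page, text} records."""
    from pypdf import PdfReader  # deferred import: offline-only dependency
    records = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        for page_num, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()
            if text:  # skip blank or image-only pages
                records.append({"source": pdf_path.name, "page": page_num, "text": text})
    return records

def records_to_columns(records: list) -> dict:
    """Reshape row records into the column dict Dataset.from_dict expects."""
    return {key: [r[key] for r in records] for key in ("source", "page", "text")}

def push(records: list, repo_id: str, private: bool = False) -> None:
    """Build the dataset and push it to the HF Hub as Parquet."""
    from datasets import Dataset  # deferred import
    Dataset.from_dict(records_to_columns(records)).push_to_hub(repo_id, private=private)
```

Scanned (image-only) pages yield no extractable text and are silently skipped here, which is why OCR is recommended first (see Troubleshooting).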
### Step 2 β€” Connect to the Space
Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).
### Step 3 β€” What happens at startup
```
load_dataset() # ~30s for large datasets
RecursiveCharacterTextSplitter # chunk_size=512, overlap=64
HuggingFaceEmbeddings # ~60s to encode all chunks
FAISS.from_documents() # <5s
```
The index is built once per Space restart and held in memory.
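That startup sequence can be sketched as below. The LangChain import paths assume the current split-package layout and may differ by version; the `approx_chunk_count` helper is an added illustration for reasoning about index size, not part of the real code:

```python
def build_index(documents):
    """Chunk, embed, and index documents (runs once per Space restart)."""
    # Deferred imports: these pull in heavy dependencies at startup only.
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(documents)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    return FAISS.from_documents(chunks, embeddings)  # in-memory index

def approx_chunk_count(total_chars: int, chunk_size: int = 512, overlap: int = 64) -> int:
    """Rough upper bound on chunk count: after the first chunk, each new
    chunk advances about (chunk_size - overlap) characters."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + -(-(total_chars - chunk_size) // step)  # ceiling division
```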
---
## Setup & Deployment
### 1. Create a Hugging Face Space
- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)
### 2. Upload files
Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```
### 3. Set secrets
Go to **Space β†’ Settings β†’ Repository secrets β†’ New secret**
| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) β†’ API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |
### 4. Done
The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).
---
## Environment Variables
All variables are read in `config.py`. You can also set them locally for development:
```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN="" # leave blank for public datasets
python app.py
```
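A minimal sketch of how `config.py` might read these variables — the real module may use module-level constants instead of a function, and the `TOP_K`/`CHUNK_SIZE` env overrides are assumptions (the doc only states they live in `config.py`):

```python
import os

def load_config() -> dict:
    """Read runtime settings from the environment, with safe defaults."""
    return {
        "groq_api_key": os.environ.get("GROQ_API_KEY"),   # required at runtime
        "hf_dataset": os.environ.get("HF_DATASET", ""),
        "hf_token": os.environ.get("HF_TOKEN") or None,   # blank -> public dataset
        "top_k": int(os.environ.get("TOP_K", "5")),
        "chunk_size": int(os.environ.get("CHUNK_SIZE", "512")),
    }
```

Treating a blank `HF_TOKEN` as `None` matters: passing an empty string as a token to the Hub client can behave differently from passing no token at all.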
---
## How to Add New Course Materials
1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** β€” it will re-index on next startup.
No code changes required.
---
## Limitations & Guardrails
| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |
---
## Troubleshooting
### Space fails to start
- Check **Logs** tab in the Space for Python errors.
- Verify the required secrets are set: `GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` (private datasets only).
### "GROQ_API_KEY is not set"
- Add the secret in Space β†’ Settings β†’ Repository secrets.
### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
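The column-detection logic in `data_loader.py` likely amounts to something like the following sketch; the candidate names beyond `text` and `content` are assumptions, and the real list lives in `text_column_candidates` in `config.py`:

```python
def find_text_column(columns: list,
                     candidates: tuple = ("text", "content", "body", "page_text")) -> str:
    """Return the first candidate column present in the dataset, case-insensitively."""
    lowered = {c.lower(): c for c in columns}
    for name in candidates:
        if name in lowered:
            return lowered[name]
    raise ValueError(
        f"No usable text column found among {columns}; "
        f"add your column name to text_column_candidates in config.py"
    )
```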
### Answers seem unrelated to the question
- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.
### Groq rate limit errors
- The free Groq tier is limited to roughly 14,400 tokens/minute. For a class with many students, consider upgrading or rate-limiting the UI.
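One simple way to rate-limit the UI is a sliding-window token budget checked before each Groq call. This is an illustrative sketch, not part of the current codebase; the 14,400 default mirrors the free-tier figure above:

```python
import time
from collections import deque
from typing import Optional

class TokenBudget:
    """Sliding-window limiter: allow at most `limit` tokens per `window` seconds."""

    def __init__(self, limit: int = 14_400, window: float = 60.0):
        self.limit, self.window = limit, window
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def try_spend(self, tokens: int, now: Optional[float] = None) -> bool:
        """Record the spend and return True, or return False if over budget."""
        now = time.monotonic() if now is None else now
        # Drop spends that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        if sum(t for _, t in self.events) + tokens > self.limit:
            return False  # caller should queue or reject the request
        self.events.append((now, tokens))
        return True
```

In `app.py` this could gate the Gradio handler: estimate the prompt-plus-context token count, call `try_spend()`, and show a "please wait" message when it returns `False`.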
---