# 🔭 AstroBot — RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---

## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**. It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
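The retrieval half of RAG can be sketched in a few lines: score the question's embedding against pre-embedded chunks and keep the closest matches. The 3-dimensional vectors below are illustrative stand-ins, not real embeddings — AstroBot itself uses MiniLM embeddings and a FAISS index for this step.

```python
# Toy sketch of semantic retrieval: rank chunks by cosine similarity
# to the query vector. All vectors here are made up for illustration.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend-embeddings for three course-material chunks.
chunks = {
    "Mars rules Aries and is associated with drive.": [0.9, 0.1, 0.0],
    "The 7th house concerns partnerships.": [0.1, 0.8, 0.2],
    "A trine is a harmonious 120-degree aspect.": [0.0, 0.2, 0.9],
}

query_vec = [0.85, 0.15, 0.05]  # pretend-embedding of "Tell me about Mars"

# Rank chunks by similarity and keep the best two (AstroBot keeps TOP_K = 5).
top = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)[:2]
# top[0] is the Mars chunk
```

FAISS performs this same ranking, just over thousands of chunks with an optimized index instead of a Python loop.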
## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Fastest open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |

### What it does

- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do

- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.
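These guardrails are enforced at the prompt level in `llm.py`. A hedged sketch of what that system prompt and message assembly could look like — the actual wording and helper names in `llm.py` may differ:

```python
# Hypothetical sketch of llm.py's guardrailed system prompt. The real
# prompt text is defined in llm.py; this only illustrates the pattern.
SYSTEM_PROMPT = (
    "You are AstroBot, an educational assistant for astrology students. "
    "Answer only from the supplied course-material context; if the context "
    "does not contain the answer, say so. "
    "Never make personal predictions of any kind. "
    "Never interpret an individual's birth chart; politely decline instead. "
    "Decline questions unrelated to astrology concepts."
)

def build_messages(question: str, context: str) -> list:
    """Assemble a chat-completions style message list (the shape Groq accepts)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Because the refusal rules live in one string, tightening or loosening the persona is a one-line edit in `llm.py`, exactly as the module table below describes.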
---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         OFFLINE (once)                          │
│                                                                 │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (at startup)                      │
│                                                                 │
│  data_loader.py                                                 │
│  └── load_dataset() from HF Hub ──► list[Document]              │
│                                                                 │
│  vector_store.py                                                │
│  ├── RecursiveCharacterTextSplitter ──► Chunks                  │
│  ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors              │
│  └── FAISS.from_documents() ──► Index                           │
│                                                                 │
│  llm.py                                                         │
│  └── Groq(api_key) ──► Groq Client                              │
│                                                                 │
│  rag_pipeline.py                                                │
│  └── RAGPipeline(index, groq_client) ──► Ready                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (per query)                       │
│                                                                 │
│  Student Question                                               │
│        │                                                        │
│        ▼                                                        │
│  rag_pipeline.query()                                           │
│  ├── vector_store.retrieve() ──► Top-K Chunks                   │
│  └── llm.generate_answer() ──► Grounded Answer                  │
│                                                                 │
│  app.py ──► Gradio UI ──► Student sees answer                   │
└─────────────────────────────────────────────────────────────────┘
```

---

## File Structure

```
astrobot/
│
├── app.py             # Gradio UI — entry point for HF Spaces
├── config.py          # All configuration (env vars, hyperparameters)
├── data_loader.py     # HF dataset fetching + Document creation
├── vector_store.py    # Chunking, embedding, FAISS index
├── llm.py             # Groq client + prompt engineering
├── rag_pipeline.py    # Orchestrates retrieval → generation
│
├── convert_pdfs.py    # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt   # Python dependencies
└── PROJECT.md         # This file
```

---

## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once here. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:

- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.

---

## Data Pipeline

### Step 1 — Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub

python convert_pdfs.py \
  --pdf_dir ./astrology_books \
  --repo_id YOUR_USERNAME/astrology-course-materials \
  --private   # optional
```

This will:

1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.

### Step 2 — Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).

### Step 3 — What happens at startup

```
load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s
```

The index is built once per Space restart and held in memory.

---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:

```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---

## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets

python app.py
```

---

## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** — it will re-index on next startup.

No code changes required.

---

## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---

## Troubleshooting

### Space fails to start

- Check the **Logs** tab in the Space for Python errors.
- Verify the secrets are set: `GROQ_API_KEY` and `HF_DATASET` (plus `HF_TOKEN` for private datasets).
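One way to make a missing secret obvious in the Logs tab is a fail-fast check near the top of `config.py`. The sketch below is an assumption about how such a check could look, not the module's actual code — `missing_secrets` is a hypothetical helper:

```python
# Hypothetical fail-fast check for config.py: report every missing
# required secret at once, instead of crashing on the first one.
import os

# HF_TOKEN is deliberately not listed: it is optional (private datasets only).
REQUIRED = ("GROQ_API_KEY", "HF_DATASET")

def missing_secrets(env=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Simulated environment with one secret missing:
problems = missing_secrets({"GROQ_API_KEY": "gsk_demo"})
# problems == ["HF_DATASET"]
```

Printing `problems` at startup turns a cryptic traceback into a one-line diagnosis in the Space logs.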
### "GROQ_API_KEY is not set"

- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"

- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.

### Answers seem unrelated to the question

- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors

- Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.

---
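A client-side guard for that limit can be sketched as a sliding one-minute token budget checked before each Groq call. This is purely illustrative — `TokenBudget` is not part of AstroBot or Groq's SDK, and the per-minute number should be checked against your actual plan:

```python
# Sketch of a sliding-window token budget: refuse a request when the
# tokens spent in the last 60 seconds would exceed the per-minute limit.
import time
from collections import deque

class TokenBudget:
    def __init__(self, tokens_per_minute=14_400):
        self.limit = tokens_per_minute
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def allow(self, tokens, now=None):
        """Record the spend and return True, or return False if over budget."""
        now = time.monotonic() if now is None else now
        # Drop spend that is older than one minute.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()
        if sum(t for _, t in self.events) + tokens > self.limit:
            return False  # caller should queue the request or show a wait message
        self.events.append((now, tokens))
        return True
```

In the UI layer, a `False` result would translate to a friendly "AstroBot is busy, try again in a minute" message instead of a raw API error.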