# 🔭 AstroBot — RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A. It demonstrates:
>
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)


---

## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.  
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.

## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Fastest open-model inference; free tier generous |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |

### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        OFFLINE (once)                           │
│                                                                 │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HF SPACE (at startup)                        │
│                                                                 │
│  data_loader.py                                                 │
│    └── load_dataset() from HF Hub ──► list[Document]            │
│                                                                 │
│  vector_store.py                                                │
│    ├── RecursiveCharacterTextSplitter ──► Chunks                │
│    ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors            │
│    └── FAISS.from_documents() ──► Index                         │
│                                                                 │
│  llm.py                                                         │
│    └── Groq(api_key) ──► Groq Client                            │
│                                                                 │
│  rag_pipeline.py                                                │
│    └── RAGPipeline(index, groq_client) ──► Ready                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HF SPACE (per query)                         │
│                                                                 │
│  Student Question                                               │
│       │                                                         │
│       ▼                                                         │
│  rag_pipeline.query()                                           │
│       ├── vector_store.retrieve()  ──► Top-K Chunks             │
│       └── llm.generate_answer()   ──► Grounded Answer           │
│                                                                 │
│  app.py  ──►  Gradio UI  ──►  Student sees answer               │
└─────────────────────────────────────────────────────────────────┘
```

---

## File Structure

```
astrobot/
│
├── app.py              # Gradio UI — entry point for HF Spaces
├── config.py           # All configuration (env vars, hyperparameters)
├── data_loader.py      # HF dataset fetching + Document creation
├── vector_store.py     # Chunking, embedding, FAISS index
├── llm.py              # Groq client + prompt engineering
├── rag_pipeline.py     # Orchestrates retrieval → generation
│
├── convert_pdfs.py     # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt    # Python dependencies
└── PROJECT.md          # This file
```

---

## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once here. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:
- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.


---

## Data Pipeline

### Step 1 β€” Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub

python convert_pdfs.py \
    --pdf_dir  ./astrology_books \
    --repo_id  YOUR_USERNAME/astrology-course-materials \
    --private          # optional
```

This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
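The three steps above can be sketched as follows. This is a hedged reconstruction, not the shipped `convert_pdfs.py`: `pages_to_rows` is a hypothetical helper name, and the pypdf/datasets imports are kept inside `main()` so the row-building logic stands alone.

```python
import argparse
from pathlib import Path

def pages_to_rows(source: str, page_texts: list[str]) -> list[dict]:
    """Turn one PDF's extracted pages into rows with source, page, text."""
    return [
        {"source": source, "page": i + 1, "text": text}
        for i, text in enumerate(page_texts)
        if text and text.strip()          # skip blank / image-only pages
    ]

def main() -> None:
    # Heavy third-party imports stay local to this entry point.
    from pypdf import PdfReader
    from datasets import Dataset

    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf_dir", required=True)
    parser.add_argument("--repo_id", required=True)
    parser.add_argument("--private", action="store_true")
    args = parser.parse_args()

    rows: list[dict] = []
    for pdf_path in sorted(Path(args.pdf_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        texts = [page.extract_text() or "" for page in reader.pages]
        rows.extend(pages_to_rows(pdf_path.name, texts))

    # Pushing a Dataset to the Hub stores it as Parquet automatically.
    Dataset.from_list(rows).push_to_hub(args.repo_id, private=args.private)

if __name__ == "__main__":
    main()
```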

### Step 2 β€” Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).

### Step 3 β€” What happens at startup

```
load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s
```

The index is built once per Space restart and held in memory.
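To make the `chunk_size=512, overlap=64` parameters concrete, here is a naive fixed-width version of what the splitter does. This is an illustration only: the real `RecursiveCharacterTextSplitter` additionally prefers to break on paragraph and sentence boundaries rather than mid-word.

```python
def sliding_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Naive chunking: each window shares `overlap` chars with the previous one."""
    step = chunk_size - overlap                      # 448 chars of new text per chunk
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]
```

A larger overlap means fewer facts are cut in half at chunk boundaries, at the cost of more chunks to embed and search.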

---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---

## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""          # leave blank for public datasets

python app.py
```

---

## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** — it will re-index on next startup.

No code changes required.

---

## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |
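Since the first three guardrails are enforced at the prompt level, `llm.py` presumably builds a system prompt along these lines. The wording below is illustrative, not the shipped prompt:

```python
def build_system_prompt(context: str) -> str:
    """Prompt-level guardrails: grounded answers only, no predictions, no charts."""
    return (
        "You are AstroBot, an educational assistant for astrology students.\n"
        "Answer ONLY using the course material below. If the answer is not "
        "in the material, say you don't know rather than guessing.\n"
        "Never make personal predictions and never interpret an individual "
        "birth chart; politely decline such requests.\n\n"
        f"Course material:\n{context}"
    )
```

Prompt-level enforcement is soft: a determined user may still elicit out-of-scope answers, which is why the retrieval step only ever supplies course material as context.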

---

## Troubleshooting

### Space fails to start
- Check **Logs** tab in the Space for Python errors.
- Verify the secrets are set: `GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` (the last only for private datasets).

### "GROQ_API_KEY is not set"
- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
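The column detection in `data_loader.py` presumably reduces to something like this sketch (`pick_text_column` is a hypothetical name; only the candidate list and the error message come from this document):

```python
def pick_text_column(columns: list[str], candidates=("text", "content")) -> str:
    """Return the first candidate column present in the dataset's columns."""
    for name in candidates:
        if name in columns:
            return name
    raise ValueError(
        f"No usable text column found in {columns}; "
        "add your column name to text_column_candidates in config.py"
    )
```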

### Answers seem unrelated to the question
- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors
- Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.
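One client-side option for a classroom setting is a token bucket in front of the Groq call, so the app refuses gracefully instead of surfacing HTTP 429 errors. A minimal sketch (the numbers are illustrative, and this is not part of the current codebase):

```python
import time

class TokenBucket:
    """Allow roughly `rate` tokens per `per` seconds, with bursts up to `rate`."""

    def __init__(self, rate: float, per: float = 60.0):
        self.capacity = rate
        self.tokens = rate
        self.refill_per_sec = rate / per
        self.last = time.monotonic()

    def try_consume(self, n: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

`app.py` could call `try_consume(estimated_tokens)` before each query and show a "please wait a moment" message when it returns `False`.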

---