Initial Commit
- PROJECT.md +277 -0
- README.md +21 -13
- app.py +221 -0
- config.py +46 -0
- convert_pdfs.py +46 -0
- data_loader.py +105 -0
- llm.py +119 -0
- rag_pipeline.py +135 -0
- requirements.txt +11 -0
- vector_store.py +103 -0
PROJECT.md
ADDED
@@ -0,0 +1,277 @@
# AstroBot: RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector-DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---

## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.

## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Low-latency open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |

### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.

---

## Architecture

```
┌───────────────────────────────────────────────────────────────────┐
│                           OFFLINE (once)                          │
│                                                                   │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)      │
└───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│                        HF SPACE (at startup)                      │
│                                                                   │
│  data_loader.py                                                   │
│  └── load_dataset() from HF Hub ──► list[Document]                │
│                                                                   │
│  vector_store.py                                                  │
│  ├── RecursiveCharacterTextSplitter ──► Chunks                    │
│  ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors                │
│  └── FAISS.from_documents() ──► Index                             │
│                                                                   │
│  llm.py                                                           │
│  └── Groq(api_key) ──► Groq Client                                │
│                                                                   │
│  rag_pipeline.py                                                  │
│  └── RAGPipeline(index, groq_client) ──► Ready                    │
└───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│                        HF SPACE (per query)                       │
│                                                                   │
│  Student Question                                                 │
│        │                                                          │
│        ▼                                                          │
│  rag_pipeline.query()                                             │
│  ├── vector_store.retrieve() ──► Top-K Chunks                     │
│  └── llm.generate_answer() ──► Grounded Answer                    │
│                                                                   │
│  app.py ──► Gradio UI ──► Student sees answer                     │
└───────────────────────────────────────────────────────────────────┘
```
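The per-query half of the diagram reduces to "score the question against every chunk, keep the best TOP_K". A minimal pure-Python sketch of that semantics follows — FAISS performs the same search with an optimized vector index, and the word-count "embedding" here is a toy stand-in for MiniLM, not the real model:

```python
# Toy version of the retrieve step: embed the question, score it against
# each chunk, keep the top_k best matches. The bag-of-words "embedding"
# and cosine scoring are illustrative assumptions, not the app's code.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    q = embed(question)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:top_k]

chunks = ["the twelve houses of a chart", "mars rules aries", "moon and emotions"]
print(retrieve("what are the houses in a chart", chunks, top_k=1))
```

The real pipeline then hands the retrieved chunks to the LLM as grounding context.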

---

## File Structure

```
astrobot/
│
├── app.py             # Gradio UI – entry point for HF Spaces
├── config.py          # All configuration (env vars, hyperparameters)
├── data_loader.py     # HF dataset fetching + Document creation
├── vector_store.py    # Chunking, embedding, FAISS index
├── llm.py             # Groq client + prompt engineering
├── rag_pipeline.py    # Orchestrates retrieval → generation
│
├── convert_pdfs.py    # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt   # Python dependencies
└── PROJECT.md         # This file
```

---

## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once here. |
| `data_loader.py` | Fetch data from HF Hub; detect the text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query the FAISS index. |
| `llm.py` | Validate the Groq key; build the system prompt; call the Groq API; return the answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:
- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.
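The swap stays local because the rest of the app only ever calls one generation function. The sketch below shows the shape of that boundary — the function name, prompt text, and `complete()` client interface are illustrative assumptions, not the actual contents of `llm.py`:

```python
# Hypothetical boundary function: swapping Groq for OpenAI means rewriting
# only the client call inside generate_answer(); callers are untouched.
SYSTEM_PROMPT = (
    "You are AstroBot, an astrology tutor. Answer ONLY from the provided "
    "context. Decline personal predictions and birth-chart readings."
)

def generate_answer(client, question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    return client.complete(messages)  # the only provider-specific line

class FakeClient:
    """Stand-in for a Groq/OpenAI client, used here to exercise the boundary."""
    def complete(self, messages):
        return f"answer based on {len(messages)} messages"

print(generate_answer(FakeClient(), "What is a house?", ["chunk A", "chunk B"]))
```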

---

## Data Pipeline

### Step 1 – Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
    --pdf_dir ./astrology_books \
    --repo_id YOUR_USERNAME/astrology-course-materials \
    --private   # optional
```

This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
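The row-building step above can be sketched as follows. The PDF extraction and the Hub push are stubbed out so the sketch is self-contained; with the real libraries you would extract via pypdf's `PdfReader` and publish via `datasets.Dataset.from_list(rows).push_to_hub(repo_id)`:

```python
# Sketch of the PDFs -> rows conversion: one row per non-empty page,
# with the `source`, `page`, `text` schema the dataset uses.
from pathlib import Path

def pages_from_pdf(path: Path) -> list[str]:
    # Stand-in for pypdf extraction (one string per page); real code would be
    # [p.extract_text() or "" for p in PdfReader(path).pages]
    return [f"dummy text of {path.name} page 1", ""]

def build_rows(pdf_paths: list[Path]) -> list[dict]:
    rows = []
    for path in pdf_paths:
        for page_no, text in enumerate(pages_from_pdf(path), start=1):
            if text.strip():  # skip blank / image-only pages
                rows.append({"source": path.name, "page": page_no, "text": text})
    return rows

rows = build_rows([Path("planets.pdf"), Path("houses.pdf")])
print(len(rows))        # one non-empty page per file -> 2 rows
print(sorted(rows[0]))  # ['page', 'source', 'text']
```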

### Step 2 – Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in the Space secrets (see below).

### Step 3 – What happens at startup

```
load_dataset()                    # ~30 s for large datasets
RecursiveCharacterTextSplitter    # chunk_size=512, overlap=64
HuggingFaceEmbeddings             # ~60 s to encode all chunks
FAISS.from_documents()            # <5 s
```

The index is built once per Space restart and held in memory.
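To see what `chunk_size=512, overlap=64` means concretely, here is a simplified character-window splitter. It approximates the windowing that `RecursiveCharacterTextSplitter` produces; the real LangChain splitter additionally prefers paragraph and sentence boundaries, which this sketch ignores:

```python
# Simplified chunking: fixed-size character windows with `overlap` characters
# shared between consecutive windows.
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # 448 with the defaults
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

page = "x" * 1000                 # one extracted 1000-character PDF page
chunks = chunk_text(page)
print(len(chunks))                # 3 windows: 0-512, 448-960, 896-1000
print(len(chunks[0]), len(chunks[-1]))
```

Smaller windows give finer-grained retrieval at the cost of more vectors to embed and search.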

---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---

## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets

python app.py
```

---

## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** – it will re-index on next startup.

No code changes required.

---

## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | The top 5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---

## Troubleshooting

### Space fails to start
- Check the **Logs** tab in the Space for Python errors.
- Verify the required secrets are set (`GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` for private datasets).

### "GROQ_API_KEY is not set"
- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.

### Answers seem unrelated to the question
- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors
- The free Groq tier allows roughly 14,400 tokens per minute. For a class of many students, consider upgrading or rate-limiting the UI.
README.md
CHANGED
@@ -1,13 +1,21 @@
----
-title:
-emoji:
-colorFrom:
-colorTo:
-sdk: gradio
-sdk_version: 6.6.0
-app_file: app.py
-pinned: false
-license: apache-2.0
----
-
-
+---
+title: Astrobot
+emoji: π
+colorFrom: pink
+colorTo: green
+sdk: gradio
+sdk_version: 6.6.0
+app_file: app.py
+pinned: false
+license: apache-2.0
+---
+# AstroBot
+
+**RAG-powered astrology tutor for students.**
+Ask about planets, houses, signs, aspects, transits, all grounded in your course materials.
+
+> Explains concepts only · No personal predictions · No chart readings
+
+See [PROJECT.md](PROJECT.md) for full documentation.
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py
ADDED
@@ -0,0 +1,221 @@
"""
app.py
──────
Gradio UI: the entry point for Hugging Face Spaces.

This module ONLY handles UI concerns:
- Layout and theming
- Wiring user inputs to the RAG pipeline
- Displaying answers and source citations
- Error handling / friendly messages

It delegates ALL logic to rag_pipeline.py.
"""

import logging
import sys
import gradio as gr

from config import cfg
from rag_pipeline import RAGPipeline, build_pipeline

# ── Gradio version guard ──────────────────────────────────────────────────────
# Detect which optional Chatbot kwargs are available in the installed version.
import inspect as _inspect
_chatbot_params = set(_inspect.signature(gr.Chatbot.__init__).parameters)
_SUPPORTS_COPY = "show_copy_button" in _chatbot_params
_SUPPORTS_BUBBLE = "bubble_full_width" in _chatbot_params

# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ── Pipeline (initialised once at startup) ────────────────────────────────────
pipeline: RAGPipeline | None = None
init_error: str | None = None

try:
    pipeline = build_pipeline()
except Exception as exc:
    init_error = str(exc)
    logger.exception("Pipeline initialisation failed: %s", exc)


# ── Chat handler ──────────────────────────────────────────────────────────────

def _msg(role: str, content: str) -> dict:
    """Return a Gradio-compatible message dict."""
    return {"role": role, "content": content}


def chat(user_message: str, history: list, show_sources: bool):
    """
    Called by Gradio on every user message.

    Parameters
    ----------
    user_message : str
    history : list
        Gradio chat history: list of {"role": ..., "content": ...} dicts.
    show_sources : bool
        Whether to append source citations below the answer.

    Returns
    -------
    tuple[str, list, list, str]
        (cleared input, updated history for display, updated history state,
        sources markdown)
    """
    if init_error:
        bot_reply = f"⚠️ **Setup error:** {init_error}\n\nPlease check your Space secrets and logs."
        history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
        return "", history, history, ""

    if not user_message.strip():
        return "", history, history, ""

    try:
        response = pipeline.query(user_message)  # type: ignore[union-attr]
        bot_reply = response.answer
        sources_md = response.format_sources() if show_sources else ""
    except Exception as exc:
        logger.exception("Error during query: %s", exc)
        bot_reply = "Something went wrong while consulting the stars. Please try again."
        sources_md = ""

    history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
    return "", history, history, sources_md


# ── Gradio UI ─────────────────────────────────────────────────────────────────

CSS = """
/* AstroBot custom styles */
body, .gradio-container { font-family: 'Georgia', serif; }
.title-banner { text-align: center; padding: 1rem 0 0.5rem; }
.title-banner h1 { font-size: 2rem; letter-spacing: 0.04em; }
.disclaimer {
  background: #1a1a2e; color: #a0aec0; border-radius: 8px;
  padding: 0.6rem 1rem; font-size: 0.82rem; margin-bottom: 0.5rem;
}
.sources-box { font-size: 0.82rem; color: #718096; }
footer { display: none !important; }
"""

EXAMPLE_QUESTIONS = [
    "What is the difference between the Sun sign and Rising sign?",
    "Explain what retrograde motion means for planets.",
    "What are the 12 houses in a birth chart?",
    "How do I interpret a conjunction aspect?",
    "What does it mean when Mars is in Aries?",
    "Explain the concept of planetary dignities and debilities.",
    "What is the difference between sidereal and tropical zodiac?",
    "How does the Moon sign influence emotions?",
]

# ── Gradio version-safe theme ─────────────────────────────────────────────────
_SUPPORTS_THEMES = hasattr(gr, "themes") and hasattr(gr.themes, "Base")
_theme = gr.themes.Base(
    primary_hue="indigo",
    secondary_hue="purple",
    neutral_hue="slate",
) if _SUPPORTS_THEMES else None

with gr.Blocks(
    title=cfg.app_title,
    theme=_theme,
    css=CSS,
) as demo:

    # ── Header ────────────────────────────────────────────────────────────────
    gr.HTML(
        """
        <div class="title-banner">
          <h1>AstroBot</h1>
          <p style="color:#9b8ec4; font-size:1.05rem;">
            Your AI Astrology Tutor · Powered by Groq LLaMA-3.1-8b-instant
          </p>
        </div>
        """
    )

    gr.HTML(
        """
        <div class="disclaimer">
          <strong>For students only.</strong>
          AstroBot explains astrological <em>concepts</em> drawn from your course materials.
          It does <strong>not</strong> make personal predictions or interpret individual birth charts.
        </div>
        """
    )

    # ── Main layout ───────────────────────────────────────────────────────────
    with gr.Row():
        with gr.Column(scale=3):
            _chatbot_kwargs = {"label": "AstroBot", "height": 500}
            if _SUPPORTS_BUBBLE:
                _chatbot_kwargs["bubble_full_width"] = False
            if _SUPPORTS_COPY:
                _chatbot_kwargs["show_copy_button"] = True
            if "type" in _chatbot_params:
                _chatbot_kwargs["type"] = "messages"  # role/content dict format
            chatbot = gr.Chatbot(**_chatbot_kwargs)
            with gr.Row():
                txt_input = gr.Textbox(
                    placeholder="Ask a concept question about astrology…",
                    show_label=False,
                    scale=9,
                )
                send_btn = gr.Button("Ask ✨", variant="primary", scale=1)

        with gr.Column(scale=1):
            gr.Markdown("### Options")
            _checkbox_kwargs = {
                "label": "Show source excerpts",
                "value": False,
            }
            _checkbox_params = set(_inspect.signature(gr.Checkbox.__init__).parameters)
            if "info" in _checkbox_params:
                _checkbox_kwargs["info"] = "Display the course material passages used to answer."
            show_sources = gr.Checkbox(**_checkbox_kwargs)
            gr.Markdown("### Example Questions")
            for q in EXAMPLE_QUESTIONS:
                gr.Button(q, size="sm").click(
                    fn=lambda x=q: x, outputs=txt_input
                )

    # ── Source citations panel ────────────────────────────────────────────────
    sources_display = gr.Markdown(
        value="",
        label="Source Excerpts",
        elem_classes=["sources-box"],
    )

    # ── State ─────────────────────────────────────────────────────────────────
    state = gr.State([])

    # ── Event wiring ──────────────────────────────────────────────────────────
    # Note: `state` is both an input and an output so the conversation history
    # actually persists across turns; without it each turn would start from [].
    send_btn.click(
        fn=chat,
        inputs=[txt_input, state, show_sources],
        outputs=[txt_input, chatbot, state, sources_display],
    )
    txt_input.submit(
        fn=chat,
        inputs=[txt_input, state, show_sources],
        outputs=[txt_input, chatbot, state, sources_display],
    )

    # ── Footer ────────────────────────────────────────────────────────────────
    gr.Markdown(
        "_Built with [Groq](https://groq.com) · [LangChain](https://langchain.com) · "
        "[Hugging Face](https://huggingface.co) for astrology students everywhere_"
    )


# ── Entry point ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
config.py
ADDED
@@ -0,0 +1,46 @@
"""
config.py
─────────
Central configuration for the AstroBot RAG application.
All tuneable parameters live here: change a value once and it applies everywhere.
"""

import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # ── Groq LLM ──────────────────────────────────────────────────────────────
    groq_api_key: str = field(default_factory=lambda: os.environ.get("GROQ_API_KEY", ""))
    groq_model: str = "llama-3.1-8b-instant"
    groq_temperature: float = 0.2
    groq_max_tokens: int = 1024

    # ── Hugging Face Dataset ──────────────────────────────────────────────────
    hf_dataset: str = field(default_factory=lambda: os.environ.get("HF_DATASET", ""))
    hf_token: str = field(default_factory=lambda: os.environ.get("HF_TOKEN", ""))
    dataset_split: str = "train"

    # Ordered list of candidate column names that hold the raw text
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text"
    ])

    # ── Embeddings & Retrieval ────────────────────────────────────────────────
    embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 5

    # ── App Meta ──────────────────────────────────────────────────────────────
    app_title: str = "AstroBot – Astrology Learning Assistant"
    app_description: str = (
        "Ask me anything about astrology concepts – planets, houses, aspects, "
        "signs, transits, chart reading, and more. "
        "**Note:** This bot explains concepts only; no personal predictions are made."
    )


# Singleton – import this everywhere
cfg = AppConfig()
convert_pdfs.py
ADDED
@@ -0,0 +1,46 @@
data_loader.py
ADDED
@@ -0,0 +1,105 @@
"""
data_loader.py
──────────────
Loads the Parquet-backed PDF dataset from Hugging Face Hub and returns
a list of LangChain Document objects ready for indexing.

Responsibilities:
- Connect to HF Hub (handles both public and private datasets)
- Auto-detect the text column
- Yield Document objects with rich metadata (source file, page number, etc.)
"""

import logging
from typing import Optional

import pandas as pd
from datasets import load_dataset
from langchain_core.documents import Document

from config import cfg

logger = logging.getLogger(__name__)


# ── Public API ────────────────────────────────────────────────────────────────

def load_documents() -> list[Document]:
    """
    Entry point: load the HF dataset and return indexing-ready Document objects.

    Returns
    -------
    list[Document]
        One Document per non-empty row, with metadata preserved.

    Raises
    ------
    ValueError
        If the dataset is not configured or no usable text column is found.
    """
    if not cfg.hf_dataset:
        raise ValueError(
            "HF_DATASET env var is not set. "
            "Set it to 'username/dataset-name' in your Space secrets."
        )

    df = _fetch_dataframe()
    text_col = _detect_text_column(df)
    documents = _build_documents(df, text_col)

    logger.info("Loaded %d documents from '%s' (column: '%s')",
                len(documents), cfg.hf_dataset, text_col)
    return documents


# ── Internal helpers ──────────────────────────────────────────────────────────

def _fetch_dataframe() -> pd.DataFrame:
    """Download the dataset split from HF Hub and return it as a DataFrame."""
    logger.info("Fetching dataset '%s' split='%s' …", cfg.hf_dataset, cfg.dataset_split)
    ds = load_dataset(
        cfg.hf_dataset,
        split=cfg.dataset_split,
        token=cfg.hf_token or None,
    )
    df = ds.to_pandas()
    logger.info("Dataset shape: %s | columns: %s", df.shape, df.columns.tolist())
    return df


def _detect_text_column(df: pd.DataFrame) -> str:
    """
    Find the first column whose lowercase name matches a known text-column
    name. Falls back to the first column if none match.
    """
    col_lower = {c.lower(): c for c in df.columns}
|
| 77 |
+
for candidate in cfg.text_column_candidates:
|
| 78 |
+
if candidate in col_lower:
|
| 79 |
+
return col_lower[candidate]
|
| 80 |
+
|
| 81 |
+
fallback = df.columns[0]
|
| 82 |
+
logger.warning(
|
| 83 |
+
"No known text column found. Falling back to '%s'. "
|
| 84 |
+
"Expected one of: %s",
|
| 85 |
+
fallback, cfg.text_column_candidates,
|
| 86 |
+
)
|
| 87 |
+
return fallback
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def _build_documents(df: pd.DataFrame, text_col: str) -> list[Document]:
|
| 91 |
+
"""Convert DataFrame rows into LangChain Document objects with metadata."""
|
| 92 |
+
meta_cols = [c for c in df.columns if c != text_col]
|
| 93 |
+
|
| 94 |
+
documents: list[Document] = []
|
| 95 |
+
for row_idx, row in df.iterrows():
|
| 96 |
+
text = str(row[text_col]).strip()
|
| 97 |
+
if not text or text.lower() == "nan":
|
| 98 |
+
continue # skip empty rows
|
| 99 |
+
|
| 100 |
+
metadata = {col: str(row.get(col, "")) for col in meta_cols}
|
| 101 |
+
metadata["source_row"] = int(row_idx) # type: ignore[arg-type]
|
| 102 |
+
|
| 103 |
+
documents.append(Document(page_content=text, metadata=metadata))
|
| 104 |
+
|
| 105 |
+
return documents
|
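The case-insensitive column matching in `_detect_text_column` can be exercised without pandas or the HF Hub; a stdlib-only sketch of the same logic over a plain list of column names (`CANDIDATES` stands in for `cfg.text_column_candidates`, whose real values are not shown in this diff):

```python
CANDIDATES = ["text", "content", "page_text"]  # assumed candidate names


def detect_text_column(columns: list[str]) -> str:
    """Return the first column matching a known name (case-insensitive),
    else fall back to the first column in the list."""
    col_lower = {c.lower(): c for c in columns}  # lowercase name -> original name
    for candidate in CANDIDATES:
        if candidate in col_lower:
            return col_lower[candidate]
    return columns[0]


print(detect_text_column(["page_num", "Text", "source"]))  # → Text
print(detect_text_column(["col_a", "col_b"]))              # → col_a
```

Note the function returns the original column name ("Text"), not the lowercased key, so later `df[text_col]` lookups still work.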
llm.py
ADDED
@@ -0,0 +1,119 @@

"""
llm.py
──────
Wraps the Groq API client and owns all prompt engineering for AstroBot.

Responsibilities:
- Validate the Groq API key at startup
- Build the system prompt (astrology tutor persona + no-prediction guardrail)
- Format retrieved context chunks into the prompt
- Call the Groq chat completion endpoint and return the answer string
"""

import logging

from groq import Groq
from langchain_core.documents import Document

from config import cfg

logger = logging.getLogger(__name__)

# ── System prompt ─────────────────────────────────────────────────────────────
# Defines the bot's persona, scope, and hard guardrails.

SYSTEM_TEMPLATE = """You are AstroBot, a patient and knowledgeable astrology tutor.
Your students are learning astrology concepts. Your role is to:
• Explain astrological concepts clearly and accurately using the provided context.
• Use analogies and examples to make complex ideas approachable.
• Reference classical and modern astrology where relevant.
• Encourage curiosity and deeper study.

HARD RULES – never break these:
1. Do NOT make personal predictions or interpret anyone's birth chart.
2. Do NOT speculate about future events for specific individuals.
3. If the context does not contain enough information to answer, say so honestly
   and suggest the student consult a textbook or senior practitioner.
4. Keep answers focused on educational content only.

--- CONTEXT FROM COURSE MATERIALS ---
{context}
--- END OF CONTEXT ---

Answer the student's question based solely on the context above.
If the answer isn't in the context, say: "I don't have that in my course materials right now –
let me point you to further study resources."
"""


# ── Public API ────────────────────────────────────────────────────────────────

def create_client() -> Groq:
    """
    Initialise and validate the Groq client.

    Raises
    ------
    ValueError
        If GROQ_API_KEY is missing.
    """
    if not cfg.groq_api_key:
        raise ValueError(
            "GROQ_API_KEY is not set. Add it in Space → Settings → Repository secrets."
        )
    logger.info("Groq client initialised (model: %s)", cfg.groq_model)
    return Groq(api_key=cfg.groq_api_key)


def generate_answer(client: Groq, query: str, context_docs: list[Document]) -> str:
    """
    Build the RAG prompt and call Groq to get an answer.

    Parameters
    ----------
    client : Groq
        Groq client returned by create_client().
    query : str
        The student's question.
    context_docs : list[Document]
        Retrieved chunks from the vector store.

    Returns
    -------
    str
        The model's answer string.
    """
    context_text = _format_context(context_docs)
    system_prompt = SYSTEM_TEMPLATE.format(context=context_text)

    logger.debug("Calling Groq | model=%s | context_chunks=%d", cfg.groq_model, len(context_docs))

    response = client.chat.completions.create(
        model=cfg.groq_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        temperature=cfg.groq_temperature,
        max_tokens=cfg.groq_max_tokens,
    )

    answer = response.choices[0].message.content or ""  # content may be None
    logger.debug("Groq response: %d chars", len(answer))
    return answer


# ── Internal helpers ──────────────────────────────────────────────────────────

def _format_context(docs: list[Document]) -> str:
    """
    Format retrieved documents into a numbered context block
    that is easy for the LLM to parse.
    """
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", doc.metadata.get("source_row", i))
        page = doc.metadata.get("page", "")
        header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)
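The numbered `[Source N | file | p.X]` context format built by `_format_context` can be checked with a stand-in `Document` (a plain dataclass here, so this sketch runs without LangChain installed):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:  # stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Number each chunk and prefix it with a [Source N | file | p.X] header."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", doc.metadata.get("source_row", i))
        page = doc.metadata.get("page", "")
        header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [Doc("Mars rules Aries.", {"source": "planets.pdf", "page": "12"}),
        Doc("Houses divide the chart.", {"source_row": 7})]
print(format_context(docs))
```

The second document shows the fallback path: with no "source" or "page" metadata, the header degrades gracefully to the `source_row` index.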
rag_pipeline.py
ADDED
@@ -0,0 +1,135 @@

"""
rag_pipeline.py
───────────────
Orchestrates the full RAG pipeline: query → retrieve → generate → answer.

This module is the single integration point between the vector store and
the LLM. The UI layer (app.py) calls only this module; it knows nothing
about FAISS or Groq directly.

Pipeline steps
──────────────
1. Validate query (non-empty, reasonable length)
2. Retrieve top-k relevant chunks from FAISS
3. Pass chunks + query to the LLM for grounded generation
4. Return the answer (and optionally the source snippets for transparency)
"""

import logging
from dataclasses import dataclass

from groq import Groq
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

import llm as llm_module
import vector_store as vs_module
from config import cfg

logger = logging.getLogger(__name__)

MAX_QUERY_LENGTH = 1000  # characters


# ── Data classes ──────────────────────────────────────────────────────────────

@dataclass
class RAGResponse:
    answer: str
    sources: list[Document]
    query: str

    def format_sources(self) -> str:
        """Return a compact source-citation string for display in the UI."""
        if not self.sources:
            return ""
        lines = []
        for i, doc in enumerate(self.sources, 1):
            src = doc.metadata.get("source", "")
            page = doc.metadata.get("page", "")
            snippet = doc.page_content[:120].replace("\n", " ") + "…"
            label = f"**[{i}]**"
            if src:
                label += f" {src}"
            if page:
                label += f" p.{page}"
            lines.append(f"{label}: _{snippet}_")
        return "\n".join(lines)


# ── Pipeline class ────────────────────────────────────────────────────────────

class RAGPipeline:
    """
    Stateful pipeline object. Instantiated once at app startup and reused
    for every student query throughout the session.
    """

    def __init__(self, index: FAISS, groq_client: Groq) -> None:
        self._index = index
        self._client = groq_client
        logger.info("RAGPipeline ready ✓")

    # ── Public ────────────────────────────────────────────────────────────

    def query(self, user_query: str) -> RAGResponse:
        """
        Run the full RAG pipeline for a single student question.

        Parameters
        ----------
        user_query : str
            Raw question text from the student.

        Returns
        -------
        RAGResponse
            Contains the answer string and the source Documents used.
        """
        validated = self._validate_query(user_query)
        if validated is None:
            return RAGResponse(
                answer="Please enter a valid question (non-empty, under 1000 characters).",
                sources=[],
                query=user_query,
            )

        logger.info("Processing query: '%s'", validated[:80])

        # Step 1 – Retrieve
        context_docs = vs_module.retrieve(self._index, validated, k=cfg.top_k)

        # Step 2 – Generate
        answer = llm_module.generate_answer(self._client, validated, context_docs)

        return RAGResponse(answer=answer, sources=context_docs, query=validated)

    # ── Internal ──────────────────────────────────────────────────────────

    @staticmethod
    def _validate_query(query: str) -> str | None:
        """Return the stripped query if valid, else None."""
        stripped = query.strip()
        if not stripped or len(stripped) > MAX_QUERY_LENGTH:
            return None
        return stripped


# ── Factory function ──────────────────────────────────────────────────────────

def build_pipeline() -> RAGPipeline:
    """
    Convenience factory: load data, build index, init LLM, return pipeline.
    Import and call this once from app.py.
    """
    from data_loader import load_documents  # local import avoids circular deps

    logger.info("=== Building AstroBot RAG Pipeline ===")

    docs = load_documents()
    index = vs_module.build_index(docs)
    client = llm_module.create_client()
    pipeline = RAGPipeline(index=index, groq_client=client)

    logger.info("=== AstroBot Pipeline Ready ✓ ===")
    return pipeline
requirements.txt
ADDED
@@ -0,0 +1,11 @@

groq>=0.5.0
gradio>=4.36.0
datasets>=2.19.0
pandas>=2.0.0
langchain>=0.2.0
langchain-core>=0.2.0
langchain-community>=0.2.0
faiss-cpu>=1.8.0
sentence-transformers>=3.0.0
huggingface-hub>=0.23.0
langchain-text-splitters>=0.2.0
vector_store.py
ADDED
@@ -0,0 +1,103 @@

"""
vector_store.py
───────────────
Handles text chunking, embedding, and FAISS vector store creation/querying.

Responsibilities:
- Split raw Documents into overlapping chunks
- Embed chunks using a local HuggingFace sentence-transformer
- Build and expose a FAISS index for similarity search
- Provide a clean retrieve() function used by the RAG pipeline
"""

import logging

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

from config import cfg

logger = logging.getLogger(__name__)


# ── Public API ────────────────────────────────────────────────────────────────

def build_index(documents: list[Document]) -> FAISS:
    """
    Chunk → embed → index the supplied documents.

    Parameters
    ----------
    documents : list[Document]
        Raw documents returned by data_loader.load_documents().

    Returns
    -------
    FAISS
        A ready-to-query FAISS vector store.
    """
    chunks = _chunk_documents(documents)
    embeddings = _load_embeddings()
    index = _create_faiss_index(chunks, embeddings)
    return index


def retrieve(index: FAISS, query: str, k: int | None = None) -> list[Document]:
    """
    Retrieve the top-k most relevant chunks for a given query.

    Parameters
    ----------
    index : FAISS
        The FAISS vector store built by build_index().
    query : str
        The user's natural-language question.
    k : int, optional
        Number of results to return. Defaults to cfg.top_k.

    Returns
    -------
    list[Document]
        Retrieved chunks, most relevant first.
    """
    k = k or cfg.top_k
    results = index.similarity_search(query, k=k)
    logger.debug("Retrieved %d chunks for query: '%s'", len(results), query[:80])
    return results


# ── Internal helpers ──────────────────────────────────────────────────────────

def _chunk_documents(documents: list[Document]) -> list[Document]:
    """Split documents into smaller overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cfg.chunk_size,
        chunk_overlap=cfg.chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    logger.info(
        "Chunking: %d raw docs → %d chunks (size=%d, overlap=%d)",
        len(documents), len(chunks), cfg.chunk_size, cfg.chunk_overlap,
    )
    return chunks


def _load_embeddings() -> HuggingFaceEmbeddings:
    """Load the local sentence-transformer embedding model (cached after first call)."""
    logger.info("Loading embedding model: %s", cfg.embed_model)
    return HuggingFaceEmbeddings(
        model_name=cfg.embed_model,
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": True},
    )


def _create_faiss_index(chunks: list[Document], embeddings: HuggingFaceEmbeddings) -> FAISS:
    """Embed all chunks and build the FAISS index."""
    logger.info("Building FAISS index over %d chunks …", len(chunks))
    index = FAISS.from_documents(chunks, embeddings)
    logger.info("FAISS index built ✓ (vectors: %d)", index.index.ntotal)
    return index
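The idea behind `chunk_overlap` – consecutive chunks share a margin of text so a sentence cut at a boundary still appears whole in one of them – can be illustrated with a stdlib-only sketch. This is a deliberate simplification: the real `RecursiveCharacterTextSplitter` also prefers to cut at paragraph, newline, and sentence separators rather than at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive sliding-window chunker: each chunk starts chunk_size - overlap
    characters after the previous one, so neighbours share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)  # → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Note how each chunk repeats the last three characters of its predecessor; that redundancy is what keeps boundary-spanning context retrievable.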