OnlyTheTruth03 committed
Commit 721ca73 · verified · 1 parent: 190236c

Initial Commit

Files changed (10)
  1. PROJECT.md +277 -0
  2. README.md +21 -13
  3. app.py +221 -0
  4. config.py +46 -0
  5. convert_pdfs.py +46 -0
  6. data_loader.py +105 -0
  7. llm.py +119 -0
  8. rag_pipeline.py +135 -0
  9. requirements.txt +11 -0
  10. vector_store.py +103 -0
PROJECT.md ADDED
@@ -0,0 +1,277 @@
# 🔭 AstroBot — RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---
## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---
## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Fast open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free CPU/GPU hosting; CI/CD via git push |

### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.

---
## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│  OFFLINE (once)                                              │
│                                                              │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet) │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  HF SPACE (at startup)                                       │
│                                                              │
│  data_loader.py                                              │
│   └── load_dataset() from HF Hub ──► list[Document]          │
│                                                              │
│  vector_store.py                                             │
│   ├── RecursiveCharacterTextSplitter ──► Chunks              │
│   ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors          │
│   └── FAISS.from_documents() ──► Index                       │
│                                                              │
│  llm.py                                                      │
│   └── Groq(api_key) ──► Groq Client                          │
│                                                              │
│  rag_pipeline.py                                             │
│   └── RAGPipeline(index, groq_client) ──► Ready              │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  HF SPACE (per query)                                        │
│                                                              │
│  Student Question                                            │
│        │                                                     │
│        ▼                                                     │
│  rag_pipeline.query()                                        │
│   ├── vector_store.retrieve() ──► Top-K Chunks               │
│   └── llm.generate_answer() ──► Grounded Answer              │
│                                                              │
│  app.py ──► Gradio UI ──► Student sees answer                │
└──────────────────────────────────────────────────────────────┘
```

---
## File Structure

```
astrobot/
│
├── app.py             # Gradio UI — entry point for HF Spaces
├── config.py          # All configuration (env vars, hyperparameters)
├── data_loader.py     # HF dataset fetching + Document creation
├── vector_store.py    # Chunking, embedding, FAISS index
├── llm.py             # Groq client + prompt engineering
├── rag_pipeline.py    # Orchestrates retrieval → generation
│
├── convert_pdfs.py    # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt   # Python dependencies
└── PROJECT.md         # This file
```

---
## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once, here. |
| `data_loader.py` | Fetch data from HF Hub; detect the text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query the FAISS index. |
| `llm.py` | Validate the Groq key; build the system prompt; call the Groq API; return the answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:
- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in `llm.py` alone.
- You can replace the **UI** without touching any backend logic.
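The seam that makes these swaps cheap can be sketched as a structural interface. This is illustrative only — the repo does not actually define a `Retriever` protocol; `rag_pipeline.py` simply calls into `vector_store.py` — but any backend exposing the same shape would plug in the same way:

```python
from typing import Protocol


class Retriever(Protocol):
    """Hypothetical seam: anything with this shape can back the pipeline."""
    def retrieve(self, query: str, k: int) -> list[str]: ...


class FaissRetriever:
    """Stand-in for vector_store.py's FAISS-backed retrieval."""
    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    def retrieve(self, query: str, k: int) -> list[str]:
        # The real code ranks chunks by embedding similarity; this sketch just slices.
        return self.chunks[:k]


def answer(retriever: Retriever, query: str) -> str:
    # The pipeline depends only on the seam, never on the concrete backend.
    context = retriever.retrieve(query, k=2)
    return f"answer grounded in {len(context)} chunks"


print(answer(FaissRetriever(["c1", "c2", "c3"]), "What is a house?"))
# → answer grounded in 2 chunks
```

Swapping in Pinecone (or any other store) then means writing one new class with a `retrieve` method, not touching the pipeline.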
---

## Data Pipeline

### Step 1 — Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
    --pdf_dir ./astrology_books \
    --repo_id YOUR_USERNAME/astrology-course-materials \
    --private   # optional
```

This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
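The row-building step can be sketched as follows (a minimal sketch assuming the `source`/`page`/`text` schema above; the real script extracts page text with pypdf and pushes with `datasets`, both stubbed out here):

```python
def build_rows(pages_by_pdf: dict[str, list[str]]) -> list[dict]:
    """Flatten extracted pages into one record per page, matching the dataset schema."""
    rows = []
    for source, pages in pages_by_pdf.items():
        for page_num, text in enumerate(pages, start=1):
            if text.strip():  # skip blank / unextractable pages
                rows.append({"source": source, "page": page_num, "text": text})
    return rows


# Stand-in for pypdf output: two PDFs with their per-page texts.
rows = build_rows({
    "planets.pdf": ["Mars rules Aries.", ""],          # second page is blank
    "houses.pdf": ["The first house is the Ascendant."],
})
print(len(rows))  # → 2 (the blank page is dropped)
```

`datasets.Dataset.from_list(rows)` would then give the Parquet-backed dataset that gets pushed to the Hub.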
### Step 2 — Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in the Space secrets (see below).

### Step 3 — What happens at startup

```
load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s
```

The index is built once per Space restart and held in memory.
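The chunking step behaves roughly like this sliding window (a pure-Python sketch of what the splitter does with `chunk_size=512`, `chunk_overlap=64`; the real `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as newlines):

```python
def chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size chars, each overlapping the previous by `overlap`."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


pieces = chunk("x" * 1000)
print([len(p) for p in pieces])  # → [512, 512, 104]
```

The 64-character overlap means a sentence cut at a chunk boundary is still fully present in the next chunk, which keeps retrieval from losing context at the seams.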
---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---
## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets

python app.py
```

---
## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** — it will re-index on the next startup.

No code changes required.

---
## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `top_k` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---
## Troubleshooting

### Space fails to start
- Check the **Logs** tab in the Space for Python errors.
- Verify the secrets are set (`GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` for private datasets).

### "GROQ_API_KEY is not set"
- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
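For example, if your dataset stores its text in a column called `page_text` (a hypothetical name), extend the candidate list in `config.py` like this (trimmed-down sketch of the real dataclass):

```python
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # Ordered candidate names; detection returns the first match (case-insensitive).
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text",
        "page_text",  # hypothetical custom column added here
    ])


cfg = AppConfig()
print("page_text" in cfg.text_column_candidates)  # → True
```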
### Answers seem unrelated to the question
- Increase `top_k` in `config.py` (try 7–10).
- Decrease `chunk_size` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors
- Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.
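As a rough capacity estimate (assumed numbers: the 14,400 tokens/minute figure above, and ~2,000 tokens per question once the system prompt, retrieved chunks, and the 1,024-token answer budget are counted):

```python
TOKENS_PER_MINUTE = 14_400   # free-tier budget quoted above
TOKENS_PER_QUERY = 2_000     # assumed: system prompt + top-5 chunks + answer

queries_per_minute = TOKENS_PER_MINUTE // TOKENS_PER_QUERY
print(queries_per_minute)  # → 7
```

So a whole classroom asking at once will hit the limit within the first minute — hence the suggestion to rate-limit the UI.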
---
README.md CHANGED
@@ -1,13 +1,21 @@
- ---
- title: DemoChatBot
- emoji: ⚡
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 6.6.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: Astrobot
+ emoji: 📈
+ colorFrom: pink
+ colorTo: green
+ sdk: gradio
+ sdk_version: 6.6.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+ # 🔭 AstroBot
+
+ **RAG-powered astrology tutor for students.**
+ Ask about planets, houses, signs, aspects, transits — grounded in your course materials.
+
+ > 📚 Explains concepts only · No personal predictions · No chart readings
+
+ See [PROJECT.md](PROJECT.md) for full documentation.
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,221 @@
"""
app.py
──────
Gradio UI — the entry point for Hugging Face Spaces.

This module ONLY handles UI concerns:
- Layout and theming
- Wiring user inputs to the RAG pipeline
- Displaying answers and source citations
- Error handling / friendly messages

It delegates ALL logic to rag_pipeline.py.
"""

import inspect as _inspect
import logging
import sys

import gradio as gr

from config import cfg
from rag_pipeline import RAGPipeline, build_pipeline

# ── Gradio version guard ──────────────────────────────────────────────────────
# Detect which optional Chatbot kwargs are available in the installed version.
_chatbot_params = set(_inspect.signature(gr.Chatbot.__init__).parameters)
_SUPPORTS_COPY = "show_copy_button" in _chatbot_params
_SUPPORTS_BUBBLE = "bubble_full_width" in _chatbot_params

# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ── Pipeline (initialised once at startup) ────────────────────────────────────
pipeline: RAGPipeline | None = None
init_error: str | None = None

try:
    pipeline = build_pipeline()
except Exception as exc:
    init_error = str(exc)
    logger.exception("Pipeline initialisation failed: %s", exc)


# ── Chat handler ──────────────────────────────────────────────────────────────

def _msg(role: str, content: str) -> dict:
    """Return a Gradio-compatible message dict."""
    return {"role": role, "content": content}


def chat(user_message: str, history: list, show_sources: bool):
    """
    Called by Gradio on every user message.

    Parameters
    ----------
    user_message : str
    history : list
        Gradio chat history — list of {"role": ..., "content": ...} dicts.
    show_sources : bool
        Whether to append source citations below the answer.

    Returns
    -------
    tuple[str, list, str]
        (cleared input, updated history, sources markdown)
    """
    if init_error:
        bot_reply = f"⚠️ **Setup error:** {init_error}\n\nPlease check your Space secrets and logs."
        history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
        return "", history, ""

    if not user_message.strip():
        return "", history, ""

    try:
        response = pipeline.query(user_message)  # type: ignore[union-attr]
        bot_reply = response.answer
        sources_md = response.format_sources() if show_sources else ""
    except Exception as exc:
        logger.exception("Error during query: %s", exc)
        bot_reply = "🔭 Something went wrong while consulting the stars. Please try again."
        sources_md = ""

    history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
    return "", history, sources_md

+ # ── Gradio UI ─────────────────────────────────────────────────────────────────
94
+
95
+ CSS = """
96
+ /* AstroBot custom styles */
97
+ body, .gradio-container { font-family: 'Georgia', serif; }
98
+ .title-banner { text-align: center; padding: 1rem 0 0.5rem; }
99
+ .title-banner h1 { font-size: 2rem; letter-spacing: 0.04em; }
100
+ .disclaimer {
101
+ background: #1a1a2e; color: #a0aec0; border-radius: 8px;
102
+ padding: 0.6rem 1rem; font-size: 0.82rem; margin-bottom: 0.5rem;
103
+ }
104
+ .sources-box { font-size: 0.82rem; color: #718096; }
105
+ footer { display: none !important; }
106
+ """
107
+
108
+ EXAMPLE_QUESTIONS = [
109
+ "What is the difference between the Sun sign and Rising sign?",
110
+ "Explain what retrograde motion means for planets.",
111
+ "What are the 12 houses in a birth chart?",
112
+ "How do I interpret a conjunction aspect?",
113
+ "What does it mean when Mars is in Aries?",
114
+ "Explain the concept of planetary dignities and debilities.",
115
+ "What is the difference between sidereal and tropical zodiac?",
116
+ "How does the Moon sign influence emotions?",
117
+ ]
118
+
119
+ # ── Gradio version-safe theme ─────────────────────────────────────────────────
120
+ _SUPPORTS_THEMES = hasattr(gr, "themes") and hasattr(gr.themes, "Base")
121
+ _theme = gr.themes.Base(
122
+ primary_hue="indigo",
123
+ secondary_hue="purple",
124
+ neutral_hue="slate",
125
+ ) if _SUPPORTS_THEMES else None
126
+
127
+ with gr.Blocks(
128
+ title=cfg.app_title,
129
+ theme=_theme,
130
+ css=CSS,
131
+ ) as demo:
132
+
133
+ # ── Header ────────────────────────────────────────────────────────────────
134
+ gr.HTML(
135
+ """
136
+ <div class="title-banner">
137
+ <h1>πŸ”­ AstroBot</h1>
138
+ <p style="color:#9b8ec4; font-size:1.05rem;">
139
+ Your AI Astrology Tutor Β· Powered by Groq LLaMA-3.1-8b-instant
140
+ </p>
141
+ </div>
142
+ """
143
+ )
144
+
145
+ gr.HTML(
146
+ """
147
+ <div class="disclaimer">
148
+ πŸ“š <strong>For students only.</strong>
149
+ AstroBot explains astrological <em>concepts</em> drawn from your course materials.
150
+ It does <strong>not</strong> make personal predictions or interpret individual birth charts.
151
+ </div>
152
+ """
153
+ )
154
+
155
+ # ── Main layout ───────────────────────────────────────────────────────────
156
+ with gr.Row():
157
+ with gr.Column(scale=3):
158
+ _chatbot_kwargs = {"label": "AstroBot", "height": 500}
159
+ if _SUPPORTS_BUBBLE:
160
+ _chatbot_kwargs["bubble_full_width"] = False
161
+ if _SUPPORTS_COPY:
162
+ _chatbot_kwargs["show_copy_button"] = True
163
+ if "type" in _chatbot_params:
164
+ _chatbot_kwargs["type"] = "messages" # role/content dict format
165
+ chatbot = gr.Chatbot(**_chatbot_kwargs)
166
+ with gr.Row():
167
+ txt_input = gr.Textbox(
168
+ placeholder="Ask a concept question about astrology…",
169
+ show_label=False,
170
+ scale=9,
171
+ )
172
+ send_btn = gr.Button("Ask ✨", variant="primary", scale=1)
173
+
174
+ with gr.Column(scale=1):
175
+ gr.Markdown("### βš™οΈ Options")
176
+ _checkbox_kwargs = {
177
+ "label": "Show source excerpts",
178
+ "value": False,
179
+ }
180
+ _checkbox_params = set(_inspect.signature(gr.Checkbox.__init__).parameters)
181
+ if "info" in _checkbox_params:
182
+ _checkbox_kwargs["info"] = "Display the course material passages used to answer."
183
+ show_sources = gr.Checkbox(**_checkbox_kwargs)
184
+ gr.Markdown("### πŸ’‘ Example Questions")
185
+ for q in EXAMPLE_QUESTIONS:
186
+ gr.Button(q, size="sm").click(
187
+ fn=lambda x=q: x, outputs=txt_input
188
+ )
189
+
190
+ # ── Source citations panel ────────────────────────────────────────────────
191
+ sources_display = gr.Markdown(
192
+ value="",
193
+ label="Source Excerpts",
194
+ elem_classes=["sources-box"],
195
+ )
196
+
197
+ # ── State ────────────────────────────────────────────────────────────────
198
+ state = gr.State([])
199
+
200
+ # ── Event wiring ──────────────────────────────────────────────────────────
201
+ send_btn.click(
202
+ fn=chat,
203
+ inputs=[txt_input, state, show_sources],
204
+ outputs=[txt_input, chatbot, sources_display],
205
+ )
206
+ txt_input.submit(
207
+ fn=chat,
208
+ inputs=[txt_input, state, show_sources],
209
+ outputs=[txt_input, chatbot, sources_display],
210
+ )
211
+
212
+ # ── Footer ────────────────────────────────────────────────────────────────
213
+ gr.Markdown(
214
+ "_Built with [Groq](https://groq.com) Β· [LangChain](https://langchain.com) Β· "
215
+ "[Hugging Face](https://huggingface.co) β€” for astrology students everywhere πŸŒ™_"
216
+ )
217
+
218
+
219
+ # ── Entry point ───────────────────────────────────────────────────────────────
220
+ if __name__ == "__main__":
221
+ demo.launch(server_name="0.0.0.0", server_port=7860)
config.py ADDED
@@ -0,0 +1,46 @@
"""
config.py
─────────
Central configuration for the AstroBot RAG application.
All tuneable parameters live here — change once, take effect everywhere.
"""

import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # ── Groq LLM ──────────────────────────────────────────────────────────────
    groq_api_key: str = field(default_factory=lambda: os.environ.get("GROQ_API_KEY", ""))
    groq_model: str = "llama-3.1-8b-instant"
    groq_temperature: float = 0.2
    groq_max_tokens: int = 1024

    # ── Hugging Face Dataset ──────────────────────────────────────────────────
    hf_dataset: str = field(default_factory=lambda: os.environ.get("HF_DATASET", ""))
    hf_token: str = field(default_factory=lambda: os.environ.get("HF_TOKEN", ""))
    dataset_split: str = "train"

    # Ordered list of candidate column names that hold the raw text
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text"
    ])

    # ── Embeddings & Retrieval ────────────────────────────────────────────────
    embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 5

    # ── App Meta ──────────────────────────────────────────────────────────────
    app_title: str = "🔭 AstroBot — Astrology Learning Assistant"
    app_description: str = (
        "Ask me anything about astrology concepts — planets, houses, aspects, "
        "signs, transits, chart reading, and more. "
        "**Note:** This bot explains concepts only; no personal predictions are made."
    )


# Singleton — import this everywhere
cfg = AppConfig()
convert_pdfs.py ADDED
@@ -0,0 +1,46 @@
"""
config.py
─────────
Central configuration for the AstroBot RAG application.
All tuneable parameters live here — change once, take effect everywhere.
"""

import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # ── Groq LLM ──────────────────────────────────────────────────────────────
    groq_api_key: str = field(default_factory=lambda: os.environ.get("GROQ_API_KEY", ""))
    groq_model: str = "llama-3.1-8b-instant"
    groq_temperature: float = 0.2
    groq_max_tokens: int = 1024

    # ── Hugging Face Dataset ──────────────────────────────────────────────────
    hf_dataset: str = field(default_factory=lambda: os.environ.get("HF_DATASET", ""))
    hf_token: str = field(default_factory=lambda: os.environ.get("HF_TOKEN", ""))
    dataset_split: str = "train"

    # Ordered list of candidate column names that hold the raw text
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text"
    ])

    # ── Embeddings & Retrieval ────────────────────────────────────────────────
    embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 5

    # ── App Meta ──────────────────────────────────────────────────────────────
    app_title: str = "🔭 AstroBot — Astrology Learning Assistant"
    app_description: str = (
        "Ask me anything about astrology concepts — planets, houses, aspects, "
        "signs, transits, chart reading, and more. "
        "**Note:** This bot explains concepts only; no personal predictions are made."
    )


# Singleton — import this everywhere
cfg = AppConfig()
data_loader.py ADDED
@@ -0,0 +1,105 @@
"""
data_loader.py
──────────────
Loads the Parquet-backed PDF dataset from Hugging Face Hub and returns
a list of LangChain Document objects ready for indexing.

Responsibilities:
- Connect to HF Hub (handles both public and private datasets)
- Auto-detect the text column
- Yield Document objects with rich metadata (source file, page number, etc.)
"""

import logging

import pandas as pd
from datasets import load_dataset
from langchain_core.documents import Document

from config import cfg

logger = logging.getLogger(__name__)


# ── Public API ────────────────────────────────────────────────────────────────

def load_documents() -> list[Document]:
    """
    Entry point: load the HF dataset and return chunk-ready Document objects.

    Returns
    -------
    list[Document]
        One Document per non-empty row, with metadata preserved.

    Raises
    ------
    ValueError
        If the dataset is not configured or no usable text column is found.
    """
    if not cfg.hf_dataset:
        raise ValueError(
            "HF_DATASET env var is not set. "
            "Set it to 'username/dataset-name' in your Space secrets."
        )

    df = _fetch_dataframe()
    text_col = _detect_text_column(df)
    documents = _build_documents(df, text_col)

    logger.info("Loaded %d documents from '%s' (column: '%s')",
                len(documents), cfg.hf_dataset, text_col)
    return documents


# ── Internal helpers ──────────────────────────────────────────────────────────

def _fetch_dataframe() -> pd.DataFrame:
    """Download the dataset split from HF Hub and return it as a DataFrame."""
    logger.info("Fetching dataset '%s' split='%s' …", cfg.hf_dataset, cfg.dataset_split)
    ds = load_dataset(
        cfg.hf_dataset,
        split=cfg.dataset_split,
        token=cfg.hf_token or None,
    )
    df = ds.to_pandas()
    logger.info("Dataset shape: %s | columns: %s", df.shape, df.columns.tolist())
    return df


def _detect_text_column(df: pd.DataFrame) -> str:
    """
    Find the first column whose lowercase name matches a known text-column
    name. Falls back to the first column if none match.
    """
    col_lower = {c.lower(): c for c in df.columns}
    for candidate in cfg.text_column_candidates:
        if candidate in col_lower:
            return col_lower[candidate]

    fallback = df.columns[0]
    logger.warning(
        "No known text column found. Falling back to '%s'. "
        "Expected one of: %s",
        fallback, cfg.text_column_candidates,
    )
    return fallback


def _build_documents(df: pd.DataFrame, text_col: str) -> list[Document]:
    """Convert DataFrame rows into LangChain Document objects with metadata."""
    meta_cols = [c for c in df.columns if c != text_col]

    documents: list[Document] = []
    for row_idx, row in df.iterrows():
        text = str(row[text_col]).strip()
        if not text or text.lower() == "nan":
            continue  # skip empty rows

        metadata = {col: str(row.get(col, "")) for col in meta_cols}
        metadata["source_row"] = int(row_idx)  # type: ignore[arg-type]

        documents.append(Document(page_content=text, metadata=metadata))

    return documents
llm.py ADDED
@@ -0,0 +1,119 @@
1
+ """
2
+ llm.py
3
+ ──────
4
+ Wraps the Groq API client and owns all prompt engineering for AstroBot.
5
+
6
+ Responsibilities:
7
+ - Validate the Groq API key at startup
8
+ - Build the system prompt (astrology tutor persona + no-prediction guardrail)
9
+ - Format retrieved context chunks into the prompt
10
+ - Call the Groq chat completion endpoint and return the answer string
11
+ """
12
+
13
+ import logging
14
+
15
+ from groq import Groq
16
+ from langchain_core.documents import Document
17
+
18
+ from config import cfg
19
+
20
+ logger = logging.getLogger(__name__)
+
+ # ── System prompt ─────────────────────────────────────────────────────────────
+ # Defines the bot's persona, scope, and hard guardrails.
+
+ SYSTEM_TEMPLATE = """You are AstroBot, a patient and knowledgeable astrology tutor.
+ Your students are learning astrology concepts. Your role is to:
+ β€’ Explain astrological concepts clearly and accurately using the provided context.
+ β€’ Use analogies and examples to make complex ideas approachable.
+ β€’ Reference classical and modern astrology where relevant.
+ β€’ Encourage curiosity and deeper study.
+
+ HARD RULES β€” never break these:
+ 1. Do NOT make personal predictions or interpret anyone's birth chart.
+ 2. Do NOT speculate about future events for specific individuals.
+ 3. If the context does not contain enough information to answer, say so honestly
+    and suggest the student consult a textbook or senior practitioner.
+ 4. Keep answers focused on educational content only.
+
+ --- CONTEXT FROM COURSE MATERIALS ---
+ {context}
+ --- END OF CONTEXT ---
+
+ Answer the student's question based solely on the context above.
+ If the answer isn't in the context, say: "I don't have that in my course materials right now β€”
+ let me point you to further study resources."
+ """
+
+
+ # ── Public API ────────────────────────────────────────────────────────────────
+
+ def create_client() -> Groq:
+     """
+     Initialise and validate the Groq client.
+
+     Raises
+     ------
+     ValueError
+         If GROQ_API_KEY is missing.
+     """
+     if not cfg.groq_api_key:
+         raise ValueError(
+             "GROQ_API_KEY is not set. Add it in Space β†’ Settings β†’ Repository secrets."
+         )
+     logger.info("Groq client initialised (model: %s)", cfg.groq_model)
+     return Groq(api_key=cfg.groq_api_key)
+
+
+ def generate_answer(client: Groq, query: str, context_docs: list[Document]) -> str:
+     """
+     Build the RAG prompt and call Groq to get an answer.
+
+     Parameters
+     ----------
+     client : Groq
+         Groq client returned by create_client().
+     query : str
+         The student's question.
+     context_docs : list[Document]
+         Retrieved chunks from the vector store.
+
+     Returns
+     -------
+     str
+         The model's answer string.
+     """
+     context_text = _format_context(context_docs)
+     system_prompt = SYSTEM_TEMPLATE.format(context=context_text)
+
+     logger.debug("Calling Groq | model=%s | context_chunks=%d", cfg.groq_model, len(context_docs))
+
+     response = client.chat.completions.create(
+         model=cfg.groq_model,
+         messages=[
+             {"role": "system", "content": system_prompt},
+             {"role": "user", "content": query},
+         ],
+         temperature=cfg.groq_temperature,
+         max_tokens=cfg.groq_max_tokens,
+     )
+
+     answer = response.choices[0].message.content
+     logger.debug("Groq response: %d chars", len(answer))
+     return answer
+
+
+ # ── Internal helpers ──────────────────────────────────────────────────────────
+
+ def _format_context(docs: list[Document]) -> str:
+     """
+     Format retrieved documents into a numbered context block
+     that is easy for the LLM to parse.
+     """
+     blocks = []
+     for i, doc in enumerate(docs, 1):
+         # Fall back to the Parquet row reference; empty string keeps the
+         # `if source` guard meaningful (the loop index is always truthy).
+         source = doc.metadata.get("source", doc.metadata.get("source_row", ""))
+         page = doc.metadata.get("page", "")
+         header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
+         blocks.append(f"{header}\n{doc.page_content}")
+     return "\n\n".join(blocks)
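A quick way to sanity-check `_format_context` is to run its logic on stub documents. The sketch below reproduces the formatting step with a minimal `Doc` stand-in for LangChain's `Document`, so it runs with no dependencies; the filename and metadata values are made up for illustration.

```python
# Minimal stand-in for langchain_core.documents.Document, for illustration only.
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Mirror of llm.py's _format_context: numbered, source-labelled blocks."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "")
        page = doc.metadata.get("page", "")
        header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [Doc("Mars rules Aries.", {"source": "planets.pdf", "page": 3})]
print(format_context(docs))
# [Source 1 | planets.pdf | p.3]
# Mars rules Aries.
```

The bracketed headers give the model stable labels it can cite, which is what makes the `format_sources` display in `rag_pipeline.py` line up with the answer text.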
rag_pipeline.py ADDED
@@ -0,0 +1,135 @@
+ """
+ rag_pipeline.py
+ ───────────────
+ Orchestrates the full RAG pipeline: query β†’ retrieve β†’ generate β†’ answer.
+
+ This module is the single integration point between the vector store and
+ the LLM. The UI layer (app.py) calls only this module; it knows nothing
+ about FAISS or Groq directly.
+
+ Pipeline steps
+ ──────────────
+ 1. Validate query (non-empty, reasonable length)
+ 2. Retrieve top-k relevant chunks from FAISS
+ 3. Pass chunks + query to the LLM for grounded generation
+ 4. Return the answer (and optionally the source snippets for transparency)
+ """
+
+ import logging
+ from dataclasses import dataclass
+
+ from groq import Groq
+ from langchain_community.vectorstores import FAISS
+ from langchain_core.documents import Document
+
+ import llm as llm_module
+ import vector_store as vs_module
+ from config import cfg
+
+ logger = logging.getLogger(__name__)
+
+ MAX_QUERY_LENGTH = 1000  # characters
+
+
+ # ── Data classes ──────────────────────────────────────────────────────────────
+
+ @dataclass
+ class RAGResponse:
+     answer: str
+     sources: list[Document]
+     query: str
+
+     def format_sources(self) -> str:
+         """Return a compact source-citation string for display in the UI."""
+         if not self.sources:
+             return ""
+         lines = []
+         for i, doc in enumerate(self.sources, 1):
+             src = doc.metadata.get("source", "")
+             page = doc.metadata.get("page", "")
+             snippet = doc.page_content[:120].replace("\n", " ")
+             if len(doc.page_content) > 120:
+                 snippet += "…"
+             label = f"**[{i}]**"
+             if src:
+                 label += f" {src}"
+             if page:
+                 label += f" p.{page}"
+             lines.append(f"{label}: _{snippet}_")
+         return "\n".join(lines)
+
+
+ # ── Pipeline class ────────────────────────────────────────────────────────────
+
+ class RAGPipeline:
+     """
+     Stateful pipeline object. Instantiated once at app startup and reused
+     for every student query throughout the session.
+     """
+
+     def __init__(self, index: FAISS, groq_client: Groq) -> None:
+         self._index = index
+         self._client = groq_client
+         logger.info("RAGPipeline ready βœ“")
+
+     # ── Public ────────────────────────────────────────────────────────────────
+
+     def query(self, user_query: str) -> RAGResponse:
+         """
+         Run the full RAG pipeline for a single student question.
+
+         Parameters
+         ----------
+         user_query : str
+             Raw question text from the student.
+
+         Returns
+         -------
+         RAGResponse
+             Contains the answer string and the source Documents used.
+         """
+         validated = self._validate_query(user_query)
+         if validated is None:
+             return RAGResponse(
+                 answer=f"Please enter a valid question (non-empty, under {MAX_QUERY_LENGTH} characters).",
+                 sources=[],
+                 query=user_query,
+             )
+
+         logger.info("Processing query: '%s'", validated[:80])
+
+         # Step 1 β€” Retrieve
+         context_docs = vs_module.retrieve(self._index, validated, k=cfg.top_k)
+
+         # Step 2 β€” Generate
+         answer = llm_module.generate_answer(self._client, validated, context_docs)
+
+         return RAGResponse(answer=answer, sources=context_docs, query=validated)
+
+     # ── Internal ──────────────────────────────────────────────────────────────
+
+     @staticmethod
+     def _validate_query(query: str) -> str | None:
+         """Return the stripped query if valid, else None."""
+         stripped = query.strip()
+         if not stripped or len(stripped) > MAX_QUERY_LENGTH:
+             return None
+         return stripped
+
+
+ # ── Factory function ─────────────────────────────────────────────────────────
+
+ def build_pipeline() -> RAGPipeline:
+     """
+     Convenience factory: load data, build index, init LLM, return pipeline.
+     Import and call this once from app.py.
+     """
+     from data_loader import load_documents  # local import avoids circular deps
+
+     logger.info("=== Building AstroBot RAG Pipeline ===")
+
+     docs = load_documents()
+     index = vs_module.build_index(docs)
+     client = llm_module.create_client()
+     pipeline = RAGPipeline(index=index, groq_client=client)
+
+     logger.info("=== AstroBot Pipeline Ready βœ“ ===")
+     return pipeline
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ groq>=0.5.0
+ gradio>=4.36.0
+ datasets>=2.19.0
+ pandas>=2.0.0
+ langchain>=0.2.0
+ langchain-core>=0.2.0
+ langchain-community>=0.2.0
+ faiss-cpu>=1.8.0
+ sentence-transformers>=3.0.0
+ huggingface-hub>=0.23.0
+ langchain-text-splitters>=0.2.0
vector_store.py ADDED
@@ -0,0 +1,103 @@
+ """
+ vector_store.py
+ ───────────────
+ Handles text chunking, embedding, and FAISS vector store creation/querying.
+
+ Responsibilities:
+ - Split raw Documents into overlapping chunks
+ - Embed chunks using a local HuggingFace sentence-transformer
+ - Build and expose a FAISS index for similarity search
+ - Provide a clean retrieve() function used by the RAG pipeline
+ """
+
+ import logging
+
+ from langchain_core.documents import Document
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ from config import cfg
+
+ logger = logging.getLogger(__name__)
+
+
+ # ── Public API ────────────────────────────────────────────────────────────────
+
+ def build_index(documents: list[Document]) -> FAISS:
+     """
+     Chunk β†’ embed β†’ index the supplied documents.
+
+     Parameters
+     ----------
+     documents : list[Document]
+         Raw documents returned by data_loader.load_documents().
+
+     Returns
+     -------
+     FAISS
+         A ready-to-query FAISS vector store.
+     """
+     chunks = _chunk_documents(documents)
+     embeddings = _load_embeddings()
+     index = _create_faiss_index(chunks, embeddings)
+     return index
+
+
+ def retrieve(index: FAISS, query: str, k: int | None = None) -> list[Document]:
+     """
+     Retrieve the top-k most relevant chunks for a given query.
+
+     Parameters
+     ----------
+     index : FAISS
+         The FAISS vector store built by build_index().
+     query : str
+         The user's natural-language question.
+     k : int, optional
+         Number of results to return. Defaults to cfg.top_k.
+
+     Returns
+     -------
+     list[Document]
+         Retrieved chunks, most relevant first.
+     """
+     k = k or cfg.top_k
+     results = index.similarity_search(query, k=k)
+     logger.debug("Retrieved %d chunks for query: '%s'", len(results), query[:80])
+     return results
+
+
+ # ── Internal helpers ──────────────────────────────────────────────────────────
+
+ def _chunk_documents(documents: list[Document]) -> list[Document]:
+     """Split documents into smaller overlapping chunks."""
+     splitter = RecursiveCharacterTextSplitter(
+         chunk_size=cfg.chunk_size,
+         chunk_overlap=cfg.chunk_overlap,
+         separators=["\n\n", "\n", ". ", " ", ""],
+     )
+     chunks = splitter.split_documents(documents)
+     logger.info(
+         "Chunking: %d raw docs β†’ %d chunks (size=%d, overlap=%d)",
+         len(documents), len(chunks), cfg.chunk_size, cfg.chunk_overlap,
+     )
+     return chunks
+
+
+ def _load_embeddings() -> HuggingFaceEmbeddings:
+     """Load the local sentence-transformer embedding model."""
+     logger.info("Loading embedding model: %s", cfg.embed_model)
+     return HuggingFaceEmbeddings(
+         model_name=cfg.embed_model,
+         model_kwargs={"device": "cpu"},
+         encode_kwargs={"normalize_embeddings": True},
+     )
+
+
+ def _create_faiss_index(chunks: list[Document], embeddings: HuggingFaceEmbeddings) -> FAISS:
+     """Embed all chunks and build the FAISS index."""
+     logger.info("Building FAISS index over %d chunks …", len(chunks))
+     index = FAISS.from_documents(chunks, embeddings)
+     logger.info("FAISS index built βœ“ (vectors: %d)", index.index.ntotal)
+     return index
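The chunk-size/overlap interplay in `_chunk_documents` can be pictured with a toy sliding window. This is not what `RecursiveCharacterTextSplitter` actually does (it splits on the separator hierarchy first and only falls back to character cuts), but it shows why neighbouring chunks share `chunk_overlap` characters of context at the seams.

```python
# Toy sliding-window chunker: fixed-size windows advancing by size - overlap.
# Illustrative only; the real splitter is separator-aware, not a plain window.
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap  # how far each window advances
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = sliding_chunks("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap is what lets a retrieved chunk carry the tail of the previous sentence, so an answer that straddles a chunk boundary is still recoverable from a single hit.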