OnlyTheTruth03 committed
Commit 721ca73 · verified · 1 parent: 190236c

Initial Commit

Files changed (10)
  1. PROJECT.md +277 -0
  2. README.md +21 -13
  3. app.py +221 -0
  4. config.py +46 -0
  5. convert_pdfs.py +46 -0
  6. data_loader.py +105 -0
  7. llm.py +119 -0
  8. rag_pipeline.py +135 -0
  9. requirements.txt +11 -0
  10. vector_store.py +103 -0
PROJECT.md ADDED
@@ -0,0 +1,277 @@
# 🔭 AstroBot — RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---
## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---
## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Fast open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free CPU/GPU hosting; CI/CD via git push |

### What it does
- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do
- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.

---
## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│  OFFLINE (once)                                              │
│                                                              │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet) │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  HF SPACE (at startup)                                       │
│                                                              │
│  data_loader.py                                              │
│   └── load_dataset() from HF Hub ──► list[Document]          │
│                                                              │
│  vector_store.py                                             │
│   ├── RecursiveCharacterTextSplitter ──► Chunks              │
│   ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors          │
│   └── FAISS.from_documents() ──► Index                       │
│                                                              │
│  llm.py                                                      │
│   └── Groq(api_key) ──► Groq Client                          │
│                                                              │
│  rag_pipeline.py                                             │
│   └── RAGPipeline(index, groq_client) ──► Ready              │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│  HF SPACE (per query)                                        │
│                                                              │
│  Student Question                                            │
│        │                                                     │
│        ▼                                                     │
│  rag_pipeline.query()                                        │
│   ├── vector_store.retrieve() ──► Top-K Chunks               │
│   └── llm.generate_answer() ──► Grounded Answer              │
│                                                              │
│  app.py ──► Gradio UI ──► Student sees answer                │
└──────────────────────────────────────────────────────────────┘
```

---
## File Structure

```
astrobot/
│
├── app.py             # Gradio UI — entry point for HF Spaces
├── config.py          # All configuration (env vars, hyperparameters)
├── data_loader.py     # HF dataset fetching + Document creation
├── vector_store.py    # Chunking, embedding, FAISS index
├── llm.py             # Groq client + prompt engineering
├── rag_pipeline.py    # Orchestrates retrieval → generation
│
├── convert_pdfs.py    # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt   # Python dependencies
└── PROJECT.md         # This file
```

---
## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once, here. |
| `data_loader.py` | Fetch data from HF Hub; detect the text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query the FAISS index. |
| `llm.py` | Validate the Groq key; build the system prompt; call the Groq API; return the answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:
- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in `llm.py` alone.
- You can replace the **UI** without touching any backend logic.
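The seam that makes these swaps cheap can be sketched as a structural interface. This is illustrative only — the repo does not actually define a `Retriever` protocol; `rag_pipeline.py` simply calls into `vector_store.py` — but any backend exposing the same shape would plug in the same way:

```python
from typing import Protocol


class Retriever(Protocol):
    """Hypothetical seam: anything with this shape can back the pipeline."""
    def retrieve(self, query: str, k: int) -> list[str]: ...


class FaissRetriever:
    """Stand-in for vector_store.py's FAISS-backed retrieval."""
    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    def retrieve(self, query: str, k: int) -> list[str]:
        # The real code ranks chunks by embedding similarity; this sketch just slices.
        return self.chunks[:k]


def answer(retriever: Retriever, query: str) -> str:
    # The pipeline depends only on the seam, never on the concrete backend.
    context = retriever.retrieve(query, k=2)
    return f"answer grounded in {len(context)} chunks"


print(answer(FaissRetriever(["c1", "c2", "c3"]), "What is a house?"))
# → answer grounded in 2 chunks
```

Swapping in Pinecone (or any other store) then means writing one new class with a `retrieve` method, not touching the pipeline.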
---

## Data Pipeline

### Step 1 — Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
    --pdf_dir ./astrology_books \
    --repo_id YOUR_USERNAME/astrology-course-materials \
    --private   # optional
```

This will:
1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
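The row-building step can be sketched as follows (a minimal sketch assuming the `source`/`page`/`text` schema above; the real script extracts page text with pypdf and pushes with `datasets`, both stubbed out here):

```python
def build_rows(pages_by_pdf: dict[str, list[str]]) -> list[dict]:
    """Flatten extracted pages into one record per page, matching the dataset schema."""
    rows = []
    for source, pages in pages_by_pdf.items():
        for page_num, text in enumerate(pages, start=1):
            if text.strip():  # skip blank / unextractable pages
                rows.append({"source": source, "page": page_num, "text": text})
    return rows


# Stand-in for pypdf output: two PDFs with their per-page texts.
rows = build_rows({
    "planets.pdf": ["Mars rules Aries.", ""],          # second page is blank
    "houses.pdf": ["The first house is the Ascendant."],
})
print(len(rows))  # → 2 (the blank page is dropped)
```

`datasets.Dataset.from_list(rows)` would then give the Parquet-backed dataset that gets pushed to the Hub.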
### Step 2 — Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in the Space secrets (see below).

### Step 3 — What happens at startup

```
load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s
```

The index is built once per Space restart and held in memory.
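The chunking step behaves roughly like this sliding window (a pure-Python sketch of what the splitter does with `chunk_size=512`, `chunk_overlap=64`; the real `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as newlines):

```python
def chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of chunk_size chars, each overlapping the previous by `overlap`."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


pieces = chunk("x" * 1000)
print([len(p) for p in pieces])  # → [512, 512, 104]
```

The 64-character overlap means a sentence cut at a chunk boundary is still fully present in the next chunk, which keeps retrieval from losing context at the seams.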
---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:
```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---
## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets

python app.py
```

---
## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** — it will re-index on the next startup.

No code changes required.

---
## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `top_k` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---
## Troubleshooting

### Space fails to start
- Check the **Logs** tab in the Space for Python errors.
- Verify the secrets are set (`GROQ_API_KEY`, `HF_DATASET`, and `HF_TOKEN` for private datasets).

### "GROQ_API_KEY is not set"
- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"
- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
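For example, if your dataset stores its text in a column called `page_text` (a hypothetical name), extend the candidate list in `config.py` like this (trimmed-down sketch of the real dataclass):

```python
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # Ordered candidate names; detection returns the first match (case-insensitive).
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text",
        "page_text",  # hypothetical custom column added here
    ])


cfg = AppConfig()
print("page_text" in cfg.text_column_candidates)  # → True
```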
### Answers seem unrelated to the question
- Increase `top_k` in `config.py` (try 7–10).
- Decrease `chunk_size` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors
- Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.
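As a rough capacity estimate (assumed numbers: the 14,400 tokens/minute figure above, and ~2,000 tokens per question once the system prompt, retrieved chunks, and the 1,024-token answer budget are counted):

```python
TOKENS_PER_MINUTE = 14_400   # free-tier budget quoted above
TOKENS_PER_QUERY = 2_000     # assumed: system prompt + top-5 chunks + answer

queries_per_minute = TOKENS_PER_MINUTE // TOKENS_PER_QUERY
print(queries_per_minute)  # → 7
```

So a whole classroom asking at once will hit the limit within the first minute — hence the suggestion to rate-limit the UI.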
---
README.md CHANGED
@@ -1,13 +1,21 @@
- ---
- title: DemoChatBot
- emoji: ⚡
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 6.6.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: Astrobot
+ emoji: 📈
+ colorFrom: pink
+ colorTo: green
+ sdk: gradio
+ sdk_version: 6.6.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+ # 🔭 AstroBot
+
+ **RAG-powered astrology tutor for students.**
+ Ask about planets, houses, signs, aspects, transits — grounded in your course materials.
+
+ > 📚 Explains concepts only · No personal predictions · No chart readings
+
+ See [PROJECT.md](PROJECT.md) for full documentation.
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,221 @@
"""
app.py
──────
Gradio UI — the entry point for Hugging Face Spaces.

This module ONLY handles UI concerns:
- Layout and theming
- Wiring user inputs to the RAG pipeline
- Displaying answers and source citations
- Error handling / friendly messages

It delegates ALL logic to rag_pipeline.py.
"""

import inspect as _inspect
import logging
import sys

import gradio as gr

from config import cfg
from rag_pipeline import RAGPipeline, build_pipeline

# ── Gradio version guard ──────────────────────────────────────────────────────
# Detect which optional Chatbot kwargs are available in the installed version.
_chatbot_params = set(_inspect.signature(gr.Chatbot.__init__).parameters)
_SUPPORTS_COPY = "show_copy_button" in _chatbot_params
_SUPPORTS_BUBBLE = "bubble_full_width" in _chatbot_params

# ── Logging setup ─────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ── Pipeline (initialised once at startup) ────────────────────────────────────
pipeline: RAGPipeline | None = None
init_error: str | None = None

try:
    pipeline = build_pipeline()
except Exception as exc:
    init_error = str(exc)
    logger.exception("Pipeline initialisation failed: %s", exc)


# ── Chat handler ──────────────────────────────────────────────────────────────

def _msg(role: str, content: str) -> dict:
    """Return a Gradio-compatible message dict."""
    return {"role": role, "content": content}


def chat(user_message: str, history: list, show_sources: bool):
    """
    Called by Gradio on every user message.

    Parameters
    ----------
    user_message : str
    history : list
        Gradio chat history — list of {"role": ..., "content": ...} dicts.
    show_sources : bool
        Whether to append source citations below the answer.

    Returns
    -------
    tuple[str, list, str]
        (cleared input, updated history, sources markdown)
    """
    if init_error:
        bot_reply = f"⚠️ **Setup error:** {init_error}\n\nPlease check your Space secrets and logs."
        history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
        return "", history, ""

    if not user_message.strip():
        return "", history, ""

    try:
        response = pipeline.query(user_message)  # type: ignore[union-attr]
        bot_reply = response.answer
        sources_md = response.format_sources() if show_sources else ""
    except Exception as exc:
        logger.exception("Error during query: %s", exc)
        bot_reply = "🔭 Something went wrong while consulting the stars. Please try again."
        sources_md = ""

    history = history + [_msg("user", user_message), _msg("assistant", bot_reply)]
    return "", history, sources_md

+ # ── Gradio UI ─────────────────────────────────────────────────────────────────
94
+
95
+ CSS = """
96
+ /* AstroBot custom styles */
97
+ body, .gradio-container { font-family: 'Georgia', serif; }
98
+ .title-banner { text-align: center; padding: 1rem 0 0.5rem; }
99
+ .title-banner h1 { font-size: 2rem; letter-spacing: 0.04em; }
100
+ .disclaimer {
101
+ background: #1a1a2e; color: #a0aec0; border-radius: 8px;
102
+ padding: 0.6rem 1rem; font-size: 0.82rem; margin-bottom: 0.5rem;
103
+ }
104
+ .sources-box { font-size: 0.82rem; color: #718096; }
105
+ footer { display: none !important; }
106
+ """
107
+
108
+ EXAMPLE_QUESTIONS = [
109
+ "What is the difference between the Sun sign and Rising sign?",
110
+ "Explain what retrograde motion means for planets.",
111
+ "What are the 12 houses in a birth chart?",
112
+ "How do I interpret a conjunction aspect?",
113
+ "What does it mean when Mars is in Aries?",
114
+ "Explain the concept of planetary dignities and debilities.",
115
+ "What is the difference between sidereal and tropical zodiac?",
116
+ "How does the Moon sign influence emotions?",
117
+ ]
118
+
119
+ # ── Gradio version-safe theme ─────────────────────────────────────────────────
120
+ _SUPPORTS_THEMES = hasattr(gr, "themes") and hasattr(gr.themes, "Base")
121
+ _theme = gr.themes.Base(
122
+ primary_hue="indigo",
123
+ secondary_hue="purple",
124
+ neutral_hue="slate",
125
+ ) if _SUPPORTS_THEMES else None
126
+
127
+ with gr.Blocks(
128
+ title=cfg.app_title,
129
+ theme=_theme,
130
+ css=CSS,
131
+ ) as demo:
132
+
133
+ # ── Header ────────────────────────────────────────────────────────────────
134
+ gr.HTML(
135
+ """
136
+ <div class="title-banner">
137
+ <h1>πŸ”­ AstroBot</h1>
138
+ <p style="color:#9b8ec4; font-size:1.05rem;">
139
+ Your AI Astrology Tutor Β· Powered by Groq LLaMA-3.1-8b-instant
140
+ </p>
141
+ </div>
142
+ """
143
+ )
144
+
145
+ gr.HTML(
146
+ """
147
+ <div class="disclaimer">
148
+ πŸ“š <strong>For students only.</strong>
149
+ AstroBot explains astrological <em>concepts</em> drawn from your course materials.
150
+ It does <strong>not</strong> make personal predictions or interpret individual birth charts.
151
+ </div>
152
+ """
153
+ )
154
+
155
+ # ── Main layout ───────────────────────────────────────────────────────────
156
+ with gr.Row():
157
+ with gr.Column(scale=3):
158
+ _chatbot_kwargs = {"label": "AstroBot", "height": 500}
159
+ if _SUPPORTS_BUBBLE:
160
+ _chatbot_kwargs["bubble_full_width"] = False
161
+ if _SUPPORTS_COPY:
162
+ _chatbot_kwargs["show_copy_button"] = True
163
+ if "type" in _chatbot_params:
164
+ _chatbot_kwargs["type"] = "messages" # role/content dict format
165
+ chatbot = gr.Chatbot(**_chatbot_kwargs)
166
+ with gr.Row():
167
+ txt_input = gr.Textbox(
168
+ placeholder="Ask a concept question about astrology…",
169
+ show_label=False,
170
+ scale=9,
171
+ )
172
+ send_btn = gr.Button("Ask ✨", variant="primary", scale=1)
173
+
174
+ with gr.Column(scale=1):
175
+ gr.Markdown("### βš™οΈ Options")
176
+ _checkbox_kwargs = {
177
+ "label": "Show source excerpts",
178
+ "value": False,
179
+ }
180
+ _checkbox_params = set(_inspect.signature(gr.Checkbox.__init__).parameters)
181
+ if "info" in _checkbox_params:
182
+ _checkbox_kwargs["info"] = "Display the course material passages used to answer."
183
+ show_sources = gr.Checkbox(**_checkbox_kwargs)
184
+ gr.Markdown("### πŸ’‘ Example Questions")
185
+ for q in EXAMPLE_QUESTIONS:
186
+ gr.Button(q, size="sm").click(
187
+ fn=lambda x=q: x, outputs=txt_input
188
+ )
189
+
190
+ # ── Source citations panel ────────────────────────────────────────────────
191
+ sources_display = gr.Markdown(
192
+ value="",
193
+ label="Source Excerpts",
194
+ elem_classes=["sources-box"],
195
+ )
196
+
197
+ # ── State ────────────────────────────────────────────────────────────────
198
+ state = gr.State([])
199
+
200
+ # ── Event wiring ──────────────────────────────────────────────────────────
201
+ send_btn.click(
202
+ fn=chat,
203
+ inputs=[txt_input, state, show_sources],
204
+ outputs=[txt_input, chatbot, sources_display],
205
+ )
206
+ txt_input.submit(
207
+ fn=chat,
208
+ inputs=[txt_input, state, show_sources],
209
+ outputs=[txt_input, chatbot, sources_display],
210
+ )
211
+
212
+ # ── Footer ────────────────────────────────────────────────────────────────
213
+ gr.Markdown(
214
+ "_Built with [Groq](https://groq.com) Β· [LangChain](https://langchain.com) Β· "
215
+ "[Hugging Face](https://huggingface.co) β€” for astrology students everywhere πŸŒ™_"
216
+ )
217
+
218
+
219
+ # ── Entry point ───────────────────────────────────────────────────────────────
220
+ if __name__ == "__main__":
221
+ demo.launch(server_name="0.0.0.0", server_port=7860)
config.py ADDED
@@ -0,0 +1,46 @@
"""
config.py
─────────
Central configuration for the AstroBot RAG application.
All tuneable parameters live here — change once, take effect everywhere.
"""

import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # ── Groq LLM ──────────────────────────────────────────────────────────────
    groq_api_key: str = field(default_factory=lambda: os.environ.get("GROQ_API_KEY", ""))
    groq_model: str = "llama-3.1-8b-instant"
    groq_temperature: float = 0.2
    groq_max_tokens: int = 1024

    # ── Hugging Face Dataset ──────────────────────────────────────────────────
    hf_dataset: str = field(default_factory=lambda: os.environ.get("HF_DATASET", ""))
    hf_token: str = field(default_factory=lambda: os.environ.get("HF_TOKEN", ""))
    dataset_split: str = "train"

    # Ordered list of candidate column names that hold the raw text
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text"
    ])

    # ── Embeddings & Retrieval ────────────────────────────────────────────────
    embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 5

    # ── App Meta ──────────────────────────────────────────────────────────────
    app_title: str = "🔭 AstroBot — Astrology Learning Assistant"
    app_description: str = (
        "Ask me anything about astrology concepts — planets, houses, aspects, "
        "signs, transits, chart reading, and more. "
        "**Note:** This bot explains concepts only; no personal predictions are made."
    )


# Singleton — import this everywhere
cfg = AppConfig()
convert_pdfs.py ADDED
@@ -0,0 +1,46 @@
"""
config.py
─────────
Central configuration for the AstroBot RAG application.
All tuneable parameters live here — change once, take effect everywhere.
"""

import os
from dataclasses import dataclass, field


@dataclass
class AppConfig:
    # ── Groq LLM ──────────────────────────────────────────────────────────────
    groq_api_key: str = field(default_factory=lambda: os.environ.get("GROQ_API_KEY", ""))
    groq_model: str = "llama-3.1-8b-instant"
    groq_temperature: float = 0.2
    groq_max_tokens: int = 1024

    # ── Hugging Face Dataset ──────────────────────────────────────────────────
    hf_dataset: str = field(default_factory=lambda: os.environ.get("HF_DATASET", ""))
    hf_token: str = field(default_factory=lambda: os.environ.get("HF_TOKEN", ""))
    dataset_split: str = "train"

    # Ordered list of candidate column names that hold the raw text
    text_column_candidates: list = field(default_factory=lambda: [
        "text", "content", "body", "page_content", "extracted_text"
    ])

    # ── Embeddings & Retrieval ────────────────────────────────────────────────
    embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 5

    # ── App Meta ──────────────────────────────────────────────────────────────
    app_title: str = "🔭 AstroBot — Astrology Learning Assistant"
    app_description: str = (
        "Ask me anything about astrology concepts — planets, houses, aspects, "
        "signs, transits, chart reading, and more. "
        "**Note:** This bot explains concepts only; no personal predictions are made."
    )


# Singleton — import this everywhere
cfg = AppConfig()
data_loader.py ADDED
@@ -0,0 +1,105 @@
"""
data_loader.py
──────────────
Loads the Parquet-backed PDF dataset from Hugging Face Hub and returns
a list of LangChain Document objects ready for indexing.

Responsibilities:
- Connect to HF Hub (handles both public and private datasets)
- Auto-detect the text column
- Yield Document objects with rich metadata (source file, page number, etc.)
"""

import logging

import pandas as pd
from datasets import load_dataset
from langchain_core.documents import Document

from config import cfg

logger = logging.getLogger(__name__)


# ── Public API ────────────────────────────────────────────────────────────────

def load_documents() -> list[Document]:
    """
    Entry point: load the HF dataset and return chunk-ready Document objects.

    Returns
    -------
    list[Document]
        One Document per non-empty row, with metadata preserved.

    Raises
    ------
    ValueError
        If the dataset is not configured or no usable text column is found.
    """
    if not cfg.hf_dataset:
        raise ValueError(
            "HF_DATASET env var is not set. "
            "Set it to 'username/dataset-name' in your Space secrets."
        )

    df = _fetch_dataframe()
    text_col = _detect_text_column(df)
    documents = _build_documents(df, text_col)

    logger.info("Loaded %d documents from '%s' (column: '%s')",
                len(documents), cfg.hf_dataset, text_col)
    return documents


# ── Internal helpers ──────────────────────────────────────────────────────────

def _fetch_dataframe() -> pd.DataFrame:
    """Download the dataset split from HF Hub and return it as a DataFrame."""
    logger.info("Fetching dataset '%s' split='%s' …", cfg.hf_dataset, cfg.dataset_split)
    ds = load_dataset(
        cfg.hf_dataset,
        split=cfg.dataset_split,
        token=cfg.hf_token or None,
    )
    df = ds.to_pandas()
    logger.info("Dataset shape: %s | columns: %s", df.shape, df.columns.tolist())
    return df


def _detect_text_column(df: pd.DataFrame) -> str:
    """
    Find the first column whose lowercase name matches a known text-column
    name. Falls back to the first column if none match.
    """
    col_lower = {c.lower(): c for c in df.columns}
    for candidate in cfg.text_column_candidates:
        if candidate in col_lower:
            return col_lower[candidate]

    fallback = df.columns[0]
    logger.warning(
        "No known text column found. Falling back to '%s'. "
        "Expected one of: %s",
        fallback, cfg.text_column_candidates,
    )
    return fallback


def _build_documents(df: pd.DataFrame, text_col: str) -> list[Document]:
    """Convert DataFrame rows into LangChain Document objects with metadata."""
    meta_cols = [c for c in df.columns if c != text_col]

    documents: list[Document] = []
    for row_idx, row in df.iterrows():
        text = str(row[text_col]).strip()
        if not text or text.lower() == "nan":
            continue  # skip empty rows

        metadata = {col: str(row.get(col, "")) for col in meta_cols}
        metadata["source_row"] = int(row_idx)  # type: ignore[arg-type]

        documents.append(Document(page_content=text, metadata=metadata))

    return documents
llm.py ADDED
@@ -0,0 +1,119 @@
1
+ """
2
+ llm.py
3
+ ──────
4
+ Wraps the Groq API client and owns all prompt engineering for AstroBot.
5
+
6
+ Responsibilities:
7
+ - Validate the Groq API key at startup
8
+ - Build the system prompt (astrology tutor persona + no-prediction guardrail)
9
+ - Format retrieved context chunks into the prompt
10
+ - Call the Groq chat completion endpoint and return the answer string
11
+ """
12
+
13
+ import logging
14
+
15
+ from groq import Groq
16
+ from langchain_core.documents import Document
17
+
18
+ from config import cfg
19
+
20
+ logger = logging.getLogger(__name__)
+
+ # ── System prompt ─────────────────────────────────────────────────────────────
+ # Defines the bot's persona, scope, and hard guardrails.
+
+ SYSTEM_TEMPLATE = """You are AstroBot, a patient and knowledgeable astrology tutor.
+ Your students are learning astrology concepts. Your role is to:
+ β€’ Explain astrological concepts clearly and accurately using the provided context.
+ β€’ Use analogies and examples to make complex ideas approachable.
+ β€’ Reference classical and modern astrology where relevant.
+ β€’ Encourage curiosity and deeper study.
+
+ HARD RULES β€” never break these:
+ 1. Do NOT make personal predictions or interpret anyone's birth chart.
+ 2. Do NOT speculate about future events for specific individuals.
+ 3. If the context does not contain enough information to answer, say so honestly
+    and suggest the student consult a textbook or senior practitioner.
+ 4. Keep answers focused on educational content only.
+
+ --- CONTEXT FROM COURSE MATERIALS ---
+ {context}
+ --- END OF CONTEXT ---
+
+ Answer the student's question based solely on the context above.
+ If the answer isn't in the context, say: "I don't have that in my course materials right now β€”
+ let me point you to further study resources."
+ """
+
+
+ # ── Public API ────────────────────────────────────────────────────────────────
+
+ def create_client() -> Groq:
+     """
+     Initialise and validate the Groq client.
+
+     Raises
+     ------
+     ValueError
+         If GROQ_API_KEY is missing.
+     """
+     if not cfg.groq_api_key:
+         raise ValueError(
+             "GROQ_API_KEY is not set. Add it in Space β†’ Settings β†’ Repository secrets."
+         )
+     logger.info("Groq client initialised (model: %s)", cfg.groq_model)
+     return Groq(api_key=cfg.groq_api_key)
+
+
+ def generate_answer(client: Groq, query: str, context_docs: list[Document]) -> str:
+     """
+     Build the RAG prompt and call Groq to get an answer.
+
+     Parameters
+     ----------
+     client : Groq
+         Groq client returned by create_client().
+     query : str
+         The student's question.
+     context_docs : list[Document]
+         Retrieved chunks from the vector store.
+
+     Returns
+     -------
+     str
+         The model's answer string.
+     """
+     context_text = _format_context(context_docs)
+     system_prompt = SYSTEM_TEMPLATE.format(context=context_text)
+
+     logger.debug("Calling Groq | model=%s | context_chunks=%d", cfg.groq_model, len(context_docs))
+
+     response = client.chat.completions.create(
+         model=cfg.groq_model,
+         messages=[
+             {"role": "system", "content": system_prompt},
+             {"role": "user", "content": query},
+         ],
+         temperature=cfg.groq_temperature,
+         max_tokens=cfg.groq_max_tokens,
+     )
+
+     answer = response.choices[0].message.content
+     logger.debug("Groq response: %d chars", len(answer))
+     return answer
+
+
+ # ── Internal helpers ──────────────────────────────────────────────────────────
+
+ def _format_context(docs: list[Document]) -> str:
+     """
+     Format retrieved documents into a numbered context block
+     that is easy for the LLM to parse.
+     """
+     blocks = []
+     for i, doc in enumerate(docs, 1):
+         # Fall back to the Parquet row reference; empty string keeps the
+         # `if source` guard meaningful (the loop index is always truthy).
+         source = doc.metadata.get("source", doc.metadata.get("source_row", ""))
+         page = doc.metadata.get("page", "")
+         header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
+         blocks.append(f"{header}\n{doc.page_content}")
+     return "\n\n".join(blocks)
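A quick way to sanity-check `_format_context` is to run its logic on stub documents. The sketch below reproduces the formatting step with a minimal `Doc` stand-in for LangChain's `Document`, so it runs with no dependencies; the filename and metadata values are made up for illustration.

```python
# Minimal stand-in for langchain_core.documents.Document, for illustration only.
from dataclasses import dataclass, field


@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Mirror of llm.py's _format_context: numbered, source-labelled blocks."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "")
        page = doc.metadata.get("page", "")
        header = f"[Source {i}" + (f" | {source}" if source else "") + (f" | p.{page}" if page else "") + "]"
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [Doc("Mars rules Aries.", {"source": "planets.pdf", "page": 3})]
print(format_context(docs))
# [Source 1 | planets.pdf | p.3]
# Mars rules Aries.
```

The bracketed headers give the model stable labels it can cite, which is what makes the `format_sources` display in `rag_pipeline.py` line up with the answer text.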
rag_pipeline.py ADDED
@@ -0,0 +1,135 @@
+ """
+ rag_pipeline.py
+ ───────────────
+ Orchestrates the full RAG pipeline: query β†’ retrieve β†’ generate β†’ answer.
+
+ This module is the single integration point between the vector store and
+ the LLM. The UI layer (app.py) calls only this module; it knows nothing
+ about FAISS or Groq directly.
+
+ Pipeline steps
+ ──────────────
+ 1. Validate query (non-empty, reasonable length)
+ 2. Retrieve top-k relevant chunks from FAISS
+ 3. Pass chunks + query to the LLM for grounded generation
+ 4. Return the answer (and optionally the source snippets for transparency)
+ """
+
+ import logging
+ from dataclasses import dataclass
+
+ from groq import Groq
+ from langchain_community.vectorstores import FAISS
+ from langchain_core.documents import Document
+
+ import llm as llm_module
+ import vector_store as vs_module
+ from config import cfg
+
+ logger = logging.getLogger(__name__)
+
+ MAX_QUERY_LENGTH = 1000  # characters
+
+
+ # ── Data classes ──────────────────────────────────────────────────────────────
+
+ @dataclass
+ class RAGResponse:
+     answer: str
+     sources: list[Document]
+     query: str
+
+     def format_sources(self) -> str:
+         """Return a compact source-citation string for display in the UI."""
+         if not self.sources:
+             return ""
+         lines = []
+         for i, doc in enumerate(self.sources, 1):
+             src = doc.metadata.get("source", "")
+             page = doc.metadata.get("page", "")
+             snippet = doc.page_content[:120].replace("\n", " ")
+             if len(doc.page_content) > 120:
+                 snippet += "…"
+             label = f"**[{i}]**"
+             if src:
+                 label += f" {src}"
+             if page:
+                 label += f" p.{page}"
+             lines.append(f"{label}: _{snippet}_")
+         return "\n".join(lines)
+
+
+ # ── Pipeline class ────────────────────────────────────────────────────────────
+
+ class RAGPipeline:
+     """
+     Stateful pipeline object. Instantiated once at app startup and reused
+     for every student query throughout the session.
+     """
+
+     def __init__(self, index: FAISS, groq_client: Groq) -> None:
+         self._index = index
+         self._client = groq_client
+         logger.info("RAGPipeline ready βœ“")
+
+     # ── Public ────────────────────────────────────────────────────────────────
+
+     def query(self, user_query: str) -> RAGResponse:
+         """
+         Run the full RAG pipeline for a single student question.
+
+         Parameters
+         ----------
+         user_query : str
+             Raw question text from the student.
+
+         Returns
+         -------
+         RAGResponse
+             Contains the answer string and the source Documents used.
+         """
+         validated = self._validate_query(user_query)
+         if validated is None:
+             return RAGResponse(
+                 answer=f"Please enter a valid question (non-empty, under {MAX_QUERY_LENGTH} characters).",
+                 sources=[],
+                 query=user_query,
+             )
+
+         logger.info("Processing query: '%s'", validated[:80])
+
+         # Step 1 β€” Retrieve
+         context_docs = vs_module.retrieve(self._index, validated, k=cfg.top_k)
+
+         # Step 2 β€” Generate
+         answer = llm_module.generate_answer(self._client, validated, context_docs)
+
+         return RAGResponse(answer=answer, sources=context_docs, query=validated)
+
+     # ── Internal ──────────────────────────────────────────────────────────────
+
+     @staticmethod
+     def _validate_query(query: str) -> str | None:
+         """Return the stripped query if valid, else None."""
+         stripped = query.strip()
+         if not stripped or len(stripped) > MAX_QUERY_LENGTH:
+             return None
+         return stripped
+
+
+ # ── Factory function ─────────────────────────────────────────────────────────
+
+ def build_pipeline() -> RAGPipeline:
+     """
+     Convenience factory: load data, build index, init LLM, return pipeline.
+     Import and call this once from app.py.
+     """
+     from data_loader import load_documents  # local import avoids circular deps
+
+     logger.info("=== Building AstroBot RAG Pipeline ===")
+
+     docs = load_documents()
+     index = vs_module.build_index(docs)
+     client = llm_module.create_client()
+     pipeline = RAGPipeline(index=index, groq_client=client)
+
+     logger.info("=== AstroBot Pipeline Ready βœ“ ===")
+     return pipeline
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ groq>=0.5.0
+ gradio>=4.36.0
+ datasets>=2.19.0
+ pandas>=2.0.0
+ langchain>=0.2.0
+ langchain-core>=0.2.0
+ langchain-community>=0.2.0
+ faiss-cpu>=1.8.0
+ sentence-transformers>=3.0.0
+ huggingface-hub>=0.23.0
+ langchain-text-splitters>=0.2.0
vector_store.py ADDED
@@ -0,0 +1,103 @@
+ """
+ vector_store.py
+ ───────────────
+ Handles text chunking, embedding, and FAISS vector store creation/querying.
+
+ Responsibilities:
+ - Split raw Documents into overlapping chunks
+ - Embed chunks using a local HuggingFace sentence-transformer
+ - Build and expose a FAISS index for similarity search
+ - Provide a clean retrieve() function used by the RAG pipeline
+ """
+
+ import logging
+
+ from langchain_core.documents import Document
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ from config import cfg
+
+ logger = logging.getLogger(__name__)
+
+
+ # ── Public API ────────────────────────────────────────────────────────────────
+
+ def build_index(documents: list[Document]) -> FAISS:
+     """
+     Chunk β†’ embed β†’ index the supplied documents.
+
+     Parameters
+     ----------
+     documents : list[Document]
+         Raw documents returned by data_loader.load_documents().
+
+     Returns
+     -------
+     FAISS
+         A ready-to-query FAISS vector store.
+     """
+     chunks = _chunk_documents(documents)
+     embeddings = _load_embeddings()
+     index = _create_faiss_index(chunks, embeddings)
+     return index
+
+
+ def retrieve(index: FAISS, query: str, k: int | None = None) -> list[Document]:
+     """
+     Retrieve the top-k most relevant chunks for a given query.
+
+     Parameters
+     ----------
+     index : FAISS
+         The FAISS vector store built by build_index().
+     query : str
+         The user's natural-language question.
+     k : int, optional
+         Number of results to return. Defaults to cfg.top_k.
+
+     Returns
+     -------
+     list[Document]
+         Retrieved chunks, most relevant first.
+     """
+     k = k or cfg.top_k
+     results = index.similarity_search(query, k=k)
+     logger.debug("Retrieved %d chunks for query: '%s'", len(results), query[:80])
+     return results
+
+
+ # ── Internal helpers ──────────────────────────────────────────────────────────
+
+ def _chunk_documents(documents: list[Document]) -> list[Document]:
+     """Split documents into smaller overlapping chunks."""
+     splitter = RecursiveCharacterTextSplitter(
+         chunk_size=cfg.chunk_size,
+         chunk_overlap=cfg.chunk_overlap,
+         separators=["\n\n", "\n", ". ", " ", ""],
+     )
+     chunks = splitter.split_documents(documents)
+     logger.info(
+         "Chunking: %d raw docs β†’ %d chunks (size=%d, overlap=%d)",
+         len(documents), len(chunks), cfg.chunk_size, cfg.chunk_overlap,
+     )
+     return chunks
+
+
+ def _load_embeddings() -> HuggingFaceEmbeddings:
+     """Load the local sentence-transformer embedding model."""
+     logger.info("Loading embedding model: %s", cfg.embed_model)
+     return HuggingFaceEmbeddings(
+         model_name=cfg.embed_model,
+         model_kwargs={"device": "cpu"},
+         encode_kwargs={"normalize_embeddings": True},
+     )
+
+
+ def _create_faiss_index(chunks: list[Document], embeddings: HuggingFaceEmbeddings) -> FAISS:
+     """Embed all chunks and build the FAISS index."""
+     logger.info("Building FAISS index over %d chunks …", len(chunks))
+     index = FAISS.from_documents(chunks, embeddings)
+     logger.info("FAISS index built βœ“ (vectors: %d)", index.index.ntotal)
+     return index
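The chunk-size/overlap interplay in `_chunk_documents` can be pictured with a toy sliding window. This is not what `RecursiveCharacterTextSplitter` actually does (it splits on the separator hierarchy first and only falls back to character cuts), but it shows why neighbouring chunks share `chunk_overlap` characters of context at the seams.

```python
# Toy sliding-window chunker: fixed-size windows advancing by size - overlap.
# Illustrative only; the real splitter is separator-aware, not a plain window.
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap  # how far each window advances
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = sliding_chunks("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap is what lets a retrieved chunk carry the tail of the previous sentence, so an answer that straddles a chunk boundary is still recoverable from a single hit.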