# 🔭 AstroBot — RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---

## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**. It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.
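The retrieval half of RAG can be sketched in a few lines: score the question's embedding against pre-embedded chunks and keep the closest matches. The 3-dimensional vectors below are illustrative stand-ins, not real embeddings — AstroBot itself uses MiniLM embeddings and a FAISS index for this step.

```python
# Toy sketch of semantic retrieval: rank chunks by cosine similarity
# to the query vector. All vectors here are made up for illustration.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend-embeddings for three course-material chunks.
chunks = {
    "Mars rules Aries and is associated with drive.": [0.9, 0.1, 0.0],
    "The 7th house concerns partnerships.": [0.1, 0.8, 0.2],
    "A trine is a harmonious 120-degree aspect.": [0.0, 0.2, 0.9],
}

query_vec = [0.85, 0.15, 0.05]  # pretend-embedding of "Tell me about Mars"

# Rank chunks by similarity and keep the best two (AstroBot keeps TOP_K = 5).
top = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)[:2]
# top[0] is the Mars chunk
```

FAISS performs this same ranking, just over thousands of chunks with an optimized index instead of a Python loop.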
## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Fastest open-model inference; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free GPU/CPU hosting; CI/CD via git push |

### What it does

- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in actual course material (no hallucination of unsupported facts).
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do

- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.
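These guardrails are enforced at the prompt level in `llm.py`. A hedged sketch of what that system prompt and message assembly could look like — the actual wording and helper names in `llm.py` may differ:

```python
# Hypothetical sketch of llm.py's guardrailed system prompt. The real
# prompt text is defined in llm.py; this only illustrates the pattern.
SYSTEM_PROMPT = (
    "You are AstroBot, an educational assistant for astrology students. "
    "Answer only from the supplied course-material context; if the context "
    "does not contain the answer, say so. "
    "Never make personal predictions of any kind. "
    "Never interpret an individual's birth chart; politely decline instead. "
    "Decline questions unrelated to astrology concepts."
)

def build_messages(question: str, context: str) -> list:
    """Assemble a chat-completions style message list (the shape Groq accepts)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Because the refusal rules live in one string, tightening or loosening the persona is a one-line edit in `llm.py`, exactly as the module table below describes.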
---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         OFFLINE (once)                          │
│                                                                 │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (at startup)                      │
│                                                                 │
│  data_loader.py                                                 │
│  └── load_dataset() from HF Hub ──► list[Document]              │
│                                                                 │
│  vector_store.py                                                │
│  ├── RecursiveCharacterTextSplitter ──► Chunks                  │
│  ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors              │
│  └── FAISS.from_documents() ──► Index                           │
│                                                                 │
│  llm.py                                                         │
│  └── Groq(api_key) ──► Groq Client                              │
│                                                                 │
│  rag_pipeline.py                                                │
│  └── RAGPipeline(index, groq_client) ──► Ready                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (per query)                       │
│                                                                 │
│  Student Question                                               │
│        │                                                        │
│        ▼                                                        │
│  rag_pipeline.query()                                           │
│  ├── vector_store.retrieve() ──► Top-K Chunks                   │
│  └── llm.generate_answer() ──► Grounded Answer                  │
│                                                                 │
│  app.py ──► Gradio UI ──► Student sees answer                   │
└─────────────────────────────────────────────────────────────────┘
```

---

## File Structure

```
astrobot/
│
├── app.py             # Gradio UI — entry point for HF Spaces
├── config.py          # All configuration (env vars, hyperparameters)
├── data_loader.py     # HF dataset fetching + Document creation
├── vector_store.py    # Chunking, embedding, FAISS index
├── llm.py             # Groq client + prompt engineering
├── rag_pipeline.py    # Orchestrates retrieval → generation
│
├── convert_pdfs.py    # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt   # Python dependencies
└── PROJECT.md         # This file
```

---

## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once here. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:

- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) in only `llm.py`.
- You can replace the **UI** without touching any backend logic.

---

## Data Pipeline

### Step 1 — Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub

python convert_pdfs.py \
  --pdf_dir ./astrology_books \
  --repo_id YOUR_USERNAME/astrology-course-materials \
  --private   # optional
```

This will:

1. Extract text from each PDF page-by-page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.

### Step 2 — Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).

### Step 3 — What happens at startup

```
load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s
```

The index is built once per Space restart and held in memory.

---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:

```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).

---

## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets

python app.py
```

---

## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space** — it will re-index on next startup.

No code changes required.

---

## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---

## Troubleshooting

### Space fails to start

- Check the **Logs** tab in the Space for Python errors.
- Verify the secrets are set: `GROQ_API_KEY` and `HF_DATASET` (plus `HF_TOKEN` for private datasets).
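One way to make a missing secret obvious in the Logs tab is a fail-fast check near the top of `config.py`. The sketch below is an assumption about how such a check could look, not the module's actual code — `missing_secrets` is a hypothetical helper:

```python
# Hypothetical fail-fast check for config.py: report every missing
# required secret at once, instead of crashing on the first one.
import os

# HF_TOKEN is deliberately not listed: it is optional (private datasets only).
REQUIRED = ("GROQ_API_KEY", "HF_DATASET")

def missing_secrets(env=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Simulated environment with one secret missing:
problems = missing_secrets({"GROQ_API_KEY": "gsk_demo"})
# problems == ["HF_DATASET"]
```

Printing `problems` at startup turns a cryptic traceback into a one-line diagnosis in the Space logs.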
### "GROQ_API_KEY is not set"

- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"

- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.

### Answers seem unrelated to the question

- Increase `TOP_K` in `config.py` (try 7–10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors

- Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.

---
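A client-side guard for that limit can be sketched as a sliding one-minute token budget checked before each Groq call. This is purely illustrative — `TokenBudget` is not part of AstroBot or Groq's SDK, and the per-minute number should be checked against your actual plan:

```python
# Sketch of a sliding-window token budget: refuse a request when the
# tokens spent in the last 60 seconds would exceed the per-minute limit.
import time
from collections import deque

class TokenBudget:
    def __init__(self, tokens_per_minute=14_400):
        self.limit = tokens_per_minute
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def allow(self, tokens, now=None):
        """Record the spend and return True, or return False if over budget."""
        now = time.monotonic() if now is None else now
        # Drop spend that is older than one minute.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()
        if sum(t for _, t in self.events) + tokens > self.limit:
            return False  # caller should queue the request or show a wait message
        self.events.append((now, tokens))
        return True
```

In the UI layer, a `False` result would translate to a friendly "AstroBot is busy, try again in a minute" message instead of a raw API error.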