# AstroBot: RAG-Powered Educational AI System

> AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A.
> It demonstrates:
>
> - End-to-end PDF ingestion → structured Parquet datasets
> - Semantic indexing with FAISS
> - Context-grounded LLM responses via Groq (LLaMA-3)
> - Modular architecture enabling easy LLM or vector DB swapping
> - Public deployment on Hugging Face Spaces (CI/CD via git push)

---
## Table of Contents

1. [Project Overview](#project-overview)
2. [Tech Stack](#tech-stack)
3. [Architecture](#architecture)
4. [File Structure](#file-structure)
5. [Module Responsibilities](#module-responsibilities)
6. [Data Pipeline](#data-pipeline)
7. [Setup & Deployment](#setup--deployment)
8. [Environment Variables](#environment-variables)
9. [How to Add New Course Materials](#how-to-add-new-course-materials)
10. [Limitations & Guardrails](#limitations--guardrails)
11. [Troubleshooting](#troubleshooting)

---
## Project Overview

AstroBot is a **Retrieval-Augmented Generation (RAG)** chatbot deployed on **Hugging Face Spaces**.
It is an educational companion for astrology students: they ask natural-language questions about astrological concepts and receive answers drawn exclusively from course textbooks and materials.

### What it does

- Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
- Grounds every answer in retrieved course material, to limit unsupported claims.
- Clearly declines to make personal predictions or interpret individual birth charts.

### What it does NOT do

- Make predictions of any kind.
- Interpret a specific person's chart.
- Answer questions unrelated to astrology concepts.

## Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | **Groq + LLaMA-3.1-8b-instant** | Very fast inference for open models; generous free tier |
| Vector DB | **FAISS (CPU)** | No external service needed; runs inside the Space |
| Embeddings | **sentence-transformers/all-MiniLM-L6-v2** | Lightweight, accurate, runs locally |
| Dataset | **HF Datasets (Parquet)** | Native HF Hub format; handles text extracted from large PDFs well |
| Framework | **LangChain** | Chunking utilities and Document schema |
| UI | **Gradio 4** | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | **Hugging Face Spaces** | Free CPU hosting (GPU hardware is a paid upgrade); CI/CD via git push |

---
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         OFFLINE (once)                          │
│                                                                 │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (at startup)                      │
│                                                                 │
│  data_loader.py                                                 │
│    └── load_dataset() from HF Hub        ──► list[Document]     │
│                                                                 │
│  vector_store.py                                                │
│    ├── RecursiveCharacterTextSplitter    ──► Chunks             │
│    ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors            │
│    └── FAISS.from_documents()            ──► Index              │
│                                                                 │
│  llm.py                                                         │
│    └── Groq(api_key)                     ──► Groq Client        │
│                                                                 │
│  rag_pipeline.py                                                │
│    └── RAGPipeline(index, groq_client)   ──► Ready              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      HF SPACE (per query)                       │
│                                                                 │
│  Student Question                                               │
│        │                                                        │
│        ▼                                                        │
│  rag_pipeline.query()                                           │
│    ├── vector_store.retrieve()           ──► Top-K Chunks       │
│    └── llm.generate_answer()             ──► Grounded Answer    │
│                                                                 │
│  app.py ──► Gradio UI ──► Student sees answer                   │
└─────────────────────────────────────────────────────────────────┘
```

---
## File Structure

```
astrobot/
│
├── app.py              # Gradio UI, entry point for HF Spaces
├── config.py           # All configuration (env vars, hyperparameters)
├── data_loader.py      # HF dataset fetching + Document creation
├── vector_store.py     # Chunking, embedding, FAISS index
├── llm.py              # Groq client + prompt engineering
├── rag_pipeline.py     # Orchestrates retrieval → generation
│
├── convert_pdfs.py     # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt    # Python dependencies
└── PROJECT.md          # This file
```

---
## Module Responsibilities

| Module | Single Responsibility |
|---|---|
| `config.py` | Central source of truth for all settings. Change a parameter once here. |
| `data_loader.py` | Fetch data from HF Hub; detect text column; return `list[Document]`. |
| `vector_store.py` | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| `llm.py` | Validate Groq key; build system prompt; call Groq API; return answer string. |
| `rag_pipeline.py` | Glue layer: validate query → retrieve → generate → return `RAGResponse`. |
| `app.py` | UI only: Gradio layout, event wiring, error display. No business logic. |
| `convert_pdfs.py` | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:

- You can swap **FAISS → Pinecone** by editing only `vector_store.py`.
- You can swap **Groq → OpenAI** by editing only `llm.py`.
- You can change the **system prompt** (persona, guardrails) by editing only `llm.py`.
- You can replace the **UI** without touching any backend logic.
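
One way to picture this separation is a glue layer that depends only on two small interfaces, so either side can be swapped without touching the other. The sketch below is illustrative only: the `Retriever` and `Generator` protocols, the `RAGResponse` fields, the constructor signature, and the two stub classes are all assumptions made for the example, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Protocol


class Retriever(Protocol):
    """Anything that can return the k most relevant chunks (hypothetical interface)."""
    def retrieve(self, query: str, k: int) -> list[str]: ...


class Generator(Protocol):
    """Anything that can answer a question from context chunks (hypothetical interface)."""
    def generate_answer(self, question: str, context: list[str]) -> str: ...


@dataclass
class RAGResponse:
    answer: str
    sources: list[str]  # assumed field: the chunks the answer was grounded in


class RAGPipeline:
    def __init__(self, retriever: Retriever, generator: Generator, top_k: int = 5):
        self.retriever = retriever
        self.generator = generator
        self.top_k = top_k

    def query(self, question: str) -> RAGResponse:
        # validate query -> retrieve -> generate -> return RAGResponse
        question = question.strip()
        if not question:
            raise ValueError("Question must not be empty.")
        chunks = self.retriever.retrieve(question, self.top_k)
        answer = self.generator.generate_answer(question, chunks)
        return RAGResponse(answer=answer, sources=chunks)


class KeywordRetriever:
    """Stub standing in for the FAISS-backed store: matches on shared words."""
    def __init__(self, chunks: list[str]):
        self.chunks = chunks

    def retrieve(self, query: str, k: int) -> list[str]:
        words = set(query.lower().split())
        hits = [c for c in self.chunks if words & set(c.lower().split())]
        return hits[:k]


class EchoGenerator:
    """Stub standing in for the Groq-backed LLM: reports what it was given."""
    def generate_answer(self, question: str, context: list[str]) -> str:
        if not context:
            return "Not found in the course materials."
        return f"Based on {len(context)} chunk(s): {context[0]}"
```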
---

## Data Pipeline

### Step 1: Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

```bash
pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
    --pdf_dir ./astrology_books \
    --repo_id YOUR_USERNAME/astrology-course-materials \
    --private   # optional
```

This will:

1. Extract text from each PDF page by page.
2. Build a `datasets.Dataset` with columns: `source`, `page`, `text`.
3. Push it to HF Hub as a Parquet-backed dataset.
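
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the actual `convert_pdfs.py`: the function names `pages_to_rows` and `convert`, the 1-based page numbering, and the choice to skip pages with no extractable text are assumptions for illustration. The library calls used (`pypdf.PdfReader`, `Dataset.from_list`, `Dataset.push_to_hub`) are imported inside `convert` so the row-building helper can be tried without them installed.

```python
from pathlib import Path


def pages_to_rows(source: str, page_texts: list) -> list[dict]:
    """Turn one PDF's page texts into rows with `source`, `page`, `text` columns."""
    rows = []
    for page_number, text in enumerate(page_texts, start=1):  # assumed 1-based pages
        text = (text or "").strip()
        if text:  # assumption: skip pages with no extractable text
            rows.append({"source": source, "page": page_number, "text": text})
    return rows


def convert(pdf_dir: str, repo_id: str, private: bool = False) -> None:
    """Extract every PDF in `pdf_dir` and push the result to the HF Hub."""
    # Third-party imports are local so pages_to_rows() works without them.
    from datasets import Dataset
    from pypdf import PdfReader

    rows = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        texts = [page.extract_text() for page in reader.pages]
        rows.extend(pages_to_rows(pdf_path.name, texts))

    # Requires being logged in to the Hub (or a token in the environment).
    Dataset.from_list(rows).push_to_hub(repo_id, private=private)
```

Keeping `source` and `page` on every row means a retrieved chunk can later be traced back to the book and page it came from.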
### Step 2: Connect to the Space

Set `HF_DATASET=YOUR_USERNAME/astrology-course-materials` in Space secrets (see below).

### Step 3: What happens at startup

```
load_dataset()                    # ~30s for large datasets
RecursiveCharacterTextSplitter    # chunk_size=512, overlap=64
HuggingFaceEmbeddings             # ~60s to encode all chunks
FAISS.from_documents()            # <5s
```
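To make the `chunk_size=512, overlap=64` setting concrete, here is a simplified fixed-window character splitter. It is not LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as paragraph and line boundaries; the function name `chunk_text` is hypothetical and only illustrates how consecutive chunks share their boundary characters.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `chunk_size` chars that overlap by `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1000))
    parts = chunk_text(sample)
    print([len(p) for p in parts])
    # The last 64 characters of one chunk open the next one.
    print(parts[0][-64:] == parts[1][:64])
```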
The index is built once per Space restart and held in memory.

---

## Setup & Deployment

### 1. Create a Hugging Face Space

- Go to [huggingface.co/new-space](https://huggingface.co/new-space)
- **SDK:** Gradio
- **Hardware:** CPU Basic (free)

### 2. Upload files

Upload these files to the Space repository:

```
app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt
```

### 3. Set secrets

Go to **Space → Settings → Repository secrets → New secret**

| Secret Name | Value |
|---|---|
| `GROQ_API_KEY` | From [console.groq.com](https://console.groq.com) → API Keys |
| `HF_DATASET` | `your-username/your-dataset-name` |
| `HF_TOKEN` | Your HF token (only needed for **private** datasets) |

### 4. Done

The Space will auto-rebuild. Startup takes ~3-5 minutes (embedding model download + indexing).

---
## Environment Variables

All variables are read in `config.py`. You can also set them locally for development:

```bash
export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""   # leave blank for public datasets
python app.py
```
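As a rough illustration of how a `config.py` might read these variables, the sketch below loads them into a frozen dataclass and fails early when a required one is missing. The `Settings` class, the `load_settings` function, and the field names are assumptions for the example; the defaults (`top_k=5`, `chunk_size=512`, `chunk_overlap=64`) mirror the values quoted elsewhere in this document. Passing a plain dict instead of `os.environ` makes the function easy to try out locally.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    hf_dataset: str
    hf_token: Optional[str]  # None for public datasets
    top_k: int = 5
    chunk_size: int = 512
    chunk_overlap: int = 64


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, failing fast on missing required values."""
    required = ("GROQ_API_KEY", "HF_DATASET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        groq_api_key=env["GROQ_API_KEY"],
        hf_dataset=env["HF_DATASET"],
        hf_token=env.get("HF_TOKEN") or None,  # blank string counts as "not set"
    )
```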
---

## How to Add New Course Materials

1. Add the new PDF(s) to your `./astrology_books/` folder.
2. Re-run `convert_pdfs.py` (it will overwrite the existing dataset).
3. **Restart the HF Space**; it will re-index on the next startup.

No code changes required.

---

## Limitations & Guardrails

| Limitation | Detail |
|---|---|
| **No predictions** | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| **Grounded answers only** | If the answer isn't in the course materials, AstroBot is instructed to say so rather than hallucinate. |
| **No chart interpretation** | Questions about specific birth charts are declined. |
| **Index is in-memory** | The FAISS index is rebuilt on every Space restart (~3-5 min cold start). |
| **Context window** | Top-5 chunks are retrieved per query. Adjust `TOP_K` in `config.py`. |
| **Language** | Optimised for English. Other languages may work but are untested. |

---
## Troubleshooting

### Space fails to start

- Check the **Logs** tab in the Space for Python errors.
- Verify the required secrets are set: `GROQ_API_KEY` and `HF_DATASET`, plus `HF_TOKEN` if the dataset is private.

### "GROQ_API_KEY is not set"

- Add the secret in Space → Settings → Repository secrets.

### "No usable text column found"

- Your Parquet dataset doesn't have a column named `text`, `content`, etc.
- Either rename the column in your dataset, or add your column name to `text_column_candidates` in `config.py`.
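
For orientation, text-column detection of this kind can be as simple as the sketch below. It is an assumption about how such a check might work, not the code in `data_loader.py`: the function name `detect_text_column`, the case-insensitive matching, and the two-entry default list (only the names mentioned above) are all illustrative.

```python
TEXT_COLUMN_CANDIDATES = ("text", "content")  # assumed defaults, in priority order


def detect_text_column(column_names, candidates=TEXT_COLUMN_CANDIDATES) -> str:
    """Return the first dataset column whose name matches a candidate."""
    # Map lowercase names back to the original spelling used in the dataset.
    lowered = {name.lower(): name for name in column_names}
    for candidate in candidates:
        match = lowered.get(candidate.lower())
        if match is not None:
            return match
    raise ValueError(
        f"No usable text column found. Columns: {list(column_names)}; "
        f"tried: {list(candidates)}"
    )
```

With a dataset whose text lives in a column called `body`, adding `"body"` to the candidates is enough: `detect_text_column(["source", "page", "body"], candidates=("text", "content", "body"))` returns `"body"`.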
### Answers seem unrelated to the question

- Increase `TOP_K` in `config.py` (try 7-10).
- Decrease `CHUNK_SIZE` (try 256) for finer granularity.
- Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

### Groq rate limit errors

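- The free Groq tier enforces per-model request and token limits, per minute and per day (for example, 14,400 requests/day for `llama-3.1-8b-instant` at the time of writing). Check the rate-limits page in the Groq console for current values. For a class of many students, consider upgrading or rate-limiting the UI.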
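
If you go the route of rate-limiting the UI, a small sliding-window limiter is one option. The sketch below is a generic illustration, not part of AstroBot: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the limits you pass in should come from your own Groq quota. A handler in `app.py` could call `allow()` before each query and show a "please wait" message when it returns `False`.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period_s`-second window."""

    def __init__(self, max_calls: int, period_s: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period_s = period_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()  # timestamps of recently accepted calls

    def allow(self) -> bool:
        now = self.clock()
        # Forget calls that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.period_s:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```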
---