
🔭 AstroBot: RAG-Powered Educational AI System

AstroBot is a modular Retrieval-Augmented Generation (RAG) architecture designed for domain-specific educational Q&A. It demonstrates:

  • End-to-end PDF ingestion → structured Parquet datasets
  • Semantic indexing with FAISS
  • Context-grounded LLM responses via Groq (LLaMA-3)
  • Modular architecture enabling easy LLM or vector DB swapping
  • Public deployment on Hugging Face Spaces (CI/CD via git push)


Table of Contents

  1. Project Overview
  2. Tech Stack
  3. Architecture
  4. File Structure
  5. Module Responsibilities
  6. Data Pipeline
  7. Setup & Deployment
  8. Environment Variables
  9. How to Add New Course Materials
  10. Limitations & Guardrails
  11. Troubleshooting

Project Overview

AstroBot is a Retrieval-Augmented Generation (RAG) chatbot deployed on Hugging Face Spaces.
It is designed as an educational companion for astrology students, allowing them to ask natural-language questions about astrological concepts and receive accurate, grounded answers drawn exclusively from course textbooks and materials.

Tech Stack

| Layer | Technology | Why |
|---|---|---|
| LLM | Groq + LLaMA-3.1-8b-instant | Fast open-model inference; generous free tier |
| Vector DB | FAISS (CPU) | No external service needed; runs inside the Space |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 | Lightweight, accurate, runs locally |
| Dataset | HF Datasets (Parquet) | Native HF Hub format; handles large PDFs well |
| Framework | LangChain | Chunking utilities and Document schema |
| UI | Gradio 4 | Native to HF Spaces; quick to build, mobile-friendly |
| Hosting | Hugging Face Spaces | Free CPU/GPU hosting; CI/CD via git push |

What it does

  • Answers questions about planets, houses, signs, aspects, transits, chart elements, and astrological theory.
  • Grounds every answer in actual course material (no hallucination of unsupported facts).
  • Clearly declines to make personal predictions or interpret individual birth charts.

What it does NOT do

  • Make predictions of any kind.
  • Interpret a specific person's chart.
  • Answer questions unrelated to astrology concepts.
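These refusals are enforced at the prompt level in llm.py. The sketch below is purely illustrative of how such a guardrailed system prompt might be assembled; the actual wording and function names in llm.py may differ:

```python
# Hypothetical guardrail prompt; the real wording lives in llm.py.
def build_system_prompt(context: str) -> str:
    """Assemble a system prompt that grounds the model in retrieved context."""
    return (
        "You are AstroBot, an educational assistant for astrology students.\n"
        "Answer ONLY from the course material below. If the answer is not "
        "in the material, say you don't know.\n"
        "Never make personal predictions or interpret individual birth charts.\n\n"
        f"Course material:\n{context}"
    )

prompt = build_system_prompt("Example chunk about planetary aspects.")
```

Because the retrieved chunks are injected into the same prompt, the model is steered toward grounded answers and away from the forbidden behaviours in a single place.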

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        OFFLINE (once)                           │
│                                                                 │
│  Astrology PDFs ──► convert_pdfs.py ──► HF Dataset (Parquet)    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HF SPACE (at startup)                        │
│                                                                 │
│  data_loader.py                                                 │
│    └── load_dataset() from HF Hub ──► list[Document]            │
│                                                                 │
│  vector_store.py                                                │
│    ├── RecursiveCharacterTextSplitter ──► Chunks                │
│    ├── HuggingFaceEmbeddings (MiniLM-L6) ──► Vectors            │
│    └── FAISS.from_documents() ──► Index                         │
│                                                                 │
│  llm.py                                                         │
│    └── Groq(api_key) ──► Groq Client                            │
│                                                                 │
│  rag_pipeline.py                                                │
│    └── RAGPipeline(index, groq_client) ──► Ready                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    HF SPACE (per query)                         │
│                                                                 │
│  Student Question                                               │
│       │                                                         │
│       ▼                                                         │
│  rag_pipeline.query()                                           │
│       ├── vector_store.retrieve()  ──► Top-K Chunks             │
│       └── llm.generate_answer()   ──► Grounded Answer           │
│                                                                 │
│  app.py  ──►  Gradio UI  ──►  Student sees answer               │
└─────────────────────────────────────────────────────────────────┘

File Structure

astrobot/
│
├── app.py              # Gradio UI - entry point for HF Spaces
├── config.py           # All configuration (env vars, hyperparameters)
├── data_loader.py      # HF dataset fetching + Document creation
├── vector_store.py     # Chunking, embedding, FAISS index
├── llm.py              # Groq client + prompt engineering
├── rag_pipeline.py     # Orchestrates retrieval → generation
│
├── convert_pdfs.py     # Offline helper: PDFs → HF Parquet dataset
├── requirements.txt    # Python dependencies
└── PROJECT.md          # This file

Module Responsibilities

| Module | Single Responsibility |
|---|---|
| config.py | Central source of truth for all settings; change a parameter once, here. |
| data_loader.py | Fetch data from HF Hub; detect text column; return list[Document]. |
| vector_store.py | Chunk text; embed with sentence-transformers; build & query FAISS index. |
| llm.py | Validate Groq key; build system prompt; call Groq API; return answer string. |
| rag_pipeline.py | Glue layer: validate query → retrieve → generate → return RAGResponse. |
| app.py | UI only: Gradio layout, event wiring, error display. No business logic. |
| convert_pdfs.py | One-time offline script: extract PDF pages → push Parquet to HF Hub. |

This separation means:

  • You can swap FAISS → Pinecone by editing only vector_store.py.
  • You can swap Groq → OpenAI by editing only llm.py.
  • You can change the system prompt (persona, guardrails) in only llm.py.
  • You can replace the UI without touching any backend logic.
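The glue layer only needs a retriever and a generator, which is what makes the swaps above cheap. A minimal sketch of what rag_pipeline.py might look like (class and field names are illustrative, not the actual code):

```python
from dataclasses import dataclass, field

# Illustrative sketch of the glue layer; real names in rag_pipeline.py may differ.
@dataclass
class RAGResponse:
    answer: str
    sources: list = field(default_factory=list)

class RAGPipeline:
    def __init__(self, vector_store, llm):
        self.vector_store = vector_store  # must expose .retrieve(query) -> chunks
        self.llm = llm                    # must expose .generate_answer(query, chunks)

    def query(self, question: str) -> RAGResponse:
        # Validate, retrieve, generate - the whole per-query flow.
        if not question.strip():
            return RAGResponse(answer="Please ask a question about astrology concepts.")
        chunks = self.vector_store.retrieve(question)
        answer = self.llm.generate_answer(question, chunks)
        return RAGResponse(answer=answer, sources=chunks)
```

Because the pipeline only depends on those two method signatures, a Pinecone-backed store or an OpenAI-backed generator can be dropped in without touching this file.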

Data Pipeline

Step 1 β€” Prepare your PDFs (run locally)

Place your astrology textbook PDFs in a folder and run:

pip install pypdf datasets huggingface-hub
python convert_pdfs.py \
    --pdf_dir  ./astrology_books \
    --repo_id  YOUR_USERNAME/astrology-course-materials \
    --private          # optional

This will:

  1. Extract text from each PDF page-by-page.
  2. Build a datasets.Dataset with columns: source, page, text.
  3. Push it to HF Hub as a Parquet-backed dataset.
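In outline, the conversion builds one row per page. Below is a simplified, stdlib-only sketch of the row-building logic with the source/page/text schema; the real convert_pdfs.py uses pypdf for extraction and the datasets library (presumably push_to_hub) for upload, and the helper name here is hypothetical:

```python
# Simplified sketch: turn extracted page texts into rows matching the
# source/page/text schema the Space expects. The real script extracts
# text with pypdf and pushes a datasets.Dataset to the Hub.
def rows_from_pages(source: str, pages: list[str]) -> list[dict]:
    rows = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        if text:  # skip blank pages (e.g. scanned images with no text layer)
            rows.append({"source": source, "page": page_number, "text": text})
    return rows

rows = rows_from_pages("planets_intro.pdf", ["Mars rules Aries.", "", "Houses..."])
```

Keeping the original page number (rather than renumbering after dropping blanks) makes it possible to trace any retrieved chunk back to its exact page in the textbook.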

Step 2 β€” Connect to the Space

Set HF_DATASET=YOUR_USERNAME/astrology-course-materials in Space secrets (see below).

Step 3 β€” What happens at startup

load_dataset()                   # ~30s for large datasets
RecursiveCharacterTextSplitter   # chunk_size=512, overlap=64
HuggingFaceEmbeddings            # ~60s to encode all chunks
FAISS.from_documents()           # <5s

The index is built once per Space restart and held in memory.
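To illustrate what chunk_size=512 with overlap=64 means in practice, here is a deliberately naive sliding-window simplification; the real code uses LangChain's RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries:

```python
def sliding_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-window chunking; LangChain's splitter is boundary-aware."""
    step = chunk_size - overlap  # each window starts 448 chars after the last
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means the last 64 characters of one chunk reappear at the start of the next, so a sentence that straddles a chunk boundary is still retrievable as a whole from at least one chunk.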


Setup & Deployment

1. Create a Hugging Face Space

2. Upload files

Upload these files to the Space repository:

app.py
config.py
data_loader.py
vector_store.py
llm.py
rag_pipeline.py
requirements.txt

3. Set secrets

Go to Space → Settings → Repository secrets → New secret

| Secret Name | Value |
|---|---|
| GROQ_API_KEY | From console.groq.com → API Keys |
| HF_DATASET | your-username/your-dataset-name |
| HF_TOKEN | Your HF token (only needed for private datasets) |

4. Done

The Space will auto-rebuild. Startup takes ~3–5 minutes (embedding model download + indexing).


Environment Variables

All variables are read in config.py. You can also set them locally for development:

export GROQ_API_KEY="gsk_..."
export HF_DATASET="yourname/astrology-course-materials"
export HF_TOKEN=""          # leave blank for public datasets

python app.py
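A minimal sketch of how config.py might read these values; the three secret names are real, but reading the hyperparameters from the environment (and their variable names beyond CHUNK_SIZE, CHUNK_OVERLAP, and TOP_K mentioned elsewhere in this document) is an assumption:

```python
import os

# Sketch of config.py; defaults mirror the values quoted in this document.
GROQ_API_KEY = os.getenv("GROQ_API_KEY", "")
HF_DATASET = os.getenv("HF_DATASET", "")
HF_TOKEN = os.getenv("HF_TOKEN") or None  # empty/unset means anonymous (public) access

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "512"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "64"))
TOP_K = int(os.getenv("TOP_K", "5"))
```

On HF Spaces, repository secrets are injected as environment variables at runtime, so the same code works unchanged locally and in the Space.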

How to Add New Course Materials

  1. Add the new PDF(s) to your ./astrology_books/ folder.
  2. Re-run convert_pdfs.py (it will overwrite the existing dataset).
  3. Restart the HF Space; it will re-index on next startup.

No code changes required.


Limitations & Guardrails

| Limitation | Detail |
|---|---|
| No predictions | The system prompt explicitly forbids AstroBot from making personal predictions. This is enforced at the prompt level. |
| Grounded answers only | If the answer isn't in the course materials, AstroBot says so rather than hallucinating. |
| No chart interpretation | Questions about specific birth charts are declined. |
| Index is in-memory | The FAISS index is rebuilt on every Space restart (~3–5 min cold start). |
| Context window | Top-5 chunks are retrieved per query. Adjust TOP_K in config.py. |
| Language | Optimised for English. Other languages may work but are untested. |

Troubleshooting

Space fails to start

  • Check Logs tab in the Space for Python errors.
  • Verify the required secrets are set (GROQ_API_KEY, HF_DATASET, and HF_TOKEN if your dataset is private).

"GROQ_API_KEY is not set"

  • Add the secret in Space → Settings → Repository secrets.

"No usable text column found"

  • Your Parquet dataset doesn't have a column named text, content, etc.
  • Either rename the column in your dataset, or add your column name to text_column_candidates in config.py.
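The detection step in data_loader.py presumably walks a list of candidate names and takes the first match. A sketch of that logic (candidate names other than text and content are illustrative):

```python
# Illustrative sketch of the text-column detection in data_loader.py.
TEXT_COLUMN_CANDIDATES = ["text", "content", "body", "page_text"]

def find_text_column(column_names: list[str]) -> str:
    for candidate in TEXT_COLUMN_CANDIDATES:
        if candidate in column_names:
            return candidate
    raise ValueError(
        f"No usable text column found; expected one of {TEXT_COLUMN_CANDIDATES}. "
        "Rename your column or extend text_column_candidates in config.py."
    )
```

If your dataset uses an unusual column name, extending the candidate list is less invasive than re-uploading the dataset with a renamed column.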

Answers seem unrelated to the question

  • Increase TOP_K in config.py (try 7–10).
  • Decrease CHUNK_SIZE (try 256) for finer granularity.
  • Check that your PDFs are text-extractable (not scanned images). Use OCR first if needed.

Groq rate limit errors

  • Free Groq tier: 14,400 tokens/minute. For a class of many students, consider upgrading or rate-limiting the UI.
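One simple mitigation, not part of the current codebase, is a token-bucket throttle in front of the Groq call so a burst of student questions degrades gracefully instead of hitting the provider's rate limit:

```python
import time

class TokenBucket:
    """Allow roughly `rate` tokens per `per` seconds, smoothing out bursts."""
    def __init__(self, rate: float, per: float = 60.0):
        self.capacity = rate
        self.tokens = rate
        self.refill_per_second = rate / per
        self.last = time.monotonic()

    def allow(self, cost: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should queue the request or show a "try again" message
```

A bucket sized to the provider's stated per-minute budget, with cost estimated from prompt length, keeps the Space from returning raw rate-limit errors to students.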