
ChatBoog: Space Satellite Assistant - Project Documentation

1) Problem Statement

The goal of this project is simple: make satellite data for every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.

We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:

  • Source: https://space.skyrocket.de/directories/sat_c.htm

2) Why Scraping (Not API / Dataset)

We evaluated three ways to collect the data:

  1. Public API - none available for this exact dataset.
  2. Pre-built dataset - not complete and often outdated or locked.
  3. Web scraping - free, reliable for this site, and under our control.

We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.

3) Website Structure (How We Navigate the Data)

The site exposes satellite data through a 4-level structure. Understanding this structure is the key to correct scraping.

  1. Main directory (all countries)
    • Contains every country name and link.
  2. Country page
    • Lists all satellite categories for that country.
  3. Category page
    • Lists satellites (names, sometimes operators) with links to the satellite pages.
  4. Satellite page
    • Contains the full detail: description, specifications, launch history, and metadata.

This structure informed the exact scraper design.

4) Data Collection Strategy (Step-by-Step)

We intentionally built the pipeline in small, reliable steps first, then combined it into a fast final scraper.

Step 1 - Countries

We first scraped all countries and their links. This gives the root of the tree.
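
A minimal BeautifulSoup sketch of this step might look like the following. The `doc_sat/` href pattern and the CSS selector are assumptions about the site's markup, not the exact code in the repo; verify them against the live page:

```python
from bs4 import BeautifulSoup

BASE_URL = "https://space.skyrocket.de"

def parse_country_links(html: str) -> list[dict]:
    """Extract country names and absolute URLs from the main directory page.

    The `doc_sat/` href pattern is an assumption about the site's link
    layout; check it against the live directory before relying on it.
    """
    soup = BeautifulSoup(html, "html.parser")
    countries = []
    for a in soup.select("a[href*='doc_sat/']"):
        name = a.get_text(strip=True)
        if name:
            countries.append(
                {"name": name, "url": f"{BASE_URL}/{a['href'].lstrip('/')}"}
            )
    return countries
```

The HTML itself would be fetched once with `requests` and passed in, which keeps the parsing logic testable without network access.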

Step 2 - Categories

For the target country (initially China), we scraped all categories and their links.

Step 3 - Satellites

For each category, we captured:

  • Satellite name
  • Operator (if available)
  • Link to the satellite details page

Step 4 - Satellite Details

We then scraped each satellite detail page to collect:

  • Description
  • Specifications table
  • Launch history
  • Images

This breakdown made the process understandable and easier to validate.

5) Selenium First, Then BeautifulSoup

We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:

  • Opens a real browser
  • Validates what is visible
  • Helps debug layout and missing fields

Once the structure was clear, we switched to BeautifulSoup for production because:

  • Faster
  • Lightweight
  • More stable for bulk scraping

The experimental and step scripts are kept in development_logs/ for reference and auditing.

6) Why SQLite for Intermediate Storage

We store scraped links and metadata in SQLite because:

  • It is simple and fast for local use
  • Easy to query and inspect
  • Great for pipeline checkpoints
  • No external database required

What is stored in SQLite

We store:

  • Countries and their links
  • Categories and their links per country
  • Satellites with name, category, operator, and detail page URL

This allows us to restart scraping without repeating earlier steps.
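
A plausible checkpoint schema for this layout is sketched below; the actual table and column names in the repo may differ:

```python
import sqlite3

# Hypothetical checkpoint schema; the repo's real table/column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS countries (
    name TEXT PRIMARY KEY,
    url  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS categories (
    country TEXT NOT NULL,
    name    TEXT NOT NULL,
    url     TEXT NOT NULL,
    PRIMARY KEY (country, name)
);
CREATE TABLE IF NOT EXISTS satellites (
    name     TEXT NOT NULL,
    country  TEXT NOT NULL,
    category TEXT NOT NULL,
    operator TEXT,
    url      TEXT NOT NULL UNIQUE
);
"""

def open_db(path: str = "data/scraper.db") -> sqlite3.Connection:
    """Open (or create) the checkpoint database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Because every row carries its source URL, a restarted run can skip any satellite whose detail page has already been fetched.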

7) Final Scraper (Production Grade)

The final scraper is in src/full_scraper.py. It:

  • Reads all satellite links from SQLite
  • Fetches each satellite page using a persistent requests session
  • Extracts structured data reliably
  • Stores everything into one clean JSON file

What the final JSON contains

Each satellite record includes:

  • id, name, country, category, operator, url
  • description
  • specifications (parsed from the #satdata table)
  • launch_history (parsed from the #satlist table)
  • images

This is the core dataset used for the RAG pipeline.
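
Because the spec table has a stable id (`#satdata`), extraction reduces to a straightforward table walk. A simplified sketch (src/full_scraper.py may handle more edge cases such as row spans and footnotes):

```python
from bs4 import BeautifulSoup

def parse_spec_table(html: str) -> dict:
    """Turn the #satdata key/value table on a satellite page into a dict.

    Simplified sketch relative to src/full_scraper.py; assumes one
    key cell and one value cell per row.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="satdata")
    specs = {}
    if table is None:
        return specs
    for row in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) >= 2 and cells[0]:
            specs[cells[0].rstrip(":")] = cells[1]
    return specs
```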

8) Why JSON for Final Output

We store all satellite details in a single JSON file because:

  • The dataset is small enough to load quickly
  • It is portable and easy to parse
  • It integrates cleanly with RAG chunking tools

This is the final source of truth for embedding and retrieval.

9) RAG Pipeline (Chunking + Embeddings)

Once we have clean satellite data, we convert it into a knowledge base for semantic search.

9.1 Document Formatting

In src/build_rag_index.py, each satellite is converted into a structured Markdown document:

  • Title includes satellite name
  • Country and operator are included at the top
  • Description section
  • Specifications section (with explicit key re-labeling for search clarity)
  • Launch history section

Why this matters:

  • Markdown adds clear structure for chunking
  • Injecting the satellite name into each spec line improves semantic matching
  • Launch details are normalized into consistent sentences for retrieval
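
The formatting step can be sketched like this; the field names mirror the JSON described above, but the exact Markdown layout produced by src/build_rag_index.py may differ:

```python
def satellite_to_markdown(sat: dict) -> str:
    """Render one satellite record as a Markdown document for chunking.

    Note how the satellite name is injected into every spec line, so a
    chunk containing only specs still matches queries about that satellite.
    Launch entries are treated as preformatted strings here for simplicity.
    """
    lines = [
        f"# {sat['name']}",
        f"Country: {sat.get('country', 'Unknown')} | Operator: {sat.get('operator', 'Unknown')}",
        "",
        "## Description",
        sat.get("description", "No description available."),
        "",
        "## Specifications",
    ]
    for key, value in sat.get("specifications", {}).items():
        lines.append(f"- {sat['name']} {key}: {value}")
    lines += ["", "## Launch History"]
    for launch in sat.get("launch_history", []):
        lines.append(f"- {sat['name']} was launched on {launch}.")
    return "\n".join(lines)
```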

9.2 Chunking Strategy

We split each Markdown document using:

  • RecursiveCharacterTextSplitter
  • chunk_size=1000, chunk_overlap=200
  • Separators: headings, paragraphs, lines, spaces

Why this matters:

  • Prevents context overflow
  • Keeps related information together
  • Improves recall during retrieval

9.3 Embeddings

We use BAAI/bge-small-en-v1.5 because:

  • Strong semantic search performance
  • Lightweight and fast for local use
  • Normalized embeddings improve similarity search

9.4 Vector Storage (ChromaDB)

We store embeddings in ChromaDB with local persistence:

  • Collection name: satellites
  • Vector size: 384 (matches BGE-small)
  • Distance: cosine

Why ChromaDB:

  • Native integration with LangChain
  • Tailored for local/embedded usage
  • Stability: handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces

10) Chatbot Logic (Semantic Retrieval + LLM)

The chatbot runs in Streamlit (src/app.py) and uses:

  • ChromaDB retriever for relevant chunks
  • Groq LLM (llama-3.3-70b-versatile) for answer generation

Prompt Design (Why it works)

The prompt explicitly enforces:

  • Precision with numbers and technical fields
  • Honest fallback if data is missing
  • Use of provided context only (avoid hallucinations)

This keeps answers grounded and accurate.
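
The three rules can be encoded directly in the prompt template; the wording below is illustrative, not the exact text in src/app.py:

```python
PROMPT_TEMPLATE = """You are a satellite data assistant.
Rules:
1. Be precise with numbers, dates, and technical specifications.
2. Answer ONLY from the context below; do not use outside knowledge.
3. If the context does not contain the answer, say:
   "I don't have that information in my knowledge base."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble the final prompt from retrieved chunks and the user question."""
    return PROMPT_TEMPLATE.format(
        context="\n\n---\n\n".join(chunks), question=question
    )
```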

11) Testing and Quality Checks

We include tests/test_rag.py to validate:

  • Model initialization
  • Retrieval quality
  • Hallucination resistance for out-of-scope questions

This provides a repeatable sanity check for the RAG system.

12) Deployment and Reproducibility

We support containerized deployment with a "Build-on-Start" strategy to handle large data files:

  1. Lazy Indexing (Self-Healing):

    • The application (src/app.py -> src/rag_engine.py) checks on startup whether the ChromaDB index exists and is non-empty.
    • If it is empty (e.g. on the first run in the cloud), it triggers src/build_rag_index.py to rebuild the index from the JSON data.
    • This removes the need to commit large binary database files (.sqlite3, .bin) to git, avoiding Git LFS quotas and rejected pushes.
  2. Container Setup:

    • The Dockerfile sets up the environment, including a start.sh script that handles file permissions.
    • .gitignore explicitly excludes data/chroma_db to ensure a clean slate for deployment.
    • .github/workflows/sync_to_huggingface.yml handles the sync to Hugging Face Spaces.

13) End-to-End Flow (Project Diagram)

Main Page (countries)
|- Country Page
|  |- Category Page
|     |- Satellite List (name + operator + link)
|     |  |- Satellite Detail Page (full data)
|     |     |- JSON Output
|     |- SQLite index for tracking

14) What We Have Achieved

We now have:

  • A verified scraper pipeline (BeautifulSoup)
  • Clean, structured satellite JSON data
  • A reproducible RAG pipeline (chunking + embeddings)
  • A working Streamlit chat UI
  • Docker and Hugging Face deployment readiness

15) Why This Approach Works

  • Scalable: We can add more countries easily
  • Reliable: Stored checkpoints in SQLite
  • Cost-effective: No paid APIs
  • Accurate: Data comes directly from the source
  • RAG-ready: JSON -> chunking -> embeddings -> ChromaDB

16) Next Steps (Optional)

  • Expand scraping from China to all countries
  • Add scheduled refresh jobs
  • Add evaluation metrics for RAG accuracy
  • Add UI filters (country, category)

Files Referenced

  • src/full_scraper.py
  • src/build_rag_index.py (includes build_index entry point)
  • src/rag_engine.py (lazy indexing logic)
  • src/app.py
  • tests/test_rag.py
  • Dockerfile
  • start.sh
  • README.md
  • development_logs/ (Contains legacy scripts: diagnose.py, evaluate_rag.py, etc.)
