ChatBoog: Space Satellite Assistant - Project Documentation
1) Problem Statement
The goal of this project is simple and public-facing: make satellite data from every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.
We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:
- Source: https://space.skyrocket.de/directories/sat_c.htm
2) Why Scraping (Not API / Dataset)
We evaluated three ways to collect the data:
- Public API - none available for this exact dataset.
- Pre-built dataset - existing datasets are incomplete, often outdated, or locked behind access restrictions.
- Web scraping - free, reliable for this site, and under our control.
We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.
3) Website Structure (How We Navigate the Data)
The site exposes satellite data through a 4-level structure. Understanding this structure is the key to correct scraping.
- Main directory (all countries)
- Contains every country name and link.
- Country page
- Lists all satellite categories for that country.
- Category page
- Lists satellites (names, sometimes operators) with links to the satellite pages.
- Satellite page
- Contains the full detail: description, specifications, launch history, and metadata.
This structure informed the exact scraper design.
4) Data Collection Strategy (Step-by-Step)
We intentionally built the pipeline in small, reliable steps first, then combined it into a fast final scraper.
Step 1 - Countries
We first scraped all countries and their links. This gives the root of the tree.
Step 2 - Categories
For the target country (initially China), we scraped all categories and their links.
Step 3 - Satellites
For each category, we captured:
- Satellite name
- Operator (if available)
- Link to the satellite details page
Step 4 - Satellite Details
We then scraped each satellite detail page to collect:
- Description
- Specifications table
- Launch history
- Images
This breakdown made the process understandable and easier to validate.
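Step 3 can be sketched with BeautifulSoup as a small parser over a category page. The table id (`sats`) and column order below are assumptions for illustration, not the site's verified markup; the real selectors live in the project's scraper scripts.

```python
from bs4 import BeautifulSoup

def parse_category_list(html):
    """Extract name, operator, and detail-page URL from a category page.

    The `sats` table id and two-column layout are hypothetical; verify
    against the live page before relying on them.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table#sats tr"):
        cells = tr.find_all("td")  # header rows have <th>, so they are skipped
        if len(cells) < 2:
            continue
        link = cells[0].find("a")
        rows.append({
            "name": cells[0].get_text(strip=True),
            "operator": cells[1].get_text(strip=True) or None,
            "url": link["href"] if link else None,
        })
    return rows

sample = """
<table id="sats">
  <tr><th>Satellite</th><th>Operator</th></tr>
  <tr><td><a href="/doc_sdat/dfh-1.htm">DFH 1</a></td><td>CAST</td></tr>
</table>
"""
print(parse_category_list(sample))
```

The same pattern (select rows, skip headers, pull text and `href`) repeats for the country and category directory pages.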
5) Selenium First, Then BeautifulSoup
We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:
- Opens a real browser
- Validates what is visible
- Helps debug layout and missing fields
Once the structure was clear, we switched to BeautifulSoup for production because:
- Faster
- Lightweight
- More stable for bulk scraping
The experimental and step scripts are kept in development_logs/ for reference and auditing.
6) Why SQLite for Intermediate Storage
We store scraped links and metadata in SQLite because:
- It is simple and fast for local use
- Easy to query and inspect
- Great for pipeline checkpoints
- No external database required
What is stored in SQLite
We store:
- Countries and their links
- Categories and their links per country
- Satellites with name, category, operator, and detail page URL
This allows us to restart scraping without repeating earlier steps.
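A minimal sketch of that checkpoint schema, using an in-memory database; the real table and column names in the project may differ:

```python
import sqlite3

# Checkpoint schema sketch: countries -> categories -> satellites,
# with a `scraped` flag so interrupted runs can resume.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (name TEXT PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE categories (
    country TEXT REFERENCES countries(name),
    name    TEXT,
    url     TEXT NOT NULL,
    PRIMARY KEY (country, name)
);
CREATE TABLE satellites (
    name     TEXT,
    country  TEXT,
    category TEXT,
    operator TEXT,
    url      TEXT PRIMARY KEY,   -- detail-page URL doubles as a natural key
    scraped  INTEGER DEFAULT 0   -- checkpoint flag: skip rows already fetched
);
""")
conn.execute("INSERT INTO countries VALUES (?, ?)",
             ("China", "https://space.skyrocket.de/directories/sat_c.htm"))

# Restartability: on the next run, only unscraped satellites are fetched.
pending = conn.execute("SELECT url FROM satellites WHERE scraped = 0").fetchall()
print(len(pending))
```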
7) Final Scraper (Production Grade)
The final scraper is in src/full_scraper.py. It:
- Reads all satellite links from SQLite
- Fetches each satellite page using a persistent requests session
- Extracts structured data reliably
- Stores everything into one clean JSON file
What the final JSON contains
Each satellite record includes:
- id
- name
- country
- category
- operator
- url
- description
- specifications (parsed from the #satdata table)
- launch_history (parsed from the #satlist table)
- images
This is the core dataset used for the RAG pipeline.
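For illustration, a record could look like this; the field values here are invented placeholders that only mirror the key set above, not actual scraped output:

```python
import json

# Hypothetical record shape; values are illustrative, keys mirror the schema.
record = {
    "id": 1,
    "name": "DFH 1",
    "country": "China",
    "category": "Technology",
    "operator": "CAST",
    "url": "https://space.skyrocket.de/doc_sdat/dfh-1.htm",
    "description": "First Chinese satellite ...",
    "specifications": {"Mass": "173 kg"},
    "launch_history": [{"date": "1970-04-24", "vehicle": "CZ-1"}],
    "images": [],
}
print(json.dumps(record, indent=2).splitlines()[1])
```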
8) Why JSON for Final Output
We store all satellite details in a single JSON file because:
- The dataset is small enough to load quickly
- It is portable and easy to parse
- It integrates cleanly with RAG chunking tools
This is the final source of truth for embedding and retrieval.
9) RAG Pipeline (Chunking + Embeddings)
Once we have clean satellite data, we convert it into a knowledge base for semantic search.
9.1 Document Formatting
In src/build_rag_index.py, each satellite is converted into a structured Markdown document:
- Title includes satellite name
- Country and operator are included at the top
- Description section
- Specifications section (with explicit key re-labeling for search clarity)
- Launch history section
Why this matters:
- Markdown adds clear structure for chunking
- Injecting the satellite name into each spec line improves semantic matching
- Launch details are normalized into consistent sentences for retrieval
9.2 Chunking Strategy
We split each Markdown document using:
- RecursiveCharacterTextSplitter
- chunk_size=1000, chunk_overlap=200
- Separators: headings, paragraphs, lines, spaces
Why this matters:
- Prevents context overflow
- Keeps related information together
- Improves recall during retrieval
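The size/overlap mechanics can be approximated in plain Python. LangChain's RecursiveCharacterTextSplitter does the real work in the pipeline; this toy version ignores its separator hierarchy and only shows how overlapping windows are produced:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Greedy character chunker with overlap - a stand-in for
    RecursiveCharacterTextSplitter, without separator awareness."""
    chunks, start = [], 0
    step = chunk_size - chunk_overlap  # each window advances by size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

chunks = split_text("x" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks), len(chunks[0]))
```

The overlap means the tail of one chunk is repeated at the head of the next, so facts that straddle a boundary stay retrievable.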
9.3 Embeddings
We use BAAI/bge-small-en-v1.5 because:
- Strong semantic search performance
- Lightweight and fast for local use
- Normalized embeddings improve similarity search
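The normalization point is worth spelling out: for unit-norm vectors, cosine similarity reduces to a plain dot product, which is what makes similarity search cheap. A small sketch with 3-dimensional stand-ins (BGE-small actually emits 384-dimensional vectors):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 4.0, 4.0])   # same direction, different magnitude

# After normalization, cosine similarity == dot product; identical
# directions score 1.0 regardless of original magnitude.
print(round(dot(a, b), 6))
```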
9.4 Vector Storage (ChromaDB)
We store embeddings in ChromaDB with local persistence:
- Collection name: satellites
- Vector size: 384 (matches BGE-small)
- Distance: cosine
Why ChromaDB:
- Native integration with LangChain
- Tailored for local/embedded usage
- Stability: handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces
10) Chatbot Logic (Semantic Retrieval + LLM)
The chatbot runs in Streamlit (src/app.py) and uses:
- ChromaDB retriever for relevant chunks
- Groq LLM (llama-3.3-70b-versatile) for answer generation
Prompt Design (Why it works)
The prompt explicitly enforces:
- Precision with numbers and technical fields
- Honest fallback if data is missing
- Use of provided context only (avoid hallucinations)
This keeps answers grounded and accurate.
11) Testing and Quality Checks
We include tests/test_rag.py to validate:
- Model initialization
- Retrieval quality
- Hallucination resistance for out-of-scope questions
This provides a repeatable sanity check for the RAG system.
12) Deployment and Reproducibility
We support containerized deployment with a "Build-on-Start" strategy to handle large data files:
Lazy Indexing (Self-Healing):
- The application (src/app.py -> src/rag_engine.py) automatically checks on startup whether the ChromaDB index exists and is non-empty.
- If it is empty (first run on cloud), it triggers src/build_rag_index.py to rebuild the index from the JSON data.
- This bypasses the need to push large binary database files (.sqlite3, .bin) to git, avoiding Git LFS quotas and rejection errors.
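The check-and-rebuild logic can be sketched like this; `ensure_index` and the `chroma_db` directory layout are assumptions standing in for src/rag_engine.py, and `rebuild` stands in for the build_index entry point of src/build_rag_index.py:

```python
import tempfile
from pathlib import Path

def ensure_index(db_dir, rebuild):
    """Build-on-start: rebuild the vector index if it is missing or empty."""
    db = Path(db_dir)
    if not db.exists() or not any(db.iterdir()):
        db.mkdir(parents=True, exist_ok=True)
        rebuild(db)          # expensive: re-embed everything from the JSON
        return "rebuilt"
    return "reused"          # cheap: index already on disk

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chroma_db"
    # First run (cloud cold start): directory missing -> rebuild.
    first = ensure_index(target, rebuild=lambda d: (d / "index.bin").touch())
    # Second run: index present -> reuse.
    second = ensure_index(target, rebuild=lambda d: (d / "index.bin").touch())
    print(first, second)
```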
Container Setup:
- Dockerfile sets up the environment, with start.sh handling permissions.
- .gitignore explicitly excludes data/chroma_db to ensure a clean slate for deployment.
- .github/workflows/sync_to_huggingface.yml handles the sync to Hugging Face Spaces.
13) End-to-End Flow (Project Diagram)
Main Page (countries)
|- Country Page
|  |- Category Page
|     |- Satellite List (name + operator + link)
|        |- Satellite Detail Page (full data)
|           |- JSON Output
|           |- SQLite index for tracking
14) What We Have Achieved
We now have:
- A verified scraper pipeline (BeautifulSoup)
- Clean, structured satellite JSON data
- A reproducible RAG pipeline (chunking + embeddings)
- A working Streamlit chat UI
- Docker and Hugging Face deployment readiness
15) Why This Approach Works
- Scalable: We can add more countries easily
- Reliable: Stored checkpoints in SQLite
- Cost-effective: No paid APIs
- Accurate: Data comes directly from the source
- RAG-ready: JSON -> chunking -> embeddings -> ChromaDB
16) Next Steps (Optional)
- Expand scraping from China to all countries
- Add scheduled refresh jobs
- Add evaluation metrics for RAG accuracy
- Add UI filters (country, category)
Files Referenced
- src/full_scraper.py
- src/build_rag_index.py (includes build_index entry point)
- src/rag_engine.py (lazy indexing logic)
- src/app.py
- tests/test_rag.py
- Dockerfile
- start.sh
- README.md
- development_logs/ (contains legacy scripts: diagnose.py, evaluate_rag.py, etc.)