# ChatBoog: Space Satellite Assistant - Project Documentation

## 1) Problem Statement

The goal of this project is simple and public-facing: make satellite data from every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.

We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:

- Source: `https://space.skyrocket.de/directories/sat_c.htm`
## 2) Why Scraping (Not API / Dataset)

We evaluated three ways to collect the data:

1. Public API - none available for this exact dataset.
2. Pre-built dataset - existing ones are incomplete, often outdated, or locked behind licenses.
3. Web scraping - free, reliable for this site, and under our control.

We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.
## 3) Website Structure (How We Navigate the Data)

The site exposes satellite data through a four-level structure. Understanding this structure is the key to correct scraping.

1. Main directory (all countries)
   - Contains every country name and link.
2. Country page
   - Lists all satellite categories for that country.
3. Category page
   - Lists satellites (names, sometimes operators) with links to the satellite pages.
4. Satellite page
   - Contains the full detail: description, specifications, launch history, and metadata.

This structure informed the exact scraper design.
## 4) Data Collection Strategy (Step-by-Step)

We intentionally built the pipeline in small, reliable steps first, then combined them into a fast final scraper.

### Step 1 - Countries

We first scraped all countries and their links. This gives us the root of the tree.

### Step 2 - Categories

For the target country (initially China), we scraped all categories and their links.

### Step 3 - Satellites

For each category, we captured:

- Satellite name
- Operator (if available)
- Link to the satellite details page
### Step 4 - Satellite Details

We then scraped each satellite detail page to collect:

- Description
- Specifications table
- Launch history
- Images

This breakdown made the process understandable and easier to validate.
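The first steps reduce to fetching a page and harvesting its anchors. A minimal sketch of Step 1, assuming the directory page links each country with a plain relative anchor (the selector and URL-joining logic here are illustrative, not the exact code of the production scraper):

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://space.skyrocket.de"

def parse_country_links(html: str) -> dict[str, str]:
    """Extract {country_name: absolute_url} from the main directory page."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for a in soup.select("a[href]"):
        name = a.get_text(strip=True)
        href = a["href"]
        if name and href.endswith(".htm"):  # keep only page links (assumption)
            if not href.startswith("http"):
                href = f"{BASE}/directories/{href}"
            links[name] = href
    return links

def fetch_countries(session: requests.Session) -> dict[str, str]:
    """Step 1: download the main directory and return all country links."""
    resp = session.get(f"{BASE}/directories/sat_c.htm", timeout=30)
    resp.raise_for_status()
    return parse_country_links(resp.text)
```

Steps 2 and 3 follow the same fetch-and-parse pattern, one level deeper each time.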
## 5) Selenium First, Then BeautifulSoup

We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:

- Opens a real browser
- Validates what is actually visible
- Helps debug layout and missing fields

Once the structure was clear, we switched to BeautifulSoup (with `requests`) for production because it is:

- Faster
- Lightweight
- More stable for bulk scraping

The experimental and step-by-step scripts are kept in `development_logs/` for reference and auditing.
## 6) Why SQLite for Intermediate Storage

We store scraped links and metadata in SQLite because:

- It is simple and fast for local use
- It is easy to query and inspect
- It works well for pipeline checkpoints
- It requires no external database server

### What is stored in SQLite

We store:

- Countries and their links
- Categories and their links, per country
- Satellites with name, category, operator, and detail page URL

This allows us to restart scraping without repeating earlier steps.
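As an illustration, a checkpoint schema along these lines supports idempotent re-runs (the actual table names and columns in the project may differ):

```python
import sqlite3

# Hypothetical schema mirroring the three stored levels.
SCHEMA = """
CREATE TABLE IF NOT EXISTS countries (
    name TEXT PRIMARY KEY,
    url  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS categories (
    country TEXT NOT NULL REFERENCES countries(name),
    name    TEXT NOT NULL,
    url     TEXT NOT NULL,
    PRIMARY KEY (country, name)
);
CREATE TABLE IF NOT EXISTS satellites (
    name     TEXT NOT NULL,
    country  TEXT NOT NULL,
    category TEXT NOT NULL,
    operator TEXT,
    url      TEXT NOT NULL,
    PRIMARY KEY (country, category, name)
);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint database, creating tables on first use."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With `INSERT OR IGNORE` (or an equivalent upsert), re-running a step simply skips rows that are already recorded, which is what makes restarts cheap.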
## 7) Final Scraper (Production Grade)

The final scraper is in `src/full_scraper.py`. It:

- Reads all satellite links from SQLite
- Fetches each satellite page using a persistent `requests` session
- Extracts structured data reliably
- Stores everything in one clean JSON file

### What the final JSON contains

Each satellite record includes:

- `id`, `name`, `country`, `category`, `operator`, `url`
- `description`
- `specifications` (parsed from the `#satdata` table)
- `launch_history` (parsed from the `#satlist` table)
- `images`

This is the core dataset used for the RAG pipeline.
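For the `specifications` field, the extraction might look like this sketch, which assumes the `#satdata` table uses simple label/value rows (the live markup may nest differently):

```python
from bs4 import BeautifulSoup

def parse_specifications(html: str) -> dict[str, str]:
    """Parse the #satdata table into a {field: value} dict."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="satdata")
    specs = {}
    if table is None:
        return specs  # page has no spec table; return empty rather than fail
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        if len(cells) >= 2:
            key = cells[0].get_text(strip=True).rstrip(":")
            value = cells[1].get_text(" ", strip=True)
            if key:
                specs[key] = value
    return specs
```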
## 8) Why JSON for Final Output

We store all satellite details in a single JSON file because:

- The dataset is small enough to load quickly
- It is portable and easy to parse
- It integrates cleanly with RAG chunking tools

This file is the final source of truth for embedding and retrieval.
## 9) RAG Pipeline (Chunking + Embeddings)

Once we have clean satellite data, we convert it into a knowledge base for semantic search.

### 9.1 Document Formatting

In `src/build_rag_index.py`, each satellite is converted into a structured Markdown document:

- The title includes the satellite name
- Country and operator appear at the top
- Description section
- Specifications section (with explicit key re-labeling for search clarity)
- Launch history section

Why this matters:

- Markdown adds clear structure for chunking
- Injecting the satellite name into each spec line improves semantic matching
- Launch details are normalized into consistent sentences for retrieval
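A simplified version of this formatting step (field names follow the JSON schema from Section 7; the exact wording of the injected lines is an illustrative choice, not the project's precise template):

```python
def satellite_to_markdown(sat: dict) -> str:
    """Render one satellite record as a Markdown document for chunking."""
    name = sat["name"]
    lines = [
        f"# Satellite: {name}",
        f"**Country:** {sat.get('country', 'Unknown')}",
        f"**Operator:** {sat.get('operator') or 'Unknown'}",
        "",
        "## Description",
        sat.get("description", ""),
        "",
        "## Specifications",
    ]
    # Inject the satellite name into every spec line so each chunk
    # remains self-identifying after splitting.
    for key, value in sat.get("specifications", {}).items():
        lines.append(f"- {name} {key}: {value}")
    return "\n".join(lines)
```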
### 9.2 Chunking Strategy

We split each Markdown document using:

- `RecursiveCharacterTextSplitter`
- `chunk_size=1000`, `chunk_overlap=200`
- Separators: headings, paragraphs, lines, spaces

Why this matters:

- Prevents context overflow
- Keeps related information together
- Improves recall during retrieval
### 9.3 Embeddings

We use `BAAI/bge-small-en-v1.5` because:

- Strong semantic search performance
- Lightweight and fast for local use
- Normalized embeddings improve similarity search
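Normalization is what makes cosine similarity cheap downstream: on unit-length vectors, cosine reduces to a dot product. A sketch (the model download happens on first use, so the encode call sits behind a lazy import; `embed_texts` is a hypothetical helper name):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (cosine similarity becomes a dot product)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed chunks with BGE-small; returns 384-dim unit vectors."""
    from sentence_transformers import SentenceTransformer  # lazy: triggers model download
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    return model.encode(texts, normalize_embeddings=True).tolist()
```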
### 9.4 Vector Storage (ChromaDB)

We store embeddings in ChromaDB with local persistence:

- Collection name: `satellites`
- Vector size: 384 (matches BGE-small)
- Distance: cosine

Why ChromaDB:

- **Integration:** Native LangChain support
- **Simplicity:** Tailored for local/embedded usage
- **Stability:** Handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces
## 10) Chatbot Logic (Semantic Retrieval + LLM)

The chatbot runs in Streamlit (`src/app.py`) and uses:

- The ChromaDB retriever to fetch relevant chunks
- The Groq LLM (`llama-3.3-70b-versatile`) for answer generation

### Prompt Design (Why It Works)

The prompt explicitly enforces:

- Precision with numbers and technical fields
- An honest fallback when data is missing
- Use of the provided context only (to avoid hallucinations)

This keeps answers grounded and accurate.
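The wording below is an illustrative reconstruction, not the project's exact prompt; it encodes the three rules listed above:

```python
# Hypothetical prompt template in the spirit of src/app.py.
SYSTEM_PROMPT = """You are a satellite data assistant.
Answer ONLY from the context below.
Quote numbers and technical fields exactly as written.
If the context does not contain the answer, say you don't know."""

def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble the grounded prompt sent to the LLM."""
    context = "\n\n---\n\n".join(context_chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```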
## 11) Testing and Quality Checks

We include `tests/test_rag.py` to validate:

- Model initialization
- Retrieval quality
- Hallucination resistance for out-of-scope questions

This provides a repeatable sanity check for the RAG system.
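A sketch of the out-of-scope check (here `ask` stands in for the real chat function, and the refusal markers are illustrative):

```python
REFUSAL_MARKERS = ("don't know", "not in the provided", "no information")

def looks_like_refusal(answer: str) -> bool:
    """Heuristic: does the answer decline rather than invent facts?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def check_out_of_scope(ask) -> bool:
    # A question unrelated to satellites should trigger a refusal,
    # not a hallucinated answer.
    return looks_like_refusal(ask("What is the capital of France?"))
```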
## 12) Deployment and Reproducibility

We support containerized deployment with a **"Build-on-Start"** strategy to handle large data files:

1. **Lazy Indexing (Self-Healing):**
   - The application (`src/app.py` -> `src/rag_engine.py`) automatically checks on startup whether the ChromaDB index exists and is non-empty.
   - If it is empty (first run on the cloud), it triggers `src/build_rag_index.py` to rebuild the index from the JSON data.
   - This removes the need to push large binary database files (`.sqlite3`, `.bin`) to git, avoiding Git LFS quotas and rejection errors.
2. **Container Setup:**
   - `Dockerfile` sets up the environment, with `start.sh` handling permissions.
   - `.gitignore` explicitly excludes `data/chroma_db` to ensure a clean slate for deployment.
   - `.github/workflows/sync_to_huggingface.yml` handles the sync to Hugging Face Spaces.
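The self-healing check reduces to a few lines; `build_fn` and `count_fn` below are hypothetical stand-ins for the `build_rag_index.build_index` entry point and a Chroma `collection.count()` call:

```python
import os

def ensure_index(persist_dir: str, build_fn, count_fn) -> None:
    """Rebuild the vector index only when it is missing or empty."""
    if not os.path.isdir(persist_dir) or count_fn() == 0:
        build_fn()  # e.g. re-run the chunk + embed + store pipeline
```

Because the check runs on every startup, a wiped or never-populated volume heals itself without any manual deploy step.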
## 13) End-to-End Flow (Project Diagram)

```
Main Page (countries)
  |- Country Page
       |- Category Page
            |- Satellite List (name + operator + link)
                 |- Satellite Detail Page (full data)
                      |- JSON Output (final dataset)
                      |- SQLite index (tracking / checkpoints)
```
## 14) What We Have Achieved

We now have:

- A verified scraper pipeline (BeautifulSoup)
- Clean, structured satellite JSON data
- A reproducible RAG pipeline (chunking + embeddings)
- A working Streamlit chat UI
- Docker and Hugging Face deployment readiness

## 15) Why This Approach Works

- Scalable: we can add more countries easily
- Reliable: checkpoints are stored in SQLite
- Cost-effective: no paid APIs
- Accurate: data comes directly from the source
- RAG-ready: JSON -> chunking -> embeddings -> ChromaDB
## 16) Next Steps (Optional)

- Expand scraping from China to all countries
- Add scheduled refresh jobs
- Add evaluation metrics for RAG accuracy
- Add UI filters (country, category)
---

### Files Referenced

- `src/full_scraper.py`
- `src/build_rag_index.py` (includes the `build_index` entry point)
- `src/rag_engine.py` (lazy indexing logic)
- `src/app.py`
- `tests/test_rag.py`
- `Dockerfile`
- `start.sh`
- `README.md`
- `development_logs/` (legacy scripts: `diagnose.py`, `evaluate_rag.py`, etc.)