
ChatBoog: Space Satellite Assistant - Project Documentation

1) Problem Statement

The goal of this project is simple: make satellite data for every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.

We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:

  • Source: https://space.skyrocket.de/directories/sat_c.htm

2) Why Scraping (Not API / Dataset)

We evaluated three ways to collect the data:

  1. Public API - none available for this exact dataset.
  2. Pre-built dataset - not complete and often outdated or locked.
  3. Web scraping - free, reliable for this site, and under our control.

We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.

3) Website Structure (How We Navigate the Data)

The site exposes satellite data through a 4-level structure. Understanding this structure is the key to correct scraping.

  1. Main directory (all countries)
    • Contains every country name and link.
  2. Country page
    • Lists all satellite categories for that country.
  3. Category page
    • Lists satellites (names, sometimes operators) with links to the satellite pages.
  4. Satellite page
    • Contains the full detail: description, specifications, launch history, and metadata.

This structure informed the exact scraper design.

4) Data Collection Strategy (Step-by-Step)

We intentionally built the pipeline in small, reliable steps first, then combined it into a fast final scraper.

Step 1 - Countries

We first scraped all countries and their links. This gives the root of the tree.
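
A minimal BeautifulSoup sketch of this step might look like the following. The `doc_sat/` href pattern and the CSS selector are assumptions about the site's markup, not the exact code in the repo; verify them against the live page:

```python
from bs4 import BeautifulSoup

BASE_URL = "https://space.skyrocket.de"

def parse_country_links(html: str) -> list[dict]:
    """Extract country names and absolute URLs from the main directory page.

    The `doc_sat/` href pattern is an assumption about the site's link
    layout; check it against the live directory before relying on it.
    """
    soup = BeautifulSoup(html, "html.parser")
    countries = []
    for a in soup.select("a[href*='doc_sat/']"):
        name = a.get_text(strip=True)
        if name:
            countries.append(
                {"name": name, "url": f"{BASE_URL}/{a['href'].lstrip('/')}"}
            )
    return countries
```

The HTML itself would be fetched once with `requests` and passed in, which keeps the parsing logic testable without network access.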

Step 2 - Categories

For the target country (initially China), we scraped all categories and their links.

Step 3 - Satellites

For each category, we captured:

  • Satellite name
  • Operator (if available)
  • Link to the satellite details page

Step 4 - Satellite Details

We then scraped each satellite detail page to collect:

  • Description
  • Specifications table
  • Launch history
  • Images

This breakdown made the process understandable and easier to validate.

5) Selenium First, Then BeautifulSoup

We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:

  • Opens a real browser
  • Validates what is visible
  • Helps debug layout and missing fields

Once the structure was clear, we switched to BeautifulSoup for production because:

  • Faster
  • Lightweight
  • More stable for bulk scraping

The experimental and step scripts are kept in development_logs/ for reference and auditing.

6) Why SQLite for Intermediate Storage

We store scraped links and metadata in SQLite because:

  • It is simple and fast for local use
  • Easy to query and inspect
  • Great for pipeline checkpoints
  • No external database required

What is stored in SQLite

We store:

  • Countries and their links
  • Categories and their links per country
  • Satellites with name, category, operator, and detail page URL

This allows us to restart scraping without repeating earlier steps.
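
A plausible checkpoint schema for this layout is sketched below; the actual table and column names in the repo may differ:

```python
import sqlite3

# Hypothetical checkpoint schema; the repo's real table/column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS countries (
    name TEXT PRIMARY KEY,
    url  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS categories (
    country TEXT NOT NULL,
    name    TEXT NOT NULL,
    url     TEXT NOT NULL,
    PRIMARY KEY (country, name)
);
CREATE TABLE IF NOT EXISTS satellites (
    name     TEXT NOT NULL,
    country  TEXT NOT NULL,
    category TEXT NOT NULL,
    operator TEXT,
    url      TEXT NOT NULL UNIQUE
);
"""

def open_db(path: str = "data/scraper.db") -> sqlite3.Connection:
    """Open (or create) the checkpoint database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Because every row carries its source URL, a restarted run can skip any satellite whose detail page has already been fetched.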

7) Final Scraper (Production Grade)

The final scraper is in src/full_scraper.py. It:

  • Reads all satellite links from SQLite
  • Fetches each satellite page using a persistent requests session
  • Extracts structured data reliably
  • Stores everything into one clean JSON file

What the final JSON contains

Each satellite record includes:

  • id, name, country, category, operator, url
  • description
  • specifications (parsed from the #satdata table)
  • launch_history (parsed from the #satlist table)
  • images

This is the core dataset used for the RAG pipeline.
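
Because the spec table has a stable id (`#satdata`), extraction reduces to a straightforward table walk. A simplified sketch (src/full_scraper.py may handle more edge cases such as row spans and footnotes):

```python
from bs4 import BeautifulSoup

def parse_spec_table(html: str) -> dict:
    """Turn the #satdata key/value table on a satellite page into a dict.

    Simplified sketch relative to src/full_scraper.py; assumes one
    key cell and one value cell per row.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="satdata")
    specs = {}
    if table is None:
        return specs
    for row in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) >= 2 and cells[0]:
            specs[cells[0].rstrip(":")] = cells[1]
    return specs
```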

8) Why JSON for Final Output

We store all satellite details in a single JSON file because:

  • The dataset is small enough to load quickly
  • It is portable and easy to parse
  • It integrates cleanly with RAG chunking tools

This is the final source of truth for embedding and retrieval.

9) RAG Pipeline (Chunking + Embeddings)

Once we have clean satellite data, we convert it into a knowledge base for semantic search.

9.1 Document Formatting

In src/build_rag_index.py, each satellite is converted into a structured Markdown document:

  • Title includes satellite name
  • Country and operator are included at the top
  • Description section
  • Specifications section (with explicit key re-labeling for search clarity)
  • Launch history section

Why this matters:

  • Markdown adds clear structure for chunking
  • Injecting the satellite name into each spec line improves semantic matching
  • Launch details are normalized into consistent sentences for retrieval
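
The formatting step can be sketched like this; the field names mirror the JSON described above, but the exact Markdown layout produced by src/build_rag_index.py may differ:

```python
def satellite_to_markdown(sat: dict) -> str:
    """Render one satellite record as a Markdown document for chunking.

    Note how the satellite name is injected into every spec line, so a
    chunk containing only specs still matches queries about that satellite.
    Launch entries are treated as preformatted strings here for simplicity.
    """
    lines = [
        f"# {sat['name']}",
        f"Country: {sat.get('country', 'Unknown')} | Operator: {sat.get('operator', 'Unknown')}",
        "",
        "## Description",
        sat.get("description", "No description available."),
        "",
        "## Specifications",
    ]
    for key, value in sat.get("specifications", {}).items():
        lines.append(f"- {sat['name']} {key}: {value}")
    lines += ["", "## Launch History"]
    for launch in sat.get("launch_history", []):
        lines.append(f"- {sat['name']} was launched on {launch}.")
    return "\n".join(lines)
```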

9.2 Chunking Strategy

We split each Markdown document using:

  • RecursiveCharacterTextSplitter
  • chunk_size=1000, chunk_overlap=200
  • Separators: headings, paragraphs, lines, spaces

Why this matters:

  • Prevents context overflow
  • Keeps related information together
  • Improves recall during retrieval

9.3 Embeddings

We use BAAI/bge-small-en-v1.5 because:

  • Strong semantic search performance
  • Lightweight and fast for local use
  • Normalized embeddings improve similarity search

9.4 Vector Storage (ChromaDB)

We store embeddings in ChromaDB with local persistence:

  • Collection name: satellites
  • Vector size: 384 (matches BGE-small)
  • Distance: cosine

Why ChromaDB:

  • Native integration with LangChain
  • Tailored for local/embedded usage
  • Stability: handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces

10) Chatbot Logic (Semantic Retrieval + LLM)

The chatbot runs in Streamlit (src/app.py) and uses:

  • ChromaDB retriever for relevant chunks
  • Groq LLM (llama-3.3-70b-versatile) for answer generation

Prompt Design (Why it works)

The prompt explicitly enforces:

  • Precision with numbers and technical fields
  • Honest fallback if data is missing
  • Use of provided context only (avoid hallucinations)

This keeps answers grounded and accurate.
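
The three rules can be encoded directly in the prompt template; the wording below is illustrative, not the exact text in src/app.py:

```python
PROMPT_TEMPLATE = """You are a satellite data assistant.
Rules:
1. Be precise with numbers, dates, and technical specifications.
2. Answer ONLY from the context below; do not use outside knowledge.
3. If the context does not contain the answer, say:
   "I don't have that information in my knowledge base."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble the final prompt from retrieved chunks and the user question."""
    return PROMPT_TEMPLATE.format(
        context="\n\n---\n\n".join(chunks), question=question
    )
```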

11) Testing and Quality Checks

We include tests/test_rag.py to validate:

  • Model initialization
  • Retrieval quality
  • Hallucination resistance for out-of-scope questions

This provides a repeatable sanity check for the RAG system.

12) Deployment and Reproducibility

We support containerized deployment with a "Build-on-Start" strategy to handle large data files:

  1. Lazy Indexing (Self-Healing):

    • The application (src/app.py -> src/rag_engine.py) checks on startup whether the ChromaDB index exists and is non-empty.
    • If it is empty (e.g. on the first run in the cloud), it triggers src/build_rag_index.py to rebuild the index from the JSON data.
    • This removes the need to commit large binary database files (.sqlite3, .bin) to git, avoiding Git LFS quotas and rejected pushes.
  2. Container Setup:

    • The Dockerfile sets up the environment, including a start.sh script that handles file permissions.
    • .gitignore explicitly excludes data/chroma_db to ensure a clean slate for deployment.
    • .github/workflows/sync_to_huggingface.yml handles the sync to Hugging Face Spaces.

13) End-to-End Flow (Project Diagram)

Main Page (countries)
|- Country Page
|  |- Category Page
|     |- Satellite List (name + operator + link)
|     |  |- Satellite Detail Page (full data)
|     |     |- JSON Output
|     |- SQLite index for tracking

14) What We Have Achieved

We now have:

  • A verified scraper pipeline (BeautifulSoup)
  • Clean, structured satellite JSON data
  • A reproducible RAG pipeline (chunking + embeddings)
  • A working Streamlit chat UI
  • Docker and Hugging Face deployment readiness

15) Why This Approach Works

  • Scalable: We can add more countries easily
  • Reliable: Stored checkpoints in SQLite
  • Cost-effective: No paid APIs
  • Accurate: Data comes directly from the source
  • RAG-ready: JSON -> chunking -> embeddings -> ChromaDB

16) Next Steps (Optional)

  • Expand scraping from China to all countries
  • Add scheduled refresh jobs
  • Add evaluation metrics for RAG accuracy
  • Add UI filters (country, category)

Files Referenced

  • src/full_scraper.py
  • src/build_rag_index.py (includes build_index entry point)
  • src/rag_engine.py (lazy indexing logic)
  • src/app.py
  • tests/test_rag.py
  • Dockerfile
  • start.sh
  • README.md
  • development_logs/ (Contains legacy scripts: diagnose.py, evaluate_rag.py, etc.)
