# ChatBoog: Space Satellite Assistant - Project Documentation

## 1) Problem Statement

The goal of this project is simple and public-facing: make satellite data from every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.

We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:

- Source: `https://space.skyrocket.de/directories/sat_c.htm`
## 2) Why Scraping (Not API / Dataset)

We evaluated three ways to collect the data:

1. Public API - none available for this exact dataset.
2. Pre-built dataset - existing ones are incomplete, often outdated, or locked behind licenses.
3. Web scraping - free, reliable for this site, and under our control.

We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.
## 3) Website Structure (How We Navigate the Data)

The site exposes satellite data through a four-level structure. Understanding this structure is the key to correct scraping.

1. Main directory (all countries)
   - Contains every country name and link.
2. Country page
   - Lists all satellite categories for that country.
3. Category page
   - Lists satellites (names, sometimes operators) with links to the satellite pages.
4. Satellite page
   - Contains the full detail: description, specifications, launch history, and metadata.

This structure informed the exact scraper design.
## 4) Data Collection Strategy (Step-by-Step)

We intentionally built the pipeline in small, reliable steps first, then combined them into a fast final scraper.

### Step 1 - Countries

We first scraped all countries and their links. This gives us the root of the tree.

### Step 2 - Categories

For the target country (initially China), we scraped all categories and their links.

### Step 3 - Satellites

For each category, we captured:

- Satellite name
- Operator (if available)
- Link to the satellite details page
### Step 4 - Satellite Details

We then scraped each satellite detail page to collect:

- Description
- Specifications table
- Launch history
- Images

This breakdown made the process understandable and easier to validate.
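The first steps reduce to fetching a page and harvesting its anchors. A minimal sketch of Step 1, assuming the directory page links each country with a plain relative anchor (the selector and URL-joining logic here are illustrative, not the exact code of the production scraper):

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://space.skyrocket.de"

def parse_country_links(html: str) -> dict[str, str]:
    """Extract {country_name: absolute_url} from the main directory page."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for a in soup.select("a[href]"):
        name = a.get_text(strip=True)
        href = a["href"]
        if name and href.endswith(".htm"):  # keep only page links (assumption)
            if not href.startswith("http"):
                href = f"{BASE}/directories/{href}"
            links[name] = href
    return links

def fetch_countries(session: requests.Session) -> dict[str, str]:
    """Step 1: download the main directory and return all country links."""
    resp = session.get(f"{BASE}/directories/sat_c.htm", timeout=30)
    resp.raise_for_status()
    return parse_country_links(resp.text)
```

Steps 2 and 3 follow the same fetch-and-parse pattern, one level deeper each time.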
## 5) Selenium First, Then BeautifulSoup

We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:

- Opens a real browser
- Validates what is actually visible
- Helps debug layout and missing fields

Once the structure was clear, we switched to BeautifulSoup (with `requests`) for production because it is:

- Faster
- Lightweight
- More stable for bulk scraping

The experimental and step-by-step scripts are kept in `development_logs/` for reference and auditing.
## 6) Why SQLite for Intermediate Storage

We store scraped links and metadata in SQLite because:

- It is simple and fast for local use
- It is easy to query and inspect
- It works well for pipeline checkpoints
- It requires no external database server

### What is stored in SQLite

We store:

- Countries and their links
- Categories and their links, per country
- Satellites with name, category, operator, and detail page URL

This allows us to restart scraping without repeating earlier steps.
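As an illustration, a checkpoint schema along these lines supports idempotent re-runs (the actual table names and columns in the project may differ):

```python
import sqlite3

# Hypothetical schema mirroring the three stored levels.
SCHEMA = """
CREATE TABLE IF NOT EXISTS countries (
    name TEXT PRIMARY KEY,
    url  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS categories (
    country TEXT NOT NULL REFERENCES countries(name),
    name    TEXT NOT NULL,
    url     TEXT NOT NULL,
    PRIMARY KEY (country, name)
);
CREATE TABLE IF NOT EXISTS satellites (
    name     TEXT NOT NULL,
    country  TEXT NOT NULL,
    category TEXT NOT NULL,
    operator TEXT,
    url      TEXT NOT NULL,
    PRIMARY KEY (country, category, name)
);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint database, creating tables on first use."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With `INSERT OR IGNORE` (or an equivalent upsert), re-running a step simply skips rows that are already recorded, which is what makes restarts cheap.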
## 7) Final Scraper (Production Grade)

The final scraper is in `src/full_scraper.py`. It:

- Reads all satellite links from SQLite
- Fetches each satellite page using a persistent `requests` session
- Extracts structured data reliably
- Stores everything in one clean JSON file

### What the final JSON contains

Each satellite record includes:

- `id`, `name`, `country`, `category`, `operator`, `url`
- `description`
- `specifications` (parsed from the `#satdata` table)
- `launch_history` (parsed from the `#satlist` table)
- `images`

This is the core dataset used for the RAG pipeline.
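For the `specifications` field, the extraction might look like this sketch, which assumes the `#satdata` table uses simple label/value rows (the live markup may nest differently):

```python
from bs4 import BeautifulSoup

def parse_specifications(html: str) -> dict[str, str]:
    """Parse the #satdata table into a {field: value} dict."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="satdata")
    specs = {}
    if table is None:
        return specs  # page has no spec table; return empty rather than fail
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        if len(cells) >= 2:
            key = cells[0].get_text(strip=True).rstrip(":")
            value = cells[1].get_text(" ", strip=True)
            if key:
                specs[key] = value
    return specs
```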
## 8) Why JSON for Final Output

We store all satellite details in a single JSON file because:

- The dataset is small enough to load quickly
- It is portable and easy to parse
- It integrates cleanly with RAG chunking tools

This file is the final source of truth for embedding and retrieval.
## 9) RAG Pipeline (Chunking + Embeddings)

Once we have clean satellite data, we convert it into a knowledge base for semantic search.

### 9.1 Document Formatting

In `src/build_rag_index.py`, each satellite is converted into a structured Markdown document:

- The title includes the satellite name
- Country and operator appear at the top
- Description section
- Specifications section (with explicit key re-labeling for search clarity)
- Launch history section

Why this matters:

- Markdown adds clear structure for chunking
- Injecting the satellite name into each spec line improves semantic matching
- Launch details are normalized into consistent sentences for retrieval
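A simplified version of this formatting step (field names follow the JSON schema from Section 7; the exact wording of the injected lines is an illustrative choice, not the project's precise template):

```python
def satellite_to_markdown(sat: dict) -> str:
    """Render one satellite record as a Markdown document for chunking."""
    name = sat["name"]
    lines = [
        f"# Satellite: {name}",
        f"**Country:** {sat.get('country', 'Unknown')}",
        f"**Operator:** {sat.get('operator') or 'Unknown'}",
        "",
        "## Description",
        sat.get("description", ""),
        "",
        "## Specifications",
    ]
    # Inject the satellite name into every spec line so each chunk
    # remains self-identifying after splitting.
    for key, value in sat.get("specifications", {}).items():
        lines.append(f"- {name} {key}: {value}")
    return "\n".join(lines)
```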
### 9.2 Chunking Strategy

We split each Markdown document using:

- `RecursiveCharacterTextSplitter`
- `chunk_size=1000`, `chunk_overlap=200`
- Separators: headings, paragraphs, lines, spaces

Why this matters:

- Prevents context overflow
- Keeps related information together
- Improves recall during retrieval
### 9.3 Embeddings

We use `BAAI/bge-small-en-v1.5` because:

- Strong semantic search performance
- Lightweight and fast for local use
- Normalized embeddings improve similarity search
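Normalization is what makes cosine similarity cheap downstream: on unit-length vectors, cosine reduces to a dot product. A sketch (the model download happens on first use, so the encode call sits behind a lazy import; `embed_texts` is a hypothetical helper name):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (cosine similarity becomes a dot product)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed chunks with BGE-small; returns 384-dim unit vectors."""
    from sentence_transformers import SentenceTransformer  # lazy: triggers model download
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    return model.encode(texts, normalize_embeddings=True).tolist()
```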
### 9.4 Vector Storage (ChromaDB)

We store embeddings in ChromaDB with local persistence:

- Collection name: `satellites`
- Vector size: 384 (matches BGE-small)
- Distance: cosine

Why ChromaDB:

- **Integration:** Native LangChain support
- **Simplicity:** Tailored for local/embedded usage
- **Stability:** Handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces
## 10) Chatbot Logic (Semantic Retrieval + LLM)

The chatbot runs in Streamlit (`src/app.py`) and uses:

- The ChromaDB retriever to fetch relevant chunks
- The Groq LLM (`llama-3.3-70b-versatile`) for answer generation

### Prompt Design (Why It Works)

The prompt explicitly enforces:

- Precision with numbers and technical fields
- An honest fallback when data is missing
- Use of the provided context only (to avoid hallucinations)

This keeps answers grounded and accurate.
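The wording below is an illustrative reconstruction, not the project's exact prompt; it encodes the three rules listed above:

```python
# Hypothetical prompt template in the spirit of src/app.py.
SYSTEM_PROMPT = """You are a satellite data assistant.
Answer ONLY from the context below.
Quote numbers and technical fields exactly as written.
If the context does not contain the answer, say you don't know."""

def build_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble the grounded prompt sent to the LLM."""
    context = "\n\n---\n\n".join(context_chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```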
## 11) Testing and Quality Checks

We include `tests/test_rag.py` to validate:

- Model initialization
- Retrieval quality
- Hallucination resistance for out-of-scope questions

This provides a repeatable sanity check for the RAG system.
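A sketch of the out-of-scope check (here `ask` stands in for the real chat function, and the refusal markers are illustrative):

```python
REFUSAL_MARKERS = ("don't know", "not in the provided", "no information")

def looks_like_refusal(answer: str) -> bool:
    """Heuristic: does the answer decline rather than invent facts?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def check_out_of_scope(ask) -> bool:
    # A question unrelated to satellites should trigger a refusal,
    # not a hallucinated answer.
    return looks_like_refusal(ask("What is the capital of France?"))
```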
## 12) Deployment and Reproducibility

We support containerized deployment with a **"Build-on-Start"** strategy to handle large data files:

1. **Lazy Indexing (Self-Healing):**
   - The application (`src/app.py` -> `src/rag_engine.py`) automatically checks on startup whether the ChromaDB index exists and is non-empty.
   - If it is empty (first run on the cloud), it triggers `src/build_rag_index.py` to rebuild the index from the JSON data.
   - This removes the need to push large binary database files (`.sqlite3`, `.bin`) to git, avoiding Git LFS quotas and rejection errors.
2. **Container Setup:**
   - `Dockerfile` sets up the environment, with `start.sh` handling permissions.
   - `.gitignore` explicitly excludes `data/chroma_db` to ensure a clean slate for deployment.
   - `.github/workflows/sync_to_huggingface.yml` handles the sync to Hugging Face Spaces.
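The self-healing check reduces to a few lines; `build_fn` and `count_fn` below are hypothetical stand-ins for the `build_rag_index.build_index` entry point and a Chroma `collection.count()` call:

```python
import os

def ensure_index(persist_dir: str, build_fn, count_fn) -> None:
    """Rebuild the vector index only when it is missing or empty."""
    if not os.path.isdir(persist_dir) or count_fn() == 0:
        build_fn()  # e.g. re-run the chunk + embed + store pipeline
```

Because the check runs on every startup, a wiped or never-populated volume heals itself without any manual deploy step.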
## 13) End-to-End Flow (Project Diagram)

```
Main Page (countries)
  |- Country Page
       |- Category Page
            |- Satellite List (name + operator + link)
                 |- Satellite Detail Page (full data)
                      |- JSON Output (final dataset)
                      |- SQLite index (tracking / checkpoints)
```
## 14) What We Have Achieved

We now have:

- A verified scraper pipeline (BeautifulSoup)
- Clean, structured satellite JSON data
- A reproducible RAG pipeline (chunking + embeddings)
- A working Streamlit chat UI
- Docker and Hugging Face deployment readiness

## 15) Why This Approach Works

- Scalable: we can add more countries easily
- Reliable: checkpoints are stored in SQLite
- Cost-effective: no paid APIs
- Accurate: data comes directly from the source
- RAG-ready: JSON -> chunking -> embeddings -> ChromaDB
## 16) Next Steps (Optional)

- Expand scraping from China to all countries
- Add scheduled refresh jobs
- Add evaluation metrics for RAG accuracy
- Add UI filters (country, category)
---

### Files Referenced

- `src/full_scraper.py`
- `src/build_rag_index.py` (includes the `build_index` entry point)
- `src/rag_engine.py` (lazy indexing logic)
- `src/app.py`
- `tests/test_rag.py`
- `Dockerfile`
- `start.sh`
- `README.md`
- `development_logs/` (legacy scripts: `diagnose.py`, `evaluate_rag.py`, etc.)