# ChatBoog: Space Satellite Assistant - Project Documentation

## 1) Problem Statement
The goal of this project is simple: make satellite data from every country accessible through a chat interface. Users should be able to ask questions in natural language and get accurate, grounded answers.

We chose Gunter's Space Page as the source because it is comprehensive, open, and consistently structured:
- Source: `https://space.skyrocket.de/directories/sat_c.htm`

## 2) Why Scraping (Not API / Dataset)
We evaluated three ways to collect the data:
1. Public API - none available for this exact dataset.
2. Pre-built dataset - not complete and often outdated or locked.
3. Web scraping - free, reliable for this site, and under our control.

We selected scraping because it is cost-effective, repeatable, and gives us the most complete coverage.

## 3) Website Structure (How We Navigate the Data)
The site exposes satellite data through a 4-level structure. Understanding this structure is the key to correct scraping.

1. Main directory (all countries)
   - Contains every country name and link.
2. Country page
   - Lists all satellite categories for that country.
3. Category page
   - Lists satellites (names, sometimes operators) with links to the satellite pages.
4. Satellite page
   - Contains the full detail: description, specifications, launch history, and metadata.

This structure informed the exact scraper design.

## 4) Data Collection Strategy (Step-by-Step)
We intentionally built the pipeline in small, reliable steps first, then combined it into a fast final scraper.

### Step 1 - Countries
We first scraped all countries and their links. This gives the root of the tree.

### Step 2 - Categories
For the target country (initially China), we scraped all categories and their links.

### Step 3 - Satellites
For each category, we captured:
- Satellite name
- Operator (if available)
- Link to the satellite details page

### Step 4 - Satellite Details
We then scraped each satellite detail page to collect:
- Description
- Specifications table
- Launch history
- Images

This breakdown made the process understandable and easier to validate.
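The link-extraction core of Steps 1–3 can be sketched with BeautifulSoup. The HTML snippet and selector below are illustrative only, not the site's actual markup; the production scraper fetches pages over a `requests` session instead of using an inline string.

```python
from bs4 import BeautifulSoup

BASE = "https://space.skyrocket.de"

# Illustrative directory snippet -- the real page's markup differs.
DIRECTORY_HTML = """
<ul>
  <li><a href="/directories/chronology_china.htm">China</a></li>
  <li><a href="/directories/chronology_france.htm">France</a></li>
</ul>
"""

def extract_links(html):
    """Map link text to absolute URLs (used for countries and categories)."""
    soup = BeautifulSoup(html, "html.parser")
    return {a.get_text(strip=True): BASE + a["href"]
            for a in soup.select("a[href]")}

countries = extract_links(DIRECTORY_HTML)
```

The same helper shape applies at each level of the tree: parse a listing page, collect name-to-URL pairs, descend.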

## 5) Selenium First, Then BeautifulSoup
We initially used Selenium to explore and verify the structure. Selenium is heavy but good for discovery:
- Opens a real browser
- Validates what is visible
- Helps debug layout and missing fields

Once the structure was clear, we switched to BeautifulSoup for production because:
- Faster
- Lightweight
- More stable for bulk scraping

The experimental and step-by-step scripts are kept in `development_logs/` for reference and auditing.

## 6) Why SQLite for Intermediate Storage
We store scraped links and metadata in SQLite because:
- It is simple and fast for local use
- Easy to query and inspect
- Great for pipeline checkpoints
- No external database required

### What is stored in SQLite
We store:
- Countries and their links
- Categories and their links per country
- Satellites with name, category, operator, and detail page URL

This allows us to restart scraping without repeating earlier steps.
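A minimal sketch of the checkpoint schema follows; the table and column names are illustrative, not necessarily the project's exact schema, and the real database lives on disk rather than in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the project uses a file on disk
conn.executescript("""
CREATE TABLE IF NOT EXISTS countries (
    name TEXT PRIMARY KEY, url TEXT);
CREATE TABLE IF NOT EXISTS categories (
    country TEXT, name TEXT, url TEXT,
    PRIMARY KEY (country, name));
CREATE TABLE IF NOT EXISTS satellites (
    name TEXT, country TEXT, category TEXT,
    operator TEXT, url TEXT PRIMARY KEY);
""")
conn.execute("INSERT OR IGNORE INTO satellites VALUES (?,?,?,?,?)",
             ("DFH-1", "China", "Technology", "CAST",
              "https://space.skyrocket.de/doc_sdat/dfh-1.htm"))
# Checkpoint query: which detail pages still need scraping?
todo = conn.execute("SELECT url FROM satellites").fetchall()
```

`INSERT OR IGNORE` on the URL primary key is what makes re-runs idempotent: already-recorded satellites are skipped.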

## 7) Final Scraper (Production Grade)
The final scraper is in `src/full_scraper.py`. It:
- Reads all satellite links from SQLite
- Fetches each satellite page using a persistent requests session
- Extracts structured data reliably
- Stores everything into one clean JSON file

### What the final JSON contains
Each satellite record includes:
- `id`, `name`, `country`, `category`, `operator`, `url`
- `description`
- `specifications` (parsed from the `#satdata` table)
- `launch_history` (parsed from the `#satlist` table)
- `images`

This is the core dataset used for the RAG pipeline.
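The `#satdata` extraction can be sketched as a two-column table parse. The rows below are made up for illustration; the real table carries more fields.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration; the real #satdata table has more fields.
HTML = """
<table id="satdata">
  <tr><td>Type / Application:</td><td>Communication</td></tr>
  <tr><td>Mass:</td><td>2300 kg</td></tr>
</table>
"""

def parse_satdata(html):
    """Turn the two-column #satdata table into a specifications dict."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="satdata")
    specs = {}
    for row in table.find_all("tr") if table else []:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            specs[cells[0].rstrip(":")] = cells[1]
    return specs
```

The `#satlist` launch-history table is parsed the same way, just with more columns per row.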

## 8) Why JSON for Final Output
We store all satellite details in a single JSON file because:
- The dataset is small enough to load quickly
- It is portable and easy to parse
- It integrates cleanly with RAG chunking tools

This is the final source of truth for embedding and retrieval.

## 9) RAG Pipeline (Chunking + Embeddings)
Once we have clean satellite data, we convert it into a knowledge base for semantic search.

### 9.1 Document Formatting
In `src/build_rag_index.py`, each satellite is converted into a structured Markdown document:
- Title includes satellite name
- Country and operator are included at the top
- Description section
- Specifications section (with explicit key re-labeling for search clarity)
- Launch history section

Why this matters:
- Markdown adds clear structure for chunking
- Injecting the satellite name into each spec line improves semantic matching
- Launch details are normalized into consistent sentences for retrieval
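A minimal sketch of the name-injection formatting, assuming the JSON field names described in section 7 (the actual function in `src/build_rag_index.py` handles more sections and edge cases):

```python
def satellite_to_markdown(sat):
    """Render one satellite record as a Markdown document for chunking."""
    lines = [
        f"# {sat['name']}",
        f"Country: {sat.get('country', 'Unknown')} | "
        f"Operator: {sat.get('operator', 'Unknown')}",
        "",
        "## Description",
        sat.get("description", ""),
        "",
        "## Specifications",
    ]
    # Inject the satellite name into each spec line so a chunk that only
    # contains specs still matches queries mentioning the satellite.
    for key, value in sat.get("specifications", {}).items():
        lines.append(f"- {sat['name']} {key}: {value}")
    return "\n".join(lines)
```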

### 9.2 Chunking Strategy
We split each Markdown document using:
- `RecursiveCharacterTextSplitter`
- `chunk_size=1000`, `chunk_overlap=200`
- Separators: headings, paragraphs, lines, spaces

Why this matters:
- Prevents context overflow
- Keeps related information together
- Improves recall during retrieval
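The splitting idea can be shown with a simplified pure-Python stand-in for LangChain's `RecursiveCharacterTextSplitter`: pick the highest-priority separator present, then pack pieces into size-bounded chunks with a character overlap. (The real splitter also recurses into lower-priority separators for oversized pieces, which this sketch omits.)

```python
def split_text(text, chunk_size=1000, chunk_overlap=200,
               separators=("\n\n", "\n", " ")):
    """Greedy splitter: break on the first separator found in the text,
    then pack pieces into chunks of at most chunk_size characters,
    carrying chunk_overlap characters between consecutive chunks."""
    sep = next((s for s in separators if s in text), "")
    pieces = text.split(sep) if sep else [text]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            current = (current[-chunk_overlap:] + sep + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks
```

The overlap is what keeps a specification that straddles a chunk boundary retrievable from both sides.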

### 9.3 Embeddings
We use `BAAI/bge-small-en-v1.5` because:
- Strong semantic search performance
- Lightweight and fast for local use
- Normalized embeddings improve similarity search

### 9.4 Vector Storage (ChromaDB)
We store embeddings in ChromaDB with local persistence:
- Collection name: `satellites`
- Vector size: 384 (matches BGE-small)
- Distance: cosine

Why ChromaDB:
- Native integration with LangChain
- Tailored for local/embedded usage
- **Stability:** Handles file locking better than Qdrant in stateless/ephemeral environments like Hugging Face Spaces.
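To illustrate why normalized embeddings pair naturally with cosine distance, here is a toy retrieval example with hand-made 3-d vectors (real BGE-small vectors are 384-d): on unit vectors, the dot product *is* the cosine similarity, so ranking reduces to a dot product.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def cosine_sim(a, b):
    # Valid as cosine similarity only because both inputs are unit vectors.
    return sum(x * y for x, y in zip(a, b))

docs = {"sat-a": normalize([1.0, 2.0, 0.5]),
        "sat-b": normalize([0.1, 0.1, 3.0])}
query = normalize([1.0, 2.1, 0.4])  # close in direction to sat-a
best = max(docs, key=lambda d: cosine_sim(query, docs[d]))
```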

## 10) Chatbot Logic (Semantic Retrieval + LLM)
The chatbot runs in Streamlit (`src/app.py`) and uses:
- ChromaDB retriever for relevant chunks
- Groq LLM (`llama-3.3-70b-versatile`) for answer generation

### Prompt Design (Why it works)
The prompt explicitly enforces:
- Precision with numbers and technical fields
- Honest fallback if data is missing
- Use of provided context only (avoid hallucinations)

This keeps answers grounded and accurate.
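A minimal sketch of such a grounding prompt; the exact wording in `src/app.py` differs, and the fallback phrase here is only an example:

```python
PROMPT_TEMPLATE = """You are a satellite data assistant.
Answer ONLY from the context below. Be precise with numbers,
dates, and technical fields. If the answer is not in the
context, say: "I don't have that information in my data."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context, question):
    """Fill the retrieved chunks and user question into the template."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```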

## 11) Testing and Quality Checks
We include `tests/test_rag.py` to validate:
- Model initialization
- Retrieval quality
- Hallucination resistance for out-of-scope questions

This provides a repeatable sanity check for the RAG system.

## 12) Deployment and Reproducibility
We support containerized deployment with a **"Build-on-Start"** strategy to handle large data files:

1.  **Lazy Indexing (Self-Healing):**
    - The application (`src/app.py` -> `src/rag_engine.py`) automatically checks on startup whether the ChromaDB index exists and is populated.
    - If empty (first run on cloud), it triggers `src/build_rag_index.py` to rebuild the index from the JSON data.
    - This bypasses the need to push large binary database files (`.sqlite3`, `.bin`) to git, avoiding Git LFS quotas and rejection errors.

2.  **Container Setup:**
    - `Dockerfile` sets up the environment, including `start.sh` handling permissions.
    - `.gitignore` explicitly excludes `data/chroma_db` to ensure a clean slate for deployment.
    - `.github/workflows/sync_to_huggingface.yml` handles the sync to Hugging Face Spaces.
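The lazy-indexing check reduces to a small helper; `ensure_index` and the callable it takes are hypothetical names for illustration, not the actual functions in `src/rag_engine.py`:

```python
from pathlib import Path

def ensure_index(persist_dir, build_fn):
    """Rebuild the vector index only when it is missing or empty.

    build_fn is expected to populate persist_dir (e.g. by running the
    logic in src/build_rag_index.py). Returns True if a rebuild ran.
    """
    p = Path(persist_dir)
    if not p.exists() or not any(p.iterdir()):
        p.mkdir(parents=True, exist_ok=True)
        build_fn(p)
        return True   # index was (re)built on this start
    return False      # existing index reused
```

On a fresh cloud container the directory is absent, so the first start pays the build cost once; subsequent restarts reuse the persisted index.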

## 13) End-to-End Flow (Project Diagram)
```
Main Page (countries)
|- Country Page
|  |- Category Page
|     |- Satellite List (name + operator + link)
|        |- Satellite Detail Page (full data)
|        |- JSON Output
|     |- SQLite index for tracking
```

## 14) What We Have Achieved
We now have:
- A verified scraper pipeline (BeautifulSoup)
- Clean, structured satellite JSON data
- A reproducible RAG pipeline (chunking + embeddings)
- A working Streamlit chat UI
- Docker and Hugging Face deployment readiness

## 15) Why This Approach Works
- Scalable: We can add more countries easily
- Reliable: Stored checkpoints in SQLite
- Cost-effective: No paid APIs
- Accurate: Data comes directly from the source
- RAG-ready: JSON -> chunking -> embeddings -> ChromaDB

## 16) Next Steps (Optional)
- Expand scraping from China to all countries
- Add scheduled refresh jobs
- Add evaluation metrics for RAG accuracy
- Add UI filters (country, category)

---

### Files Referenced
- `src/full_scraper.py`
- `src/build_rag_index.py` (includes `build_index` entry point)
- `src/rag_engine.py` (lazy indexing logic)
- `src/app.py`
- `tests/test_rag.py`
- `Dockerfile`
- `start.sh`
- `README.md`
- `development_logs/` (Contains legacy scripts: `diagnose.py`, `evaluate_rag.py`, etc.)
