Sathvik-kota committed on
Commit
9530eca
·
verified ·
1 Parent(s): 3f41e4d

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +171 -136
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Document Search Engine
  emoji: 📄
  colorFrom: blue
  colorTo: purple
@@ -9,166 +9,217 @@ app_file: start.sh
  pinned: false
  ---

- # Multi-Document Semantic Search Engine
  A **production-inspired multi-microservice semantic search system** built over 20+ text documents.

  Designed with:

  - **Sentence-Transformers** (`all-MiniLM-L6-v2`)
  - **Local Embedding Cache**
- - **FAISS vector search + persistent storage**
- - **LLM-Driven Explanations** (Gemini 2.5 Flash)
  - **Google-Gemini-Style Streamlit UI**
- - **Microservice Architecture**
- - **Full Evaluation Suite**: Accuracy · MRR · nDCG

- Showcasing real-world architecture and ML system design.

  ---

- ## 🚀 Features

- ### 🔹 Core Search

- - Embedding-based semantic search over `.txt` docs
- - FAISS `IndexFlatL2` on normalized vectors
- - Top-K ranking + score display
- - Keyword overlap, overlap ratio, top matching sentences

- ### 🔹 Microservice Architecture

- Each logical component runs as a **separate FastAPI microservice**:

  | Service | Responsibility |
  |--------|----------------|
- | **doc_service** | Load + clean + hash documents |
- | **embed_service** | MiniLM embeddings + caching |
- | **search_service** | FAISS index build + vector search |
- | **explain_service** | Keyword overlap + top sentences + LLM reasoning |
- | **api_gateway** | Full pipeline orchestration |
- | **streamlit_ui** | Gemini-styled user interface |
-
- This mirrors real production designs (scalable, modular, interchangeable components).
- User
- │ 1. enters query
- Streamlit UI
- │ 2. POST /search
- API Gateway
- │ 3. Embed query text
- Embed Service
- │ (cache hit? yes → return instantly)
- │ (cache miss? → compute embedding)
- API Gateway
- │ 4. Search FAISS index
- Search Service
- │ returns top-k doc IDs
- API Gateway
- │ 5. Fetch document content
- Doc Service
- │ returns full + clean text
- API Gateway
- │ 6. Generate explanation
- Explain Service
- │ - keyword overlap
- │ - semantic top sentences
- │ - Gemini LLM explanation
- API Gateway
- │ 7. Return final response
- Streamlit
-
- User sees ranked cards + explanations
-
- ### 🔹 Explanations
-
- For each search result you get:
-
- - Why this document was matched (LLM explanation)
- - Which keywords overlapped (simple heuristic)
- - Overlap ratio (0–1)
- - Top matching sentences (semantic similarity)
-
- ---
- ### 🔹 **Evaluation Suite**
- Metrics included:
  - **Accuracy**
  - **MRR (Mean Reciprocal Rank)**
  - **nDCG@K**
- - **Per-query table**
- - **Correct vs Incorrect Fetches**

  ---
- # How Caching Works
- Caching happens inside **`embed_service/cache_manager.py`**. We never embed the same document twice.

- ### Prevents re-embedding unchanged files
- Each document is identified by: filename + MD5(clean_text)

- If `(filename, hash)` already exists:
- - the embedding is **loaded instantly**
- - avoids recomputing MiniLM embeddings
- - makes repeated runs extremely fast

- ### Cache contents:
- - `cache/embed_meta.json` → maps filename → `{"hash": "...", "index": int}`
- - `cache/embeddings.npy` → stacked embedding matrix

- Caching benefits:
- - Faster startup
- - Faster user queries
- - Less compute usage
- - More production-ready
 

  ---

- # How to Run Embedding Generation
- ### Embedding happens automatically during **initialization**:

- `POST /initialize` (handled by API Gateway):

- 1. Load all docs from `data/docs`
- 2. Send batch clean texts → **embed_service**
- 3. Cache manager stores new embeddings
- 4. FAISS index built in **search_service**

- ### Manual Embedding (Optional)

- You can call:
- POST /embed_batch
- POST /embed_document

  ---
- ### FAISS Persistence (Warm Start Optimization)

- The system stores embeddings **and** the FAISS vector index on disk:

- - `cache/embeddings.npy` → all stored embeddings
- - `cache/embed_meta.json` → filename → hash → embedding index
- - `faiss_index.bin` → saved FAISS index
- - `faiss_meta.pkl` → mapping of FAISS row → document filename

- On startup, the `search_service` automatically runs:
  ---
  ## Design Choices

  ### 1️⃣ **Microservices instead of Monolithic**
@@ -227,21 +278,5 @@ L2 distance is used instead of cosine because:

- ## Architecture Overview
-
- ### High-level Flow
-
- 1. User asks a question in **Streamlit UI**
- 2. UI sends request → **API Gateway** `/search`
- 3. Gateway:
-    - Embeds query via **Embed Service**
-    - Searches FAISS via **Search Service**
-    - Fetches full doc text from **Doc Service**
-    - Gets explanation from **Explain Service**
- 4. Response returned to UI with:
-    - filename, score, preview, full text
-    - keyword overlap, overlap ratio
-    - top matching sentences
-    - optional LLM explanation
 
  ---
+ title: Document Search Engine
  emoji: 📄
  colorFrom: blue
  colorTo: purple

  pinned: false
  ---

+ # Multi-Document Semantic Search Engine
  A **production-inspired multi-microservice semantic search system** built over 20+ text documents.

  Designed with:
+
  - **Sentence-Transformers** (`all-MiniLM-L6-v2`)
  - **Local Embedding Cache**
+ - **FAISS Vector Search + Persistent Storage**
+ - **LLM-Driven Explanations (Gemini 2.5 Flash)**
  - **Google-Gemini-Style Streamlit UI**
+ - **Real Microservice Architecture**
+ - **Full Evaluation Suite (Accuracy · MRR · nDCG)**

+ A complete end-to-end ML system demonstrating real-world architecture & search engineering.

  ---

+ # Features
+
+ ## 🔹 Core Search
+
+ - Embedding-based semantic search over `.txt` documents
+ - FAISS `IndexFlatL2` on **normalized vectors** (≈ cosine similarity)
+ - Top-K ranking + similarity scores
+ - Keyword overlap, overlap ratio
+ - Top semantic sentences
+ - Full-text preview
+
+ ---
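The Core Search list notes that `IndexFlatL2` over normalized vectors behaves like cosine similarity. The identity behind that claim is easy to verify: for unit vectors, squared L2 distance equals 2 − 2·cos, so ranking by L2 distance and ranking by cosine similarity coincide. A quick numerical check (illustrative, not project code):

```python
import numpy as np

# For unit-normalized u, v: ||u - v||^2 = 2 - 2 * cos(u, v),
# so sorting by L2 distance == sorting by cosine similarity.
rng = np.random.default_rng(0)
u = rng.normal(size=384)
v = rng.normal(size=384)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

l2_sq = float(np.sum((u - v) ** 2))
cos = float(u @ v)
assert np.isclose(l2_sq, 2 - 2 * cos)
```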
 
 
 

+ ## 🔹 Microservice Architecture (5 FastAPI Services)

+ Each component runs as an **independent microservice**, mirroring real production systems:

  | Service | Responsibility |
  |--------|----------------|
+ | **doc_service** | Load, clean, normalize, hash, and store documents |
+ | **embed_service** | MiniLM embedding generation + caching |
+ | **search_service** | FAISS index build, update, and vector search |
+ | **explain_service** | Keyword overlap, top sentences, LLM explanations |
+ | **api_gateway** | Orchestration: a clean unified API for the UI |
+ | **streamlit_ui** | Gemini-style user interface |
+
+ This separation supports **scalability**, **fault isolation**, and **independent service upgrades**, like real enterprise ML platforms.
+
+ ---
+
+ ## 🔹 Explanations
+
+ Every search result includes:
+
+ - **Keyword overlap**
+ - **Semantic overlap ratio**
+ - **Top relevant sentences (MiniLM sentence similarity)**
+ - **LLM-generated explanation**: “Why did this document match your query?”
+
+ ---
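The keyword-overlap and overlap-ratio signals are simple set heuristics; one plausible implementation looks like the sketch below (the actual tokenization in `explain_service` may differ):

```python
import re

def keyword_overlap(query: str, doc_text: str):
    """Return the shared keywords and the overlap ratio (0-1, relative to the query)."""
    # Lowercased alphanumeric word sets; an illustrative heuristic, not project code.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc_text.lower()))
    shared = sorted(q & d)
    ratio = len(shared) / len(q) if q else 0.0
    return shared, ratio
```

For example, `keyword_overlap("vector search engine", "A search engine for documents")` shares two of the three query words, giving a ratio of about 0.67.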
+
+ ## 🔹 Evaluation Suite
+
+ A built-in evaluation workflow providing:
+
  - **Accuracy**
  - **MRR (Mean Reciprocal Rank)**
  - **nDCG@K**
+ - Correct vs Incorrect queries
+ - Per-query detailed table
+ - Ideal for assignments, research, and experiments

  ---
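The ranking metrics above fit in a few lines. A minimal sketch, assuming binary relevance with a single relevant document per query (the project's evaluation code may differ):

```python
import math

def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the relevant doc per query."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, rel, k):
    """nDCG@K with one relevant doc and binary gains, so the ideal DCG is 1."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == rel:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

For instance, two queries whose relevant docs land at ranks 1 and 2 give an MRR of (1 + 0.5) / 2 = 0.75.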
 
 

+ # How Caching Works
+
+ Caching happens inside **`embed_service/cache_manager.py`**. The same document is never embedded twice.
+
+ ### Zero repeated embeddings
+ Each document is fingerprinted using:
+
+ - **filename**
+ - **MD5(cleaned_text)**
+
+ If the hash matches a previously stored file:
+
+ - the cached embedding is loaded instantly
+ - costly re-embedding is avoided
+ - startup & query latency improve
+
+ ### Cache Files
+ - `cache/embed_meta.json` → mapping of filename → `{hash, index}`
+ - `cache/embeddings.npy` → matrix of all embeddings
+
+ ### Benefits
+ - Startup: **5–10 seconds → <1 second**
+ - Low compute cost
+ - Ideal for Hugging Face Spaces
+ - Reproducible results across runs

  ---
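The fingerprint-and-reuse flow can be sketched as follows. This is an illustrative mock of the `cache_manager.py` idea using the file layout described above; the class and method names (`EmbedCache`, `get`, `put`) are hypothetical, not the project's actual API:

```python
import hashlib
import json
import os

import numpy as np

class EmbedCache:
    """Hash-keyed embedding cache backed by embed_meta.json + embeddings.npy."""

    def __init__(self, cache_dir: str, dim: int = 384):
        self.meta_path = os.path.join(cache_dir, "embed_meta.json")
        self.npy_path = os.path.join(cache_dir, "embeddings.npy")
        if os.path.exists(self.meta_path):
            with open(self.meta_path) as f:
                self.meta = json.load(f)            # filename -> {hash, index}
            self.matrix = np.load(self.npy_path)    # stacked embeddings
        else:
            self.meta, self.matrix = {}, np.empty((0, dim))

    @staticmethod
    def _fingerprint(clean_text: str) -> str:
        return hashlib.md5(clean_text.encode("utf-8")).hexdigest()

    def get(self, filename: str, clean_text: str):
        entry = self.meta.get(filename)
        if entry and entry["hash"] == self._fingerprint(clean_text):
            return self.matrix[entry["index"]]      # cache hit: no re-embedding
        return None                                 # miss: caller must embed

    def put(self, filename: str, clean_text: str, embedding: np.ndarray):
        self.meta[filename] = {"hash": self._fingerprint(clean_text),
                               "index": int(self.matrix.shape[0])}
        self.matrix = np.vstack([self.matrix, embedding[None, :]])
        np.save(self.npy_path, self.matrix)
        with open(self.meta_path, "w") as f:
            json.dump(self.meta, f)
```

A changed document produces a different MD5, so it falls through to a cache miss and gets re-embedded; unchanged files are served from the stored matrix.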
 
+ # FAISS Persistence (Warm Start Optimization)
+
+ This project saves **both** the embeddings and the FAISS index:
+
+ - `cache/embeddings.npy`
+ - `cache/embed_meta.json`
+ - `faiss_index.bin`
+ - `faiss_meta.pkl`
+
+ On startup, search_service runs `indexer.try_load()`:
+
+ - If the files are found → the index is loaded instantly.
+ - If not → the FAISS index is rebuilt from cached embeddings.
+
+ ### Why this matters
+ - Makes FAISS behave like a **persistent vector database**
+ - Extremely important for **Docker**, **Spaces**, and **cold restarts**
+ - No delay from rebuilding large indexes on every restart
  ---
 
 

+ # Folder Structure
+
+ src/
+   doc_service/
+     app.py
+     utils.py
+   embed_service/
+     app.py
+     embedder.py
+     cache_manager.py
+   search_service/
+     app.py
+     indexer.py
+   explain_service/
+     app.py
+     explainer.py
+   api_gateway/
+     app.py
+   ui/
+     streamlit_app.py
+ data/
+   docs/
+     <all .txt documents>
+ cache/
+ faiss_index.bin
+ faiss_meta.pkl
+ requirements.txt
+ Dockerfile
+ start.sh
+ README.md

+ ---

+ # How to Run Embedding Generation
+
+ Embeddings are generated automatically during initialization (`POST /initialize`, handled by the API Gateway):
+
+ Pipeline:
+
+ 1. **doc_service** → load + clean + hash
+ 2. **embed_service** → create or load cached embeddings
+ 3. **search_service** → FAISS index build or load
+ 4. Return summary
+
  ---

+ # How to Start the API
+
+ All services are launched using:
+
+ ```bash
+ bash start.sh
+ ```
+
+ This starts:
+
+ - 9001 → doc_service
+ - 9002 → embed_service
+ - 9003 → search_service
+ - 9004 → explain_service
+ - 8000 → api_gateway
+ - 7860 → Streamlit UI
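The contents of `start.sh` are not shown in this diff; a plausible sketch of what it might run, assuming uvicorn apps at the module paths from the folder structure and the ports listed above (the actual script may differ):

```shell
#!/usr/bin/env bash
# Hypothetical launch script: one uvicorn worker per service, UI in foreground.
uvicorn src.doc_service.app:app     --port 9001 &
uvicorn src.embed_service.app:app   --port 9002 &
uvicorn src.search_service.app:app  --port 9003 &
uvicorn src.explain_service.app:app --port 9004 &
uvicorn src.api_gateway.app:app     --port 8000 &
streamlit run src/ui/streamlit_app.py --server.port 7860
```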
+
+ ## Architecture Overview
+
+ ### High-level Flow
+
+ 1. User asks a question in **Streamlit UI**
+ 2. UI sends request → **API Gateway** `/search`
+ 3. Gateway:
+    - Embeds query via **Embed Service**
+    - Searches FAISS via **Search Service**
+    - Fetches full doc text from **Doc Service**
+    - Gets explanation from **Explain Service**
+ 4. Response returned to UI with:
+    - filename, score, preview, full text
+    - keyword overlap, overlap ratio
+    - top matching sentences
+    - optional LLM explanation
+
  ## Design Choices

  ### 1️⃣ **Microservices instead of Monolithic**