---
title: Document Search Engine
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: "0.0.0"
app_file: start.sh
pinned: false
---
# Multi-Document Semantic Search Engine
A **production-inspired multi-microservice semantic search system** built over 20+ text documents.
Designed with:
- **Sentence-Transformers** (`all-MiniLM-L6-v2`)
- **Local Embedding Cache**
- **FAISS Vector Search + Persistent Storage**
- **LLM-Driven Explanations (Gemini 2.5 Flash)**
- **Google-Gemini-Style Streamlit UI**
- **Real Microservice Architecture**
- **Full Evaluation Suite (Accuracy · MRR · nDCG)**
A complete end-to-end ML system demonstrating real-world architecture & search engineering.
---
# Features
## 🔹 Core Search
- Embedding-based semantic search over `.txt` documents
- FAISS `IndexFlatL2` on **normalized vectors** (ranking equivalent to cosine similarity)
- Top-K ranking + similarity scores
- Keyword overlap, overlap ratio
- Top semantic sentences
- Full-text preview
---
## 🔹 Microservice Architecture (5 FastAPI Services + Streamlit UI)
Each component runs as an **independent microservice**, mirroring real production systems:
| Service | Responsibility |
|--------|----------------|
| **doc_service** | Load, clean, normalize, hash, and store documents |
| **embed_service** | MiniLM embedding generation + caching |
| **search_service** | FAISS index build, update, and vector search |
| **explain_service** | Keyword overlap, top sentences, LLM explanations |
| **api_gateway** | Orchestration: a clean unified API for the UI |
| **streamlit_ui** | Gemini-style user interface |
This separation supports **scalability**, **fault isolation**, and **independent service upgrades**, *as in real enterprise ML platforms*.
---
## 🔹 Explanations
Every search result includes:
- **Keyword overlap**
- **Semantic overlap ratio**
- **Top relevant sentences (MiniLM sentence similarity)**
- **LLM-generated explanation**:
  “Why did this document match your query?”
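As a rough illustration of the keyword-overlap signals, here is a minimal sketch. The tokenizer, the stopword list, and the definition of the ratio (shared keywords divided by query keywords) are assumptions made for this example; the actual logic in `explain_service` may differ.

```python
import re

# Hypothetical stopword list; the project's real list is not shown in this README.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are"}

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric, non-stopword tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_overlap(query: str, document: str) -> tuple[list[str], float]:
    """Return the shared keywords (sorted) and the fraction of query keywords found."""
    q_terms = tokenize(query)
    d_terms = tokenize(document)
    shared = sorted(q_terms & d_terms)
    # Assumed definition: share of query keywords that also occur in the document.
    ratio = len(shared) / len(q_terms) if q_terms else 0.0
    return shared, ratio

shared, ratio = keyword_overlap(
    "benefits of solar energy",
    "Solar panels convert sunlight into energy for homes.",
)
print(shared, round(ratio, 2))
```

Here two of the three query keywords (`energy`, `solar`) appear in the document, so the ratio is 2/3.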
---
## 🔹 Evaluation Suite
A built-in evaluation workflow providing:
- **Accuracy**
- **MRR (Mean Reciprocal Rank)**
- **nDCG@K**
- Correct vs Incorrect queries
- Per-query detailed table
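The three metrics can be sketched as follows. This is not the code in `eval/evaluate.py`; it assumes each query has exactly one expected document, that "accuracy" means the expected document is ranked first, and that relevance is binary.

```python
import math

def reciprocal_rank(ranked: list[str], expected: str) -> float:
    """1 / rank of the expected document, or 0.0 if it was not retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc == expected:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], expected: str, k: int) -> float:
    """nDCG@k with binary relevance and one relevant document (ideal DCG is 1)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == expected:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def evaluate(runs: list[tuple[list[str], str]], k: int = 5) -> dict[str, float]:
    """Average top-1 accuracy, MRR and nDCG@k over (ranked_results, expected) pairs."""
    if not runs:
        raise ValueError("at least one query is required")
    n = len(runs)
    return {
        "accuracy": sum(ranked[:1] == [exp] for ranked, exp in runs) / n,
        "mrr": sum(reciprocal_rank(ranked, exp) for ranked, exp in runs) / n,
        "ndcg": sum(ndcg_at_k(ranked, exp, k) for ranked, exp in runs) / n,
    }

# Three illustrative queries: hit at rank 1, hit at rank 2, and a miss.
runs = [
    (["a.txt", "b.txt", "c.txt"], "a.txt"),
    (["b.txt", "a.txt", "c.txt"], "a.txt"),
    (["c.txt", "b.txt", "d.txt"], "a.txt"),
]
metrics = evaluate(runs, k=3)
print(metrics)
```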
---
# How Caching Works (MANDATORY SECTION)
Caching happens inside **`embed_service/cache_manager.py`**.
### Zero repeated embeddings
Each document is fingerprinted using:
- **filename**
- **MD5(cleaned_text)**
If the hash matches the one already stored for that filename:
- cached embedding is loaded instantly
- prevents costly re-embedding
- improves startup & query latency
### Cache Files:
- `cache/embed_meta.json`: mapping of filename → `{hash, index}`
- `cache/embeddings.npy`: matrix of all embeddings
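A minimal sketch of the cache lookup, assuming the metadata has the `filename → {hash, index}` shape described above. The helper names `fingerprint` and `split_cached` are hypothetical and are not taken from `cache_manager.py`.

```python
import hashlib

def fingerprint(cleaned_text: str) -> str:
    """MD5 hex digest of the cleaned document text."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()

def split_cached(docs: dict[str, str], meta: dict[str, dict]) -> tuple[dict[str, int], list[str]]:
    """Split documents into cache hits (filename -> embedding row) and files to re-embed."""
    hits: dict[str, int] = {}
    misses: list[str] = []
    for filename, text in docs.items():
        entry = meta.get(filename)
        if entry is not None and entry["hash"] == fingerprint(text):
            hits[filename] = entry["index"]   # reuse the row in embeddings.npy
        else:
            misses.append(filename)           # new or changed file: embed again
    return hits, misses

# meta mirrors the assumed shape of cache/embed_meta.json.
meta = {
    "a.txt": {"hash": fingerprint("hello world"), "index": 0},
    "c.txt": {"hash": fingerprint("old content"), "index": 1},
}
docs = {"a.txt": "hello world", "b.txt": "new document", "c.txt": "edited content"}
hits, misses = split_cached(docs, meta)
print(hits, misses)
```

Only `a.txt` is served from the cache: `b.txt` has no entry and the hash of `c.txt` no longer matches.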
### Benefits
- Startup: **5–10 seconds → <1 second**
- Low compute cost
- Ideal for Hugging Face Spaces
- Reproducible results: unchanged documents always reuse the same stored vectors
---
# FAISS Persistence (Warm Start Optimization)
This project persists both the embeddings and the FAISS index:
- `cache/embeddings.npy`
- `cache/embed_meta.json`
- `faiss_index.bin`
- `faiss_meta.pkl`
On startup, `search_service.indexer.try_load()` looks for a saved index:
- If found → loaded instantly.
- If not → the FAISS index is rebuilt from cached embeddings.
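The load-or-rebuild pattern can be sketched generically. This is not the body of `try_load()`: the `load_or_build` helper is hypothetical, and `pickle` stands in for the index serializer so the example runs without FAISS installed. With FAISS, `faiss.read_index` and `faiss.write_index` would take the roles of `load` and `save`.

```python
import os
import pickle
import tempfile
from typing import Callable

def load_or_build(path: str, build: Callable[[], object],
                  load: Callable[[str], object],
                  save: Callable[[object, str], None]) -> tuple[object, str]:
    """Return (index, source); source is 'disk' on a warm start, else 'rebuilt'."""
    if os.path.exists(path):
        return load(path), "disk"
    index = build()          # cold start: rebuild, e.g. from cached embeddings
    save(index, path)        # persist so the next start is warm
    return index, "rebuilt"

# Stand-in serializers for the sketch.
def pickle_load(path: str) -> object:
    with open(path, "rb") as f:
        return pickle.load(f)

def pickle_save(obj: object, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def build_index() -> dict:
    # Placeholder "index" holding one vector.
    return {"vectors": [[0.6, 0.8]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "faiss_index.bin")
    first = load_or_build(path, build_index, pickle_load, pickle_save)
    second = load_or_build(path, build_index, pickle_load, pickle_save)

print(first[1], second[1])
```

The first call builds and saves the index; the second call finds the file and loads it instead.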
### Why this matters?
- Makes FAISS behave like a **persistent vector database**
- Extremely important for **Docker**, **Spaces**, and **cold restarts**
- No time spent rebuilding large indexes on restart
---
# Folder Structure
```
├── src
│   ├── doc_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── utils.py
│   │
│   ├── embed_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── embedder.py
│   │   └── cache_manager.py
│   │
│   ├── search_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── indexer.py
│   │
│   ├── explain_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── explainer.py
│   │
│   ├── api_gateway
│   │   ├── __init__.py
│   │   └── app.py
│   │
│   └── ui
│       └── streamlit_app.py
│
├── .github
│   └── workflows
│       └── hf-space-deploy.yml   # GitHub Action → Deploy to Hugging Face Space
│
├── data
│   └── docs
│       └── (150 .txt documents from 10 categories 30 each, loaded directly into the HF Space)
│
├── cache
│   ├── embed_meta.json
│   ├── embeddings.npy
│   ├── faiss_index.bin
│   └── faiss_meta.pkl
│
├── eval
│   ├── evaluate.py
│   └── generated_queries.json
├── start.sh
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md
```
---
# How to Run Embedding Generation
Embeddings are generated automatically during initialization.
Pipeline:
1. **doc_service** → load + clean + hash
2. **embed_service** → create or load cached embeddings
3. **search_service** → FAISS index build or load
4. Return summary
---
# How to Start the API
All services are launched using:
```bash
bash start.sh
```
This starts:
- `9001` → doc_service
- `9002` → embed_service
- `9003` → search_service
- `9004` → explain_service
- `8000` → api_gateway
- `7860` → Streamlit UI
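To confirm that every service came up, a small check along these lines may help. The port numbers are the ones listed above; the host (`127.0.0.1`) and the helper names are assumptions for this sketch and are not part of `start.sh`.

```python
import socket

# Ports as listed above; running everything on localhost is an assumption.
SERVICE_PORTS = {
    "doc_service": 9001,
    "embed_service": 9002,
    "search_service": 9003,
    "explain_service": 9004,
    "api_gateway": 8000,
    "streamlit_ui": 7860,
}

def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def status_report() -> dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_listening(port) for name, port in SERVICE_PORTS.items()}

if __name__ == "__main__":
    for name, up in status_report().items():
        print(f"{name:16s} {'up' if up else 'down'}")
```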
---
## Architecture Overview
### High-level Flow
1. User asks a question in the **Streamlit UI**
2. UI sends the request → **API Gateway** `/search`
3. Gateway:
- Embeds query via **Embed Service**
- Searches FAISS via **Search Service**
- Fetches full doc text from **Doc Service**
- Gets explanation from **Explain Service**
4. Response returned to UI with:
- filename, score, preview, full text
- keyword overlap, overlap ratio
- top matching sentences
- optional LLM explanation
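The gateway's orchestration can be sketched as plain function composition. The four callables stand in for HTTP calls to the other services; their signatures, the `handle_search` name, and the 200-character preview length are assumptions made for this example rather than the actual `api_gateway` code.

```python
from typing import Callable

def handle_search(
    query: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[tuple[str, float]]],
    fetch_doc: Callable[[str], str],
    explain: Callable[[str, str], dict],
    top_k: int = 5,
) -> list[dict]:
    """Run the four gateway steps and merge their outputs into one result per hit."""
    vector = embed(query)                      # Embed Service
    hits = search(vector, top_k)               # Search Service: [(filename, score)]
    results = []
    for filename, score in hits:
        text = fetch_doc(filename)             # Doc Service
        result = {
            "filename": filename,
            "score": score,
            "preview": text[:200],             # assumed preview length
            "full_text": text,
        }
        result.update(explain(query, text))    # Explain Service
        results.append(result)
    return results

# Stub services so the flow can be exercised without any server running.
docs = {"solar.txt": "Solar panels convert sunlight into energy."}

def stub_explain(query: str, text: str) -> dict:
    words = set(text.lower().rstrip(".").split())
    return {"keyword_overlap": sorted(set(query.lower().split()) & words)}

results = handle_search(
    "solar energy",
    embed=lambda q: [1.0, 0.0],
    search=lambda vec, k: [("solar.txt", 0.91)][:k],
    fetch_doc=docs.__getitem__,
    explain=stub_explain,
)
print(results[0]["filename"], results[0]["keyword_overlap"])
```

Injecting the service calls keeps the orchestration testable on its own, which fits the service-isolation goal described in this README.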
---
## Design Choices
### 1️⃣ **Microservices instead of a Monolith**
- Real-world ML systems separate **indexing, embedding, routing, and inference**.
- Enables **independent scaling**, easier debugging, and service-level isolation.
---
### 2️⃣ **MiniLM Embeddings**
- **Fast on CPU** (optimized for lightweight inference)
- **Strong semantic quality** on sentences and short passages (inputs beyond the model's maximum sequence length are truncated)
- **Small model** → ideal for search engines, mobile, Spaces deployments
---
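### 3️⃣ **FAISS L2 on Normalized Embeddings**
L2 distance on unit-normalized vectors is used instead of a separate cosine computation because:
- **`IndexFlatL2` is an exact, well-optimized brute-force index**
- When vectors are normalized, squared L2 distance and cosine similarity are tied together by
  `‖a − b‖² = 2 − 2·cos(a, b)`, so both give the same ranking (the scores themselves differ)
- No separate cosine step is needed at query time beyond normalizing the vectors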
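The relationship between L2 and cosine on unit vectors can be checked numerically. The sketch below uses plain Python and made-up 3-dimensional vectors rather than FAISS or real MiniLM embeddings.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a: list[float], b: list[float]) -> float:
    """Inner product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative vectors only.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 2.1, 0.4]),
    "b": normalize([-1.0, 0.3, 2.0]),
    "c": normalize([0.2, 1.0, 1.0]),
}

# For unit vectors: ||q - d||^2 == 2 - 2 * cos(q, d)
identity_holds = all(
    abs(squared_l2(query, vec) - (2 - 2 * dot(query, vec))) < 1e-9
    for vec in docs.values()
)

by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))            # ascending distance
by_cos = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)    # descending similarity
print(identity_holds, by_l2, by_cos)
```

Both orderings agree, which is why an L2 index over normalized embeddings can serve as a cosine-similarity search.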
---
### 4️⃣ **Local Embedding Cache**
- Reduces startup time from **~5 seconds to <1 second**
- Prevents **re-embedding identical documents**
- Allows FAISS persistence to work smoothly
- Speeds up indexing
---
### 5️⃣ **FAISS Persistence (Warm Start Optimization)**
- Eliminates the need to rebuild the index on each startup
- Warm-loads the saved index at startup
- Ideal for Spaces & Docker environments
- Acts as a lightweight vector database
---
### 6️⃣ **LLM-Driven Explainability**
- Generates **human-friendly reasoning** that makes search results easier to interpret
- Explains **why a document matched your query**
- Combines:
  - Top semantic-matching sentences
  - Keyword overlap
  - Gemini’s natural-language reasoning
---
### 7️⃣ **Streamlit for Fast UI**
- Instant reload during development
- Clean layout
- Easy to extend (evaluation panel, metrics, expanders)