---
title: Document Search Engine
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: "0.0.0"
app_file: start.sh
pinned: false
---
# Multi-Document Semantic Search Engine
A **production-inspired multi-microservice semantic search system** built over 20+ text documents.
Designed with:
- **Sentence-Transformers** (`all-MiniLM-L6-v2`)
- **Local Embedding Cache**
- **FAISS Vector Search + Persistent Storage**
- **LLM-Driven Explanations (Gemini 2.5 Flash)**
- **Google-Gemini-Style Streamlit UI**
- **Real Microservice Architecture**
- **Full Evaluation Suite (Accuracy · MRR · nDCG)**
A complete end-to-end ML system demonstrating real-world architecture & search engineering.
---
# Features
## 🔹 Core Search
- Embedding-based semantic search over `.txt` documents
- FAISS `IndexFlatL2` on **normalized vectors** (≈ cosine similarity)
- Top-K ranking + similarity scores
- Keyword overlap, overlap ratio
- Top semantic sentences
- Full-text preview
---
## 🔹 Microservice Architecture (5 FastAPI Services)
Each component runs as an **independent microservice**, mirroring real production systems:
| Service | Responsibility |
|--------|----------------|
| **doc_service** | Load, clean, normalize, hash, and store documents |
| **embed_service** | MiniLM embedding generation + caching |
| **search_service** | FAISS index build, update, and vector search |
| **explain_service** | Keyword overlap, top sentences, LLM explanations |
| **api_gateway** | Orchestration: a clean unified API for the UI |
| **streamlit_ui** | Gemini-style user interface |

This separation supports **scalability**, **fault isolation**, and **independent service upgrades**, *like real enterprise ML platforms*.
---
## 🔹 Explanations
Every search result includes:
- **Keyword overlap**
- **Semantic overlap ratio**
- **Top relevant sentences (MiniLM sentence similarity)**
- **LLM-generated explanation**: “Why did this document match your query?”
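
A minimal sketch of how keyword overlap and an overlap ratio could be computed. The tokenizer, the stopword list, and the choice to divide by the number of query keywords are assumptions for illustration; `explain_service` may define these differently.

```python
import re

# Hypothetical stopword list; the real service may use a different one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "for"}

def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_overlap(query: str, document: str) -> tuple[list[str], float]:
    """Return shared keywords and the fraction of query keywords found in the document."""
    q, d = tokenize(query), tokenize(document)
    shared = sorted(q & d)
    ratio = len(shared) / len(q) if q else 0.0
    return shared, ratio

shared, ratio = keyword_overlap(
    "effects of climate change on agriculture",
    "Agriculture is increasingly shaped by a changing climate.",
)
print(shared, ratio)
```

Exact token matching misses inflected forms ("change" vs "changing"), which is one reason the results also carry embedding-based sentence similarity.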
---
## 🔹 Evaluation Suite
A built-in evaluation workflow providing:
- **Accuracy**
- **MRR (Mean Reciprocal Rank)**
- **nDCG@K**
- Correct vs Incorrect queries
- Per-query detailed table
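
For reference, the three metrics can be sketched as follows. This assumes one relevant document per query and defines accuracy as "the top-1 result is the relevant document"; both are assumptions, and `eval/evaluate.py` may use different definitions.

```python
import math

def reciprocal_rank(ranked: list[str], relevant: str) -> float:
    """1/rank of the relevant document, or 0.0 if it is not in the list."""
    for rank, doc in enumerate(ranked, start=1):
        if doc == relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: str, k: int) -> float:
    """Binary-relevance nDCG@K; with a single relevant document the ideal DCG is 1."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def evaluate(runs: list[tuple[list[str], str]], k: int = 5) -> dict[str, float]:
    """Average the per-query metrics over all (ranked results, relevant doc) pairs."""
    n = len(runs)
    return {
        "accuracy": sum(1 for ranked, rel in runs if ranked and ranked[0] == rel) / n,
        "mrr": sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n,
        "ndcg": sum(ndcg_at_k(ranked, rel, k) for ranked, rel in runs) / n,
    }

# Illustrative runs: relevant doc at rank 1, at rank 3, and missing.
runs = [
    (["a.txt", "b.txt", "c.txt"], "a.txt"),
    (["b.txt", "c.txt", "a.txt"], "a.txt"),
    (["b.txt", "c.txt", "d.txt"], "a.txt"),
]
metrics = evaluate(runs, k=5)
print(metrics)
```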
---
# How Caching Works (MANDATORY SECTION)
Caching happens inside **`embed_service/cache_manager.py`**.
### ✔ Zero repeated embeddings
Each document is fingerprinted using:
- **filename**
- **MD5(cleaned_text)**
If the hash matches the entry stored for that filename:
- the cached embedding is loaded instead of being recomputed
- costly re-embedding is avoided
- startup & query latency improve
### Cache Files:
- `cache/embed_meta.json` → mapping of filename → `{hash, index}`
- `cache/embeddings.npy` → matrix of all embeddings
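
The lookup described above can be sketched like this. The function names are hypothetical and do not come from `cache_manager.py`; only the MD5 fingerprint and the `filename → {hash, index}` layout are taken from this README.

```python
import hashlib

def fingerprint(cleaned_text: str) -> str:
    """MD5 hex digest of the cleaned document text."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()

def plan_embedding_work(docs: dict[str, str], meta: dict[str, dict]) -> tuple[list[str], list[str]]:
    """Split filenames into cache hits and files that need (re-)embedding.

    `meta` mirrors the layout described for cache/embed_meta.json:
    filename -> {"hash": <md5>, "index": <row in embeddings.npy>}.
    """
    hits, misses = [], []
    for filename, text in docs.items():
        entry = meta.get(filename)
        if entry is not None and entry["hash"] == fingerprint(text):
            hits.append(filename)    # reuse row entry["index"] of the matrix
        else:
            misses.append(filename)  # new file or changed content
    return hits, misses

# Illustrative data: a.txt is unchanged, b.txt was edited, c.txt is new.
meta = {
    "a.txt": {"hash": fingerprint("hello world"), "index": 0},
    "b.txt": {"hash": fingerprint("old text"), "index": 1},
}
docs = {"a.txt": "hello world", "b.txt": "new text", "c.txt": "brand new"}
hits, misses = plan_embedding_work(docs, meta)
print(hits, misses)
```

Hashing the cleaned text rather than the raw file means whitespace-only edits that the cleaning step removes do not invalidate the cache.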
### Benefits
- Startup: **5–10 seconds → <1 second**
- Low compute cost
- Ideal for Hugging Face Spaces
- Guarantees reproducible results
---
# FAISS Persistence (Warm Start Optimization)
This project persists both the embeddings and the FAISS index:
- `cache/embeddings.npy`
- `cache/embed_meta.json`
- `faiss_index.bin`
- `faiss_meta.pkl`
On startup, `search_service.indexer.try_load()` looks for the saved index:
- If found → it is loaded directly.
- If not → the FAISS index is rebuilt from the cached embeddings.
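
The load-or-rebuild pattern can be sketched as below. This is not the project's `indexer.py`: `pickle` stands in for FAISS serialization (FAISS itself provides `faiss.write_index` and `faiss.read_index`), and `load_or_rebuild` and `rebuild_from_cache` are hypothetical names.

```python
import os
import pickle
import tempfile

def try_load(index_path: str, meta_path: str):
    """Return (index, meta) if both files exist on disk, otherwise None."""
    if not (os.path.exists(index_path) and os.path.exists(meta_path)):
        return None
    with open(index_path, "rb") as f:
        index = pickle.load(f)
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    return index, meta

def load_or_rebuild(index_path: str, meta_path: str, rebuild):
    """Warm-load if possible; otherwise rebuild, persist, and return."""
    loaded = try_load(index_path, meta_path)
    if loaded is not None:
        return loaded, "warm"
    index, meta = rebuild()
    with open(index_path, "wb") as f:
        pickle.dump(index, f)
    with open(meta_path, "wb") as f:
        pickle.dump(meta, f)
    return (index, meta), "rebuilt"

def rebuild_from_cache():
    # Stand-in for rebuilding the FAISS index from cache/embeddings.npy.
    vectors = [[0.1, 0.9], [0.8, 0.2]]
    return vectors, ["a.txt", "b.txt"]

with tempfile.TemporaryDirectory() as tmp:
    paths = (os.path.join(tmp, "faiss_index.bin"), os.path.join(tmp, "faiss_meta.pkl"))
    _, first = load_or_rebuild(*paths, rebuild_from_cache)              # cold start
    (index, meta), second = load_or_rebuild(*paths, rebuild_from_cache)  # warm start
print(first, second)
```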
### Why this matters
- Makes FAISS behave like a **persistent vector database**
- Extremely important for **Docker**, **Spaces**, and **cold restarts**
- Avoids rebuilding large indexes on every restart
---
# Folder Structure
```
├── .github
│   └── workflows
│       └── hf-space-deploy.yml   # GitHub Action → Deploy to Hugging Face Space
│
├── src
│   ├── doc_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── utils.py
│   │
│   ├── embed_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── embedder.py
│   │   └── cache_manager.py
│   │
│   ├── search_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── indexer.py
│   │
│   ├── explain_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── explainer.py
│   │
│   ├── api_gateway
│   │   ├── __init__.py
│   │   └── app.py
│   │
│   └── ui
│       └── streamlit_app.py
│
├── data
│   └── docs
│       └── (150 .txt documents from 10 categories, 30 each, loaded directly into the HF Space)
│
├── cache
│   ├── embed_meta.json
│   ├── embeddings.npy
│   ├── faiss_index.bin
│   └── faiss_meta.pkl
│
├── eval
│   ├── evaluate.py
│   └── generated_queries.json
│
├── start.sh
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md
```
---
# How to Run Embedding Generation
Embeddings are generated automatically during initialization. The pipeline:
1. **doc_service** → load + clean + hash
2. **embed_service** → create or load cached embeddings
3. **search_service** → FAISS index build or load
4. Return summary
---
# How to Start the API
All services are launched using:
```bash
bash start.sh
```
This starts:
- `9001` → doc_service
- `9002` → embed_service
- `9003` → search_service
- `9004` → explain_service
- `8000` → api_gateway
- `7860` → Streamlit UI
---
## Architecture Overview
### High-level Flow
1. User asks a question in the **Streamlit UI**
2. UI sends the request → **API Gateway** `/search`
3. Gateway:
   - Embeds query via **Embed Service**
   - Searches FAISS via **Search Service**
   - Fetches full doc text from **Doc Service**
   - Gets explanation from **Explain Service**
4. Response returned to UI with:
   - filename, score, preview, full text
   - keyword overlap, overlap ratio
   - top matching sentences
   - optional LLM explanation
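
The gateway's orchestration step might look roughly like the sketch below. The four services are passed in as plain callables so the example runs without any network; `handle_search`, its parameters, and the response field names are hypothetical, and the real gateway presumably calls the services over HTTP.

```python
from typing import Callable

def handle_search(
    query: str,
    top_k: int,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[tuple[str, float]]],
    fetch_doc: Callable[[str], str],
    explain: Callable[[str, str], dict],
    preview_chars: int = 200,
) -> list[dict]:
    """Run the four gateway steps and merge their outputs per result."""
    query_vec = embed(query)                    # 1. Embed Service
    hits = search(query_vec, top_k)             # 2. Search Service
    results = []
    for filename, score in hits:
        text = fetch_doc(filename)              # 3. Doc Service
        explanation = explain(query, text)      # 4. Explain Service
        results.append({
            "filename": filename,
            "score": score,
            "preview": text[:preview_chars],
            "full_text": text,
            **explanation,
        })
    return results

# In-memory stand-ins for the four services, for illustration only.
fake_docs = {"solar.txt": "Solar panels convert sunlight into electricity."}
results = handle_search(
    "how do solar panels work",
    top_k=1,
    embed=lambda q: [float(len(q))],
    search=lambda vec, k: [("solar.txt", 0.87)][:k],
    fetch_doc=fake_docs.__getitem__,
    explain=lambda q, text: {"keyword_overlap": ["panels", "solar"], "overlap_ratio": 0.4},
)
print(results[0]["filename"], results[0]["score"])
```

Injecting the service calls keeps the orchestration logic testable in isolation, which fits the fault-isolation goal of the microservice split.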
---
## Design Choices
### 1️⃣ **Microservices Instead of a Monolith**
- Real-world ML systems separate **indexing, embedding, routing, and inference**.
- Enables **independent scaling**, easier debugging, and service-level isolation.
---
### 2️⃣ **MiniLM Embeddings**
- **Fast on CPU** (optimized for lightweight inference)
- **High semantic quality** for short & long text
- **Small model** → ideal for search engines, mobile, Spaces deployments
---
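### 3️⃣ **FAISS L2 on Normalized Embeddings**
L2 distance is used instead of a separate cosine computation because:
- **FAISS `IndexFlatL2`** is a simple, exact, well-optimized index
- When vectors are L2-normalized, squared L2 distance and cosine similarity are related by `‖a − b‖² = 2 − 2·cos(a, b)`, so ranking by L2 distance gives the same order as ranking by cosine similarity (the scores themselves differ)
- No separate cosine step is needed at search time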
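
A small pure-Python check of the relationship between L2 distance and cosine similarity on unit vectors: squared L2 distance equals `2 − 2·cos`, so both produce the same ranking. The vectors are made-up illustration data, not project embeddings.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 2.0, 0.4]),
    "b": normalize([0.0, 1.0, 0.0]),
    "c": normalize([-1.0, 0.0, 1.0]),
}

# Smaller L2 distance is better; larger cosine similarity is better.
by_l2 = sorted(docs, key=lambda n: squared_l2(query, docs[n]))
by_cos = sorted(docs, key=lambda n: -cosine(query, docs[n]))

# Check the identity ||q - d||^2 = 2 - 2*cos(q, d) for every document.
identity_holds = all(
    abs(squared_l2(query, d) - (2 - 2 * cosine(query, d))) < 1e-9
    for d in docs.values()
)
print(by_l2, by_cos, identity_holds)
```

Because the mapping from cosine to squared L2 is monotonically decreasing, the top-K set and its order are unchanged; only the reported score needs converting if a cosine value is wanted in the UI.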
---
### 4️⃣ **Local Embedding Cache**
- Reduces startup time from **~5 seconds → <1 second**
- Prevents **re-embedding identical documents**
- Allows FAISS persistence to work smoothly
- Speeds up indexing
---
### 5️⃣ **FAISS Persistence (Warm Start Optimization)**
- Eliminates the need to rebuild the index on each startup
- Warm-loads the saved index at startup
- Ideal for Spaces & Docker environments
- Acts as a lightweight vector database
---
### 6️⃣ **LLM-Driven Explainability**
- Generates **human-friendly reasoning** that makes search results easier to interpret
- Explains **why a document matched your query**
- Combines:
  - Top semantic-matching sentences
  - Keyword overlap
  - Gemini’s natural-language reasoning
---
### 7️⃣ **Streamlit for Fast UI**
- Instant reload during development
- Clean layout
- Easy to extend (evaluation panel, metrics, expanders)