	---
	title: Document Search Engine
	emoji: 📄
	colorFrom: blue
	colorTo: purple
	sdk: docker
	sdk_version: "0.0.0"
	app_file: start.sh
	pinned: false
	---

	# Multi-Document Semantic Search Engine
	A production-inspired multi-microservice semantic search system built over 20+ text documents.

	Designed with:

	- Sentence-Transformers (`all-MiniLM-L6-v2`)
	- Local Embedding Cache
	- FAISS Vector Search + Persistent Storage
	- LLM-Driven Explanations (Gemini 2.5 Flash)
	- Google-Gemini-Style Streamlit UI
	- Real Microservice Architecture
	- Full Evaluation Suite (Accuracy · MRR · nDCG)

	A complete end-to-end ML system demonstrating real-world architecture & search engineering.

	---

	# Features

	## 🔹 Core Search

	- Embedding-based semantic search over `.txt` documents
	- FAISS `IndexFlatL2` on normalized vectors (≈ cosine similarity)
	- Top-K ranking + similarity scores
	- Keyword overlap, overlap ratio
	- Top semantic sentences
	- Full-text preview

	---

	## 🔹 Microservice Architecture (5 FastAPI Services)

	Each backend component runs as an independent FastAPI microservice, with a separate Streamlit front end, mirroring real production systems:

	| Service | Responsibility |
	|--------|----------------|
	| doc_service | Load, clean, normalize, hash, and store documents |
	| embed_service | MiniLM embedding generation + caching |
	| search_service | FAISS index build, update, and vector search |
	| explain_service | Keyword overlap, top sentences, LLM explanations |
	| api_gateway | Orchestration: a clean unified API for the UI |
	| streamlit_ui | Gemini-style user interface |

	This separation supports scalability, fault isolation, and independent service upgrades, as in enterprise ML platforms.

	---

	## 🔹 Explanations

	Every search result includes:

	- Keyword overlap
	- Semantic overlap ratio
	- Top relevant sentences (MiniLM sentence similarity)
	- LLM-generated explanation: “Why did this document match your query?”
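
The keyword signals above can be illustrated with a minimal sketch. The tokenizer, the function names, and the definition of the overlap ratio (shared query tokens divided by total query tokens) are assumptions made for illustration, not the project's actual `explainer.py` logic.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, document: str) -> tuple[list[str], float]:
    """Return the shared keywords (sorted) and the overlap ratio."""
    query_tokens = tokenize(query)
    shared = query_tokens & tokenize(document)
    # Assumed definition: fraction of query tokens that also occur in the document.
    ratio = len(shared) / len(query_tokens) if query_tokens else 0.0
    return sorted(shared), ratio


keywords, ratio = keyword_overlap(
    "vector search with FAISS",
    "FAISS performs fast vector similarity search.",
)
print(keywords, ratio)
```

A real implementation would likely also drop stop words such as "with" before computing the ratio.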

	---

	## 🔹 Evaluation Suite
	A built-in evaluation workflow providing:

	- Accuracy
	- MRR (Mean Reciprocal Rank)
	- nDCG@K
	- Correct vs Incorrect queries
	- Per-query detailed table


	---

	# How Caching Works

	Caching happens inside `embed_service/cache_manager.py`.

	### ✔ Zero repeated embeddings
	Each document is fingerprinted using:

	- filename
	- MD5(cleaned_text)

	If the hash matches a previously stored file:
	- cached embedding is loaded instantly
	- prevents costly re-embedding
	- improves startup & query latency

	### Cache Files:
	- `cache/embed_meta.json` → mapping of filename → `{hash, index}`
	- `cache/embeddings.npy` → matrix of all embeddings
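
The lookup described above can be sketched as follows. This is an illustration under assumptions, not the contents of `cache_manager.py`: the function names are hypothetical, a plain Python list stands in for the `embeddings.npy` matrix, and `embed_fn` stands in for the MiniLM model.

```python
import hashlib


def fingerprint(cleaned_text: str) -> str:
    """MD5 hex digest of the cleaned document text."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()


def get_embedding(filename, cleaned_text, meta, embeddings, embed_fn):
    """Return (embedding, was_cached), updating the cache on a miss.

    meta maps filename -> {"hash": ..., "index": ...}, mirroring embed_meta.json;
    embeddings is a list of vectors standing in for embeddings.npy.
    """
    digest = fingerprint(cleaned_text)
    entry = meta.get(filename)
    if entry is not None and entry["hash"] == digest:
        return embeddings[entry["index"]], True  # cache hit: no re-embedding

    vector = embed_fn(cleaned_text)
    if entry is not None:
        # Known file whose content changed: overwrite its existing row.
        embeddings[entry["index"]] = vector
        entry["hash"] = digest
    else:
        meta[filename] = {"hash": digest, "index": len(embeddings)}
        embeddings.append(vector)
    return vector, False


calls = []


def fake_embed(text):
    """Stand-in embedder that records how often it is invoked."""
    calls.append(text)
    return [float(len(text))]


meta, embeddings = {}, []
first = get_embedding("a.txt", "hello", meta, embeddings, fake_embed)
second = get_embedding("a.txt", "hello", meta, embeddings, fake_embed)
third = get_embedding("a.txt", "hello world", meta, embeddings, fake_embed)
print(first, second, third, len(calls))
```

Keying on the filename and comparing the hash means that an edited file is re-embedded while an unchanged file is served from the cache.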

	### Benefits
	- Startup: 5–10 seconds → <1 second
	- Low compute cost
	- Ideal for Hugging Face Spaces
	- Reproducible results: unchanged text always reuses the same stored embedding

	---

	# FAISS Persistence (Warm Start Optimization)

	This project persists both the embeddings and the FAISS index:

	- `cache/embeddings.npy`
	- `cache/embed_meta.json`
	- `faiss_index.bin`
	- `faiss_meta.pkl`

	On startup, `search_service.indexer.try_load()` looks for a saved index:

	- If found → the index is loaded directly from disk.
	- If not → the FAISS index is rebuilt from cached embeddings.
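
The load-or-rebuild pattern can be sketched generically. The code below is not the project's `try_load()`; `load_or_rebuild` and its parameters are hypothetical, and `pickle` is used as a stand-in so the example runs without FAISS installed. With FAISS, `faiss.read_index` and `faiss.write_index` could be passed as the reader and writer.

```python
import os
import pickle
import tempfile


def load_or_rebuild(index_path, read_index, build_index, write_index):
    """Return (index, source), where source is "loaded" or "rebuilt"."""
    if os.path.exists(index_path):
        return read_index(index_path), "loaded"  # warm start
    index = build_index()  # cold start: rebuild from cached embeddings
    write_index(index, index_path)  # persist for the next restart
    return index, "rebuilt"


def write_pickle(obj, path):
    """Stand-in writer with the same argument order as faiss.write_index."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def read_pickle(path):
    """Stand-in reader with the same argument order as faiss.read_index."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "faiss_index.bin")
    build = lambda: [[0.1, 0.2], [0.3, 0.4]]  # placeholder for a real index
    first_index, first_source = load_or_rebuild(path, read_pickle, build, write_pickle)
    second_index, second_source = load_or_rebuild(path, read_pickle, build, write_pickle)

print(first_source, second_source)
```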

	### Why this matters
	- Makes FAISS behave like a persistent vector database
	- Especially important for Docker, Spaces, and cold restarts
	- Avoids the delay of rebuilding large indexes on restart

	---

	# Folder Structure
	```
	├── .github
	│   └── workflows
	│       └── hf-space-deploy.yml   # GitHub Action → Deploy to Hugging Face Space
	│
	├── src
	│   ├── doc_service
	│   │   ├── __init__.py
	│   │   ├── app.py
	│   │   └── utils.py
	│   │
	│   ├── embed_service
	│   │   ├── __init__.py
	│   │   ├── app.py
	│   │   ├── embedder.py
	│   │   └── cache_manager.py
	│   │
	│   ├── search_service
	│   │   ├── __init__.py
	│   │   ├── app.py
	│   │   └── indexer.py
	│   │
	│   ├── explain_service
	│   │   ├── __init__.py
	│   │   ├── app.py
	│   │   └── explainer.py
	│   │
	│   ├── api_gateway
	│   │   ├── __init__.py
	│   │   └── app.py
	│   │
	│   └── ui
	│       └── streamlit_app.py
	│
	├── data
	│   └── docs
	│       └── (150 .txt documents from 10 categories 30 each directly loaded into HF spaces)
	│
	├── cache
	│   ├── embed_meta.json
	│   ├── embeddings.npy
	│   ├── faiss_index.bin
	│   └── faiss_meta.pkl
	│
	├── eval
	│   ├── evaluate.py
	│   └── generated_queries.json
	│
	├── start.sh
	├── Dockerfile
	├── requirements.txt
	├── .gitignore
	└── README.md
	```
	---

	# How to Run Embedding Generation

	Embeddings are generated automatically during initialization.

	Pipeline:

	1. doc_service → load + clean + hash
	2. embed_service → create or load cached embeddings
	3. search_service → FAISS index build or load
	4. Return summary


	---

	# How to Start the API

	All services are launched using:

	```bash
	bash start.sh
	```

	This starts:

	- 9001 → doc_service
	- 9002 → embed_service
	- 9003 → search_service
	- 9004 → explain_service
	- 8000 → api_gateway
	- 7860 → Streamlit UI

	---

	## Architecture Overview

	### High-level Flow

	1. User asks a question in the Streamlit UI
	2. UI sends the request → API Gateway `/search`
	3. Gateway:
	   - Embeds the query via Embed Service
	   - Searches FAISS via Search Service
	   - Fetches full doc text from Doc Service
	   - Gets an explanation from Explain Service
	4. Response returned to the UI with:
	   - filename, score, preview, full text
	   - keyword overlap, overlap ratio
	   - top matching sentences
	   - optional LLM explanation
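
The gateway flow above can be sketched as a single orchestration function. Every name here (`handle_search`, the callables, the response keys, the 200-character preview) is a hypothetical illustration. In the real gateway each callable would presumably be an HTTP call to the corresponding service, and the stubs below only make the example runnable.

```python
from typing import Callable

Vector = list[float]


def handle_search(
    query: str,
    top_k: int,
    embed: Callable[[str], Vector],
    vector_search: Callable[[Vector, int], list[tuple[str, float]]],
    fetch_doc: Callable[[str], str],
    explain: Callable[[str, str], dict],
) -> list[dict]:
    """Embed the query, search, then enrich each hit with text and explanation."""
    query_vector = embed(query)               # step 1: Embed Service
    hits = vector_search(query_vector, top_k) # step 2: Search Service
    results = []
    for filename, score in hits:
        text = fetch_doc(filename)            # step 3: Doc Service
        results.append({
            "filename": filename,
            "score": score,
            "preview": text[:200],
            "full_text": text,
            "explanation": explain(query, text),  # step 4: Explain Service
        })
    return results


# Stub services so the sketch runs without any network calls.
docs = {
    "faiss.txt": "FAISS is a library for vector search.",
    "cats.txt": "Cats sleep most of the day.",
}
results = handle_search(
    "vector search",
    1,
    embed=lambda q: [float(len(q))],
    vector_search=lambda vec, k: [("faiss.txt", 0.12), ("cats.txt", 1.4)][:k],
    fetch_doc=docs.__getitem__,
    explain=lambda q, text: {
        "keyword_overlap": sorted(set(q.split()) & set(text.lower().rstrip(".").split()))
    },
)
print(results[0]["filename"], results[0]["explanation"])
```

Passing the services in as callables keeps the orchestration logic testable on its own, independent of how each service is reached.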



	---

	## Design Choices

	### 1️⃣ Microservices Instead of a Monolith
	- Real-world ML systems separate indexing, embedding, routing, and inference.
	- Enables independent scaling, easier debugging, and service-level isolation.


	---

	### 2️⃣ MiniLM Embeddings
	- Fast on CPU (optimized for lightweight inference)
	- Good semantic quality on sentences and short passages; longer inputs are truncated at the model's token limit (256 word pieces by default)
	- Small model → ideal for search engines, mobile, Spaces deployments

	---

	### 3️⃣ FAISS L2 on Normalized Embeddings
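	L2 distance is used instead of an explicit cosine metric because:

	- FAISS `IndexFlatL2` is a simple, exact index with no training step
	- When vectors are normalized to unit length, `‖a − b‖² = 2 − 2·cos(a, b)`, so ranking by L2 distance gives the same order as ranking by cosine similarity
	- No separate cosine computation is needed at query time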
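
The relationship between L2 distance and cosine similarity on unit vectors can be checked numerically. The sketch below uses plain Python rather than FAISS, and the vectors are made up for illustration. Note that FAISS `IndexFlatL2` reports squared L2 distances, which is the quantity compared here.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_unit(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit vectors, which is just their dot product."""
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 2.0, 3.0])
docs = {
    "a": normalize([1.0, 2.0, 2.5]),
    "b": normalize([3.0, -1.0, 0.5]),
    "c": normalize([0.0, 1.0, 1.0]),
}

# For unit vectors: squared L2 distance = 2 - 2 * cosine similarity.
for vec in docs.values():
    assert math.isclose(squared_l2(query, vec), 2 - 2 * cosine_unit(query, vec), abs_tol=1e-12)

# Smallest distance first versus largest similarity first: same order.
by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: -cosine_unit(query, docs[name]))
print(by_l2, by_cosine)
```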

	---

	### 4️⃣ Local Embedding Cache
	- Reduces startup time from ~5 seconds to <1 second
	- Prevents re-embedding identical documents
	- Allows FAISS persistence to work smoothly
	- Speeds up indexing

	---

	### 5️⃣ FAISS Persistence (Warm Start Optimization)
	- Eliminates the need to rebuild index on each startup
	- Warm-loads instantly at startup
	- Ideal for Spaces & Docker environments
	- Acts as a lightweight vector database

	---

	### 6️⃣ LLM-Driven Explainability
	- Generates human-readable reasoning that makes search results easier to interpret
	- Explains why a document matched your query
	- Combines:
	  - Top semantic-matching sentences
	  - Keyword overlap
	  - Gemini’s natural-language reasoning

	---

	### 7️⃣ Streamlit for Fast UI
	- Instant reload during development
	- Clean layout
	- Easy to extend (evaluation panel, metrics, expanders)