---
title: Document Search Engine
emoji: πŸ“„
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 0.0.0
app_file: start.sh
pinned: false
---

Multi-Document Semantic Search Engine

A production-inspired, multi-microservice semantic search system built over a corpus of 150 text documents.

Designed with:

  • Sentence-Transformers (all-MiniLM-L6-v2)
  • Local Embedding Cache
  • FAISS Vector Search + Persistent Storage
  • LLM-Driven Explanations (Gemini 2.5 Flash)
  • Google-Gemini-Style Streamlit UI
  • Real Microservice Architecture
  • Full Evaluation Suite (Accuracy Β· MRR Β· nDCG)

A complete end-to-end ML system demonstrating real-world architecture & search engineering.


Features

πŸ”Ή Core Search

  • Embedding-based semantic search over .txt documents
  • FAISS IndexFlatL2 on normalized vectors (β‰ˆ cosine similarity)
  • Top-K ranking + similarity scores
  • Keyword overlap, overlap ratio
  • Top semantic sentences
  • Full-text preview
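The ranking step above can be sketched with a plain-numpy equivalent of what FAISS's `IndexFlatL2` computes over normalized vectors (the function names here are illustrative, not the project's actual API):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so L2 distance ranks the same way as cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k_search(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int = 3):
    """Brute-force equivalent of FAISS IndexFlatL2 search.

    Returns (indices, squared L2 distances) of the k nearest documents.
    """
    dists = np.sum((doc_vecs - query_vec) ** 2, axis=1)  # squared L2, as FlatL2 reports
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# Toy corpus: 4 "embeddings" in 3-D, normalized as the real pipeline does.
docs = normalize(np.array([[1.0, 0.0, 0.0],
                           [0.9, 0.1, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]]))
query = normalize(np.array([1.0, 0.05, 0.0]))
idx, d = top_k_search(docs, query, k=2)
```

The returned distances double as the similarity scores shown in the UI (smaller distance = closer match).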

πŸ”Ή Microservice Architecture (5 FastAPI Services + Streamlit UI)

Each component runs as an independent microservice, mirroring real production systems:

| Service | Responsibility |
|---|---|
| doc_service | Load, clean, normalize, hash, and store documents |
| embed_service | MiniLM embedding generation + caching |
| search_service | FAISS index build, update, and vector search |
| explain_service | Keyword overlap, top sentences, LLM explanations |
| api_gateway | Orchestration: a clean unified API for the UI |
| streamlit_ui | Gemini-style user interface |

This separation supports scalability, fault isolation, and independent service upgrades β€” like real enterprise ML platforms.


πŸ”Ή Explanations

Every search result includes:

  • Keyword overlap
  • Semantic overlap ratio
  • Top relevant sentences (MiniLM sentence similarity)
  • LLM-generated explanation:
    β€œWhy did this document match your query?”
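The keyword-overlap part of an explanation can be sketched as follows (a minimal stand-in for the logic in explain_service; the stopword list and tokenizer are illustrative):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def tokens(text: str) -> set:
    """Lowercased word tokens with basic stopword filtering."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_overlap(query: str, doc: str):
    """Shared keywords and the fraction of query terms covered by the document."""
    q, d = tokens(query), tokens(doc)
    shared = q & d
    ratio = len(shared) / len(q) if q else 0.0
    return sorted(shared), ratio

shared, ratio = keyword_overlap(
    "neural network training",
    "Training a neural network requires data and compute.",
)
```

The shared keywords, overlap ratio, and top MiniLM-scored sentences are then passed to the LLM as grounding for the natural-language explanation.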

πŸ”Ή Evaluation Suite

A built-in evaluation workflow providing:

  • Accuracy
  • MRR (Mean Reciprocal Rank)
  • nDCG@K
  • Correct vs Incorrect queries
  • Per-query detailed table
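The two ranking metrics can be computed as sketched below (binary relevance; this mirrors the standard definitions, not necessarily the exact code in eval/evaluate.py):

```python
import math

def mrr(ranked_relevance: list) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels: list, k: int) -> float:
    """nDCG@K: DCG of the observed ranking divided by DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Two queries: relevant doc at rank 1 and at rank 2 -> MRR = (1 + 1/2) / 2
score = mrr([[1, 0, 0], [0, 1, 0]])
n = ndcg_at_k([0, 1, 0], 3)
```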

How Caching Works

Caching happens inside embed_service/cache_manager.py.

βœ” Zero repeated embeddings

Each document is fingerprinted using:

  • filename
  • MD5(cleaned_text)

If the hash matches a previously stored file:

  • cached embedding is loaded instantly
  • prevents costly re-embedding
  • improves startup & query latency

Cache Files:

  • cache/embed_meta.json β†’ mapping of filename β†’ {hash, index}
  • cache/embeddings.npy β†’ matrix of all embeddings
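The fingerprint-and-lookup logic can be sketched like this (a minimal stand-in for cache_manager.py; `lookup` and the meta layout shown are illustrative):

```python
import hashlib

def fingerprint(cleaned_text: str) -> str:
    """MD5 of the cleaned document text, used as a change-detection fingerprint."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()

def lookup(meta: dict, filename: str, cleaned_text: str):
    """Return the cached embedding row index if the file is unchanged, else None."""
    entry = meta.get(filename)
    if entry and entry["hash"] == fingerprint(cleaned_text):
        return entry["index"]          # reuse the cached row in embeddings.npy
    return None                        # new or edited document -> must re-embed

# Simulated embed_meta.json content
meta = {"doc1.txt": {"hash": fingerprint("hello world"), "index": 0}}
hit = lookup(meta, "doc1.txt", "hello world")    # unchanged -> cache hit
miss = lookup(meta, "doc1.txt", "hello there")   # edited -> cache miss
```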

Benefits

  • Startup: 5–10 seconds β†’ <1 second
  • Low compute cost
  • Ideal for Hugging Face Spaces
  • Guarantees reproducible results

FAISS Persistence (Warm Start Optimization)

This project saves BOTH embeddings and FAISS index:

  • cache/embeddings.npy
  • cache/embed_meta.json
  • faiss_index.bin
  • faiss_meta.pkl

On startup, search_service.indexer.try_load() is called:

  • If the persisted index is found → it is loaded instantly.
  • If not → the FAISS index is rebuilt from the cached embeddings.
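The warm-start pattern can be sketched as below, with a dict standing in for the FAISS index so the sketch stays dependency-light (the real indexer would call faiss.read_index / faiss.write_index on faiss_index.bin):

```python
import os
import tempfile
import numpy as np

def build_index(embeddings: np.ndarray):
    """Stand-in for building an index; real code creates IndexFlatL2 and add()s vectors."""
    return {"ntotal": len(embeddings)}

def try_load(idx_path: str, emb_path: str):
    """Warm start: load the persisted index if present, else rebuild from cached embeddings."""
    if os.path.exists(idx_path):
        return "warm"                              # real code: faiss.read_index(idx_path)
    if os.path.exists(emb_path):
        return build_index(np.load(emb_path))      # rebuild from embeddings.npy
    return None                                    # cold start: nothing cached yet

with tempfile.TemporaryDirectory() as d:
    emb_path = os.path.join(d, "embeddings.npy")
    idx_path = os.path.join(d, "faiss_index.bin")
    cold = try_load(idx_path, emb_path)                           # nothing cached
    np.save(emb_path, np.zeros((150, 384), dtype="float32"))      # 384 = MiniLM dim
    rebuilt = try_load(idx_path, emb_path)                        # rebuilt from cache
```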

Why this matters:

  • Makes FAISS behave like a persistent vector database
  • Critical for Docker, Spaces, and cold restarts
  • Avoids the delay of rebuilding large indexes on every restart

Folder Structure

```
β”œβ”€β”€ .github
β”‚   └── workflows
β”‚       └── hf-space-deploy.yml   # GitHub Action β†’ deploy to Hugging Face Space
β”œβ”€β”€ src
β”‚   β”œβ”€β”€ doc_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── utils.py
β”‚   β”œβ”€β”€ embed_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   β”œβ”€β”€ embedder.py
β”‚   β”‚   └── cache_manager.py
β”‚   β”œβ”€β”€ search_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── indexer.py
β”‚   β”œβ”€β”€ explain_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── explainer.py
β”‚   β”œβ”€β”€ api_gateway
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── app.py
β”‚   └── ui
β”‚       └── streamlit_app.py
β”œβ”€β”€ data
β”‚   └── docs
β”‚       └── (150 .txt documents across 10 categories, loaded directly into the Space)
β”œβ”€β”€ cache
β”‚   β”œβ”€β”€ embed_meta.json
β”‚   β”œβ”€β”€ embeddings.npy
β”‚   β”œβ”€β”€ faiss_index.bin
β”‚   └── faiss_meta.pkl
β”œβ”€β”€ eval
β”‚   β”œβ”€β”€ evaluate.py
β”‚   └── generated_queries.json
β”œβ”€β”€ start.sh
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .gitignore
└── README.md
```

How to Run Embedding Generation

Embeddings are generated automatically during initialization.

Pipeline:

  1. doc_service β†’ load + clean + hash
  2. embed_service β†’ create or load cached embeddings
  3. search_service β†’ FAISS index build or load
  4. Return summary
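The pipeline above can be sketched end-to-end with a stubbed embedder (the real system calls MiniLM via embed_service; `clean`, `embed`, and `initialize` here are illustrative names):

```python
import hashlib
import numpy as np

def clean(text: str) -> str:
    """doc_service step: normalize whitespace and case."""
    return " ".join(text.lower().split())

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stub for MiniLM: deterministic pseudo-embedding seeded from the text hash."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim).astype("float32")
    return v / np.linalg.norm(v)

def initialize(raw_docs: dict, cache: dict) -> np.ndarray:
    """Pipeline: clean + hash -> embed (or reuse cache) -> stack matrix for indexing."""
    rows = []
    for name, raw in raw_docs.items():
        text = clean(raw)
        h = hashlib.md5(text.encode()).hexdigest()
        if name not in cache or cache[name][0] != h:
            cache[name] = (h, embed(text))       # re-embed only when content changed
        rows.append(cache[name][1])
    return np.stack(rows)

cache = {}
mat = initialize({"a.txt": "Hello  World", "b.txt": "FAISS search"}, cache)
# Same cleaned content on restart -> every document is a cache hit.
again = initialize({"a.txt": "hello world", "b.txt": "FAISS search"}, cache)
```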

How to Start the API

All services are launched with:

```bash
bash start.sh
```

This starts:

  • 9001 → doc_service
  • 9002 → embed_service
  • 9003 → search_service
  • 9004 → explain_service
  • 8000 → api_gateway
  • 7860 → Streamlit UI

Architecture Overview

High-level Flow

  1. User asks a question in Streamlit UI
  2. UI sends request β†’ API Gateway /search
  3. Gateway:
    • Embeds query via Embed Service
    • Searches FAISS via Search Service
    • Fetches full doc text from Doc Service
    • Gets explanation from Explain Service
  4. Response returned to UI with:
    • filename, score, preview, full text
    • keyword overlap, overlap ratio
    • top matching sentences
    • optional LLM explanation

Design Choices

1️⃣ Microservices instead of Monolithic

  • Real-world ML systems separate indexing, embedding, routing, and inference.
  • Enables independent scaling, easier debugging, and service-level isolation.

2️⃣ MiniLM Embeddings

  • Fast on CPU (optimized for lightweight inference)
  • High semantic quality for short & long text
  • Small model β†’ ideal for search engines, mobile, Spaces deployments

3️⃣ FAISS L2 on Normalized Embeddings

L2 distance is used instead of an explicit cosine index because:

  • FAISS IndexFlatL2 is simple and heavily optimized
  • On unit-normalized vectors, L2 distance is a monotonic function of cosine similarity, so both produce identical rankings
  • Avoids maintaining a separate cosine/inner-product code path
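The equivalence is easy to verify numerically: for unit vectors a and b, ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b = 2 − 2·cos(a, b), so sorting by L2 distance and by cosine similarity yields the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)   # 384 = MiniLM embedding dimension
b = rng.standard_normal(384)
a /= np.linalg.norm(a)         # normalize, as the pipeline does before indexing
b /= np.linalg.norm(b)

l2_sq = float(np.sum((a - b) ** 2))
cos = float(a @ b)
# For unit vectors: ||a - b||^2 == 2 - 2*cos(a, b)
```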

4️⃣ Local Embedding Cache

  • Reduces startup time from ~5 seconds → <1 second
  • Prevents re-embedding identical documents
  • Allows FAISS persistence to work smoothly

5️⃣ FAISS Persistence (Warm Start Optimization)

  • Eliminates the need to rebuild index on each startup
  • Warm-loads instantly at startup
  • Ideal for Spaces & Docker environments
  • Acts as a lightweight vector database

6️⃣ LLM-Driven Explainability

  • Generates human-friendly reasoning, making search results more interpretable
  • Explains why a document matched your query
  • Combines:
    • Top semantic-matching sentences
    • Keyword overlap
    • Gemini’s natural-language reasoning

7️⃣ Streamlit for Fast UI

  • Instant reload during development
  • Clean layout
  • Easy to extend (evaluation panel, metrics, expanders)