---
title: Document Search Engine
emoji: πŸ“„
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 0.0.0
app_file: start.sh
pinned: false
---

Multi-Document Semantic Search Engine

A production-inspired, multi-microservice semantic search system built over a corpus of 150 text documents.

Designed with:

  • Sentence-Transformers (all-MiniLM-L6-v2)
  • Local Embedding Cache
  • FAISS Vector Search + Persistent Storage
  • LLM-Driven Explanations (Gemini 2.5 Flash)
  • Google-Gemini-Style Streamlit UI
  • Real Microservice Architecture
  • Full Evaluation Suite (Accuracy Β· MRR Β· nDCG)

A complete end-to-end ML system demonstrating real-world architecture & search engineering.


Features

πŸ”Ή Core Search

  • Embedding-based semantic search over .txt documents
  • FAISS IndexFlatL2 on normalized vectors (β‰ˆ cosine similarity)
  • Top-K ranking + similarity scores
  • Keyword overlap, overlap ratio
  • Top semantic sentences
  • Full-text preview
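The ranking step above can be sketched with a plain-numpy equivalent of what FAISS's `IndexFlatL2` computes over normalized vectors (the function names here are illustrative, not the project's actual API):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so L2 distance ranks the same way as cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k_search(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int = 3):
    """Brute-force equivalent of FAISS IndexFlatL2 search.

    Returns (indices, squared L2 distances) of the k nearest documents.
    """
    dists = np.sum((doc_vecs - query_vec) ** 2, axis=1)  # squared L2, as FlatL2 reports
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# Toy corpus: 4 "embeddings" in 3-D, normalized as the real pipeline does.
docs = normalize(np.array([[1.0, 0.0, 0.0],
                           [0.9, 0.1, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]]))
query = normalize(np.array([1.0, 0.05, 0.0]))
idx, d = top_k_search(docs, query, k=2)
```

The returned distances double as the similarity scores shown in the UI (smaller distance = closer match).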

πŸ”Ή Microservice Architecture (5 FastAPI Services + Streamlit UI)

Each component runs as an independent microservice, mirroring real production systems:

| Service | Responsibility |
|---|---|
| doc_service | Load, clean, normalize, hash, and store documents |
| embed_service | MiniLM embedding generation + caching |
| search_service | FAISS index build, update, and vector search |
| explain_service | Keyword overlap, top sentences, LLM explanations |
| api_gateway | Orchestration: a clean unified API for the UI |
| streamlit_ui | Gemini-style user interface |

This separation supports scalability, fault isolation, and independent service upgrades β€” like real enterprise ML platforms.


πŸ”Ή Explanations

Every search result includes:

  • Keyword overlap
  • Semantic overlap ratio
  • Top relevant sentences (MiniLM sentence similarity)
  • LLM-generated explanation:
    β€œWhy did this document match your query?”
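The keyword-overlap part of an explanation can be sketched as follows (a minimal stand-in for the logic in explain_service; the stopword list and tokenizer are illustrative):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def tokens(text: str) -> set:
    """Lowercased word tokens with basic stopword filtering."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_overlap(query: str, doc: str):
    """Shared keywords and the fraction of query terms covered by the document."""
    q, d = tokens(query), tokens(doc)
    shared = q & d
    ratio = len(shared) / len(q) if q else 0.0
    return sorted(shared), ratio

shared, ratio = keyword_overlap(
    "neural network training",
    "Training a neural network requires data and compute.",
)
```

The shared keywords, overlap ratio, and top MiniLM-scored sentences are then passed to the LLM as grounding for the natural-language explanation.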

πŸ”Ή Evaluation Suite

A built-in evaluation workflow providing:

  • Accuracy
  • MRR (Mean Reciprocal Rank)
  • nDCG@K
  • Correct vs Incorrect queries
  • Per-query detailed table
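The two ranking metrics can be computed as sketched below (binary relevance; this mirrors the standard definitions, not necessarily the exact code in eval/evaluate.py):

```python
import math

def mrr(ranked_relevance: list) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_k(rels: list, k: int) -> float:
    """nDCG@K: DCG of the observed ranking divided by DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Two queries: relevant doc at rank 1 and at rank 2 -> MRR = (1 + 1/2) / 2
score = mrr([[1, 0, 0], [0, 1, 0]])
n = ndcg_at_k([0, 1, 0], 3)
```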

How Caching Works

Caching happens inside embed_service/cache_manager.py.

βœ” Zero repeated embeddings

Each document is fingerprinted using:

  • filename
  • MD5(cleaned_text)

If the hash matches a previously stored file:

  • cached embedding is loaded instantly
  • prevents costly re-embedding
  • improves startup & query latency

Cache Files:

  • cache/embed_meta.json β†’ mapping of filename β†’ {hash, index}
  • cache/embeddings.npy β†’ matrix of all embeddings
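The fingerprint-and-lookup logic can be sketched like this (a minimal stand-in for cache_manager.py; `lookup` and the meta layout shown are illustrative):

```python
import hashlib

def fingerprint(cleaned_text: str) -> str:
    """MD5 of the cleaned document text, used as a change-detection fingerprint."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()

def lookup(meta: dict, filename: str, cleaned_text: str):
    """Return the cached embedding row index if the file is unchanged, else None."""
    entry = meta.get(filename)
    if entry and entry["hash"] == fingerprint(cleaned_text):
        return entry["index"]          # reuse the cached row in embeddings.npy
    return None                        # new or edited document -> must re-embed

# Simulated embed_meta.json content
meta = {"doc1.txt": {"hash": fingerprint("hello world"), "index": 0}}
hit = lookup(meta, "doc1.txt", "hello world")    # unchanged -> cache hit
miss = lookup(meta, "doc1.txt", "hello there")   # edited -> cache miss
```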

Benefits

  • Startup: 5–10 seconds β†’ <1 second
  • Low compute cost
  • Ideal for Hugging Face Spaces
  • Guarantees reproducible results

FAISS Persistence (Warm Start Optimization)

This project saves BOTH embeddings and FAISS index:

  • cache/embeddings.npy
  • cache/embed_meta.json
  • faiss_index.bin
  • faiss_meta.pkl

On startup, search_service.indexer.try_load() is called:

  • If the persisted index is found → it is loaded instantly.
  • If not → the FAISS index is rebuilt from the cached embeddings.
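The warm-start pattern can be sketched as below, with a dict standing in for the FAISS index so the sketch stays dependency-light (the real indexer would call faiss.read_index / faiss.write_index on faiss_index.bin):

```python
import os
import tempfile
import numpy as np

def build_index(embeddings: np.ndarray):
    """Stand-in for building an index; real code creates IndexFlatL2 and add()s vectors."""
    return {"ntotal": len(embeddings)}

def try_load(idx_path: str, emb_path: str):
    """Warm start: load the persisted index if present, else rebuild from cached embeddings."""
    if os.path.exists(idx_path):
        return "warm"                              # real code: faiss.read_index(idx_path)
    if os.path.exists(emb_path):
        return build_index(np.load(emb_path))      # rebuild from embeddings.npy
    return None                                    # cold start: nothing cached yet

with tempfile.TemporaryDirectory() as d:
    emb_path = os.path.join(d, "embeddings.npy")
    idx_path = os.path.join(d, "faiss_index.bin")
    cold = try_load(idx_path, emb_path)                           # nothing cached
    np.save(emb_path, np.zeros((150, 384), dtype="float32"))      # 384 = MiniLM dim
    rebuilt = try_load(idx_path, emb_path)                        # rebuilt from cache
```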

Why this matters:

  • Makes FAISS behave like a persistent vector database
  • Critical for Docker, Spaces, and cold restarts
  • Avoids the delay of rebuilding large indexes on every restart

Folder Structure

```
β”œβ”€β”€ .github
β”‚   └── workflows
β”‚       └── hf-space-deploy.yml   # GitHub Action β†’ deploy to Hugging Face Space
β”œβ”€β”€ src
β”‚   β”œβ”€β”€ doc_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── utils.py
β”‚   β”œβ”€β”€ embed_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   β”œβ”€β”€ embedder.py
β”‚   β”‚   └── cache_manager.py
β”‚   β”œβ”€β”€ search_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── indexer.py
β”‚   β”œβ”€β”€ explain_service
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ app.py
β”‚   β”‚   └── explainer.py
β”‚   β”œβ”€β”€ api_gateway
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── app.py
β”‚   └── ui
β”‚       └── streamlit_app.py
β”œβ”€β”€ data
β”‚   └── docs
β”‚       └── (150 .txt documents across 10 categories, loaded directly into the Space)
β”œβ”€β”€ cache
β”‚   β”œβ”€β”€ embed_meta.json
β”‚   β”œβ”€β”€ embeddings.npy
β”‚   β”œβ”€β”€ faiss_index.bin
β”‚   └── faiss_meta.pkl
β”œβ”€β”€ eval
β”‚   β”œβ”€β”€ evaluate.py
β”‚   └── generated_queries.json
β”œβ”€β”€ start.sh
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .gitignore
└── README.md
```

How to Run Embedding Generation

Embeddings are generated automatically during initialization.

Pipeline:

  1. doc_service β†’ load + clean + hash
  2. embed_service β†’ create or load cached embeddings
  3. search_service β†’ FAISS index build or load
  4. Return summary
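The pipeline above can be sketched end-to-end with a stubbed embedder (the real system calls MiniLM via embed_service; `clean`, `embed`, and `initialize` here are illustrative names):

```python
import hashlib
import numpy as np

def clean(text: str) -> str:
    """doc_service step: normalize whitespace and case."""
    return " ".join(text.lower().split())

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stub for MiniLM: deterministic pseudo-embedding seeded from the text hash."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim).astype("float32")
    return v / np.linalg.norm(v)

def initialize(raw_docs: dict, cache: dict) -> np.ndarray:
    """Pipeline: clean + hash -> embed (or reuse cache) -> stack matrix for indexing."""
    rows = []
    for name, raw in raw_docs.items():
        text = clean(raw)
        h = hashlib.md5(text.encode()).hexdigest()
        if name not in cache or cache[name][0] != h:
            cache[name] = (h, embed(text))       # re-embed only when content changed
        rows.append(cache[name][1])
    return np.stack(rows)

cache = {}
mat = initialize({"a.txt": "Hello  World", "b.txt": "FAISS search"}, cache)
# Same cleaned content on restart -> every document is a cache hit.
again = initialize({"a.txt": "hello world", "b.txt": "FAISS search"}, cache)
```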

How to Start the API

All services are launched with:

```bash
bash start.sh
```

This starts:

  • 9001 → doc_service
  • 9002 → embed_service
  • 9003 → search_service
  • 9004 → explain_service
  • 8000 → api_gateway
  • 7860 → Streamlit UI

Architecture Overview

High-level Flow

  1. User asks a question in Streamlit UI
  2. UI sends request β†’ API Gateway /search
  3. Gateway:
    • Embeds query via Embed Service
    • Searches FAISS via Search Service
    • Fetches full doc text from Doc Service
    • Gets explanation from Explain Service
  4. Response returned to UI with:
    • filename, score, preview, full text
    • keyword overlap, overlap ratio
    • top matching sentences
    • optional LLM explanation

Design Choices

1️⃣ Microservices instead of Monolithic

  • Real-world ML systems separate indexing, embedding, routing, and inference.
  • Enables independent scaling, easier debugging, and service-level isolation.

2️⃣ MiniLM Embeddings

  • Fast on CPU (optimized for lightweight inference)
  • High semantic quality for short & long text
  • Small model β†’ ideal for search engines, mobile, Spaces deployments

3️⃣ FAISS L2 on Normalized Embeddings

L2 distance is used instead of an explicit cosine index because:

  • FAISS IndexFlatL2 is simple and heavily optimized
  • On unit-normalized vectors, L2 distance is a monotonic function of cosine similarity, so both produce identical rankings
  • Avoids maintaining a separate cosine/inner-product code path
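The equivalence is easy to verify numerically: for unit vectors a and b, ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b = 2 − 2·cos(a, b), so sorting by L2 distance and by cosine similarity yields the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)   # 384 = MiniLM embedding dimension
b = rng.standard_normal(384)
a /= np.linalg.norm(a)         # normalize, as the pipeline does before indexing
b /= np.linalg.norm(b)

l2_sq = float(np.sum((a - b) ** 2))
cos = float(a @ b)
# For unit vectors: ||a - b||^2 == 2 - 2*cos(a, b)
```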

4️⃣ Local Embedding Cache

  • Reduces startup time from ~5 seconds → <1 second
  • Prevents re-embedding identical documents
  • Allows FAISS persistence to work smoothly

5️⃣ FAISS Persistence (Warm Start Optimization)

  • Eliminates the need to rebuild index on each startup
  • Warm-loads instantly at startup
  • Ideal for Spaces & Docker environments
  • Acts as a lightweight vector database

6️⃣ LLM-Driven Explainability

  • Generates human-friendly reasoning, making search results more interpretable
  • Explains why a document matched your query
  • Combines:
    • Top semantic-matching sentences
    • Keyword overlap
    • Gemini’s natural-language reasoning

7️⃣ Streamlit for Fast UI

  • Instant reload during development
  • Clean layout
  • Easy to extend (evaluation panel, metrics, expanders)