---
title: Document Search Engine
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 0.0.0
app_file: start.sh
pinned: false
---
# Multi-Document Semantic Search Engine

A production-inspired, multi-microservice semantic search system built over 20+ text documents.

Designed with:
- Sentence-Transformers (`all-MiniLM-L6-v2`) + Local Embedding Cache
- FAISS Vector Search + Persistent Storage
- LLM-Driven Explanations (Gemini 2.5 Flash)
- Google-Gemini-Style Streamlit UI
- Real Microservice Architecture
- Full Evaluation Suite (Accuracy · MRR · nDCG)

A complete end-to-end ML system demonstrating real-world architecture and search engineering.
## Features

### 🔹 Core Search
- Embedding-based semantic search over `.txt` documents
- FAISS `IndexFlatL2` on normalized vectors (≈ cosine similarity)
- Top-K ranking + similarity scores
- Keyword overlap and overlap ratio
- Top semantic sentences
- Full-text preview
### 🔹 Microservice Architecture (5 FastAPI Services)
Each component runs as an independent microservice, mirroring real production systems:
| Service | Responsibility |
|---|---|
| doc_service | Load, clean, normalize, hash, and store documents |
| embed_service | MiniLM embedding generation + caching |
| search_service | FAISS index build, update, and vector search |
| explain_service | Keyword overlap, top sentences, LLM explanations |
| api_gateway | Orchestration: a clean unified API for the UI |
| streamlit_ui | Gemini-style user interface |
This separation supports scalability, fault isolation, and independent service upgrades, much like real enterprise ML platforms.
### 🔹 Explanations
Every search result includes:
- Keyword overlap
- Semantic overlap ratio
- Top relevant sentences (MiniLM sentence similarity)
- LLM-generated explanation: "Why did this document match your query?"
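The "top relevant sentences" step above can be sketched as a cosine ranking over per-sentence vectors. This is a minimal NumPy sketch with toy 3-d vectors standing in for MiniLM embeddings; the real logic lives in `explain_service/explainer.py`, and the `top_sentences` helper name here is hypothetical.

```python
import numpy as np

def top_sentences(query_vec, sent_vecs, sentences, k=2):
    """Rank sentences by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    s = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    scores = s @ q                      # cosine similarity per sentence
    order = np.argsort(-scores)[:k]    # indices of the top-k sentences
    return [(sentences[i], float(scores[i])) for i in order]

sentences = ["FAISS builds the vector index.",
             "The cat sat on the mat.",
             "Embeddings are cached on disk."]
# Toy vectors standing in for MiniLM sentence embeddings (hypothetical values).
sent_vecs = np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.1, 0.9],
                      [0.7, 0.3, 0.1]])
query_vec = np.array([1.0, 0.0, 0.0])
best = top_sentences(query_vec, sent_vecs, sentences, k=2)
```

With real MiniLM vectors the ranking works identically; only the embedding step changes.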
### 🔹 Evaluation Suite
A built-in evaluation workflow providing:
- Accuracy
- MRR (Mean Reciprocal Rank)
- nDCG@K
- Correct vs Incorrect queries
- Per-query detailed table
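The repo computes these metrics in `eval/evaluate.py`; as a reference, here are the standard definitions of MRR and nDCG@K sketched under the assumption of a single relevant document per query (binary relevance). Function names are illustrative, not the repo's.

```python
import math

def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, rel, k):
    """nDCG@K with one relevant document: IDCG is 1 (ideal rank = 1)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc == rel:
            return 1.0 / math.log2(rank + 1)
    return 0.0

queries = [["d2", "d1", "d3"], ["d1", "d2", "d3"]]  # ranked results per query
gold = ["d1", "d3"]                                  # expected document per query
score_mrr = mrr(queries, gold)               # (1/2 + 1/3) / 2
score_ndcg = ndcg_at_k(queries[0], "d1", 3)  # hit at rank 2 -> 1 / log2(3)
```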
## How Caching Works
Caching happens inside `embed_service/cache_manager.py`.
✅ Zero repeated embeddings
Each document is fingerprinted using:
- filename
- MD5(cleaned_text)
If the hash matches a previously stored file:
- cached embedding is loaded instantly
- prevents costly re-embedding
- improves startup & query latency
Cache files:
- `cache/embed_meta.json` → mapping of filename → `{hash, index}`
- `cache/embeddings.npy` → matrix of all embeddings
Benefits
- Startup: 5–10 seconds → under 1 second
- Low compute cost
- Ideal for Hugging Face Spaces
- Guarantees reproducible results
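The fingerprint-then-lookup flow described above can be sketched as follows. This is a simplified, self-contained sketch, not the actual `cache_manager.py`: the helper names are hypothetical, and a trivial stand-in replaces the MiniLM model.

```python
import hashlib
import numpy as np

def fingerprint(cleaned_text):
    """Document identity: MD5 of the cleaned text (paired with the filename)."""
    return hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()

def get_embedding(filename, cleaned_text, embed_fn, meta, matrix):
    """Return a cached embedding when the fingerprint matches, else embed once."""
    digest = fingerprint(cleaned_text)
    entry = meta.get(filename)
    if entry and entry["hash"] == digest:
        return matrix[entry["index"]]            # cache hit: no re-embedding
    vec = embed_fn(cleaned_text)                 # cache miss: embed and record
    meta[filename] = {"hash": digest, "index": len(matrix)}
    matrix.append(vec)
    return vec

meta, matrix = {}, []                            # stand-ins for the JSON/NPY files
fake_embed = lambda text: np.full(4, float(len(text)))  # stand-in for MiniLM
v1 = get_embedding("a.txt", "hello world", fake_embed, meta, matrix)
v2 = get_embedding("a.txt", "hello world", fake_embed, meta, matrix)  # cache hit
```

In the real service, `meta` is persisted to `cache/embed_meta.json` and `matrix` to `cache/embeddings.npy`.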
## FAISS Persistence (Warm Start Optimization)
This project saves BOTH the embeddings and the FAISS index:
- `cache/embeddings.npy`
- `cache/embed_meta.json`
- `faiss_index.bin`
- `faiss_meta.pkl`

On startup, `search_service.indexer.try_load()` runs:
- If the persisted files are found → they load instantly.
- If not → the FAISS index is rebuilt from the cached embeddings.
Why this matters:
- Makes FAISS behave like a persistent vector database
- Extremely important for Docker, Spaces, and cold restarts
- Avoids the delay of rebuilding large indexes
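The warm-start pattern can be sketched like this. To stay self-contained, the sketch abstracts the FAISS build step behind a callback; the real `indexer.py` would additionally persist the index itself (FAISS exposes `faiss.write_index` / `faiss.read_index` for this), and the function names here are illustrative.

```python
import os
import tempfile
import numpy as np

def try_load(emb_path, build_index):
    """Warm start: rebuild the index from persisted embeddings if present."""
    if os.path.exists(emb_path):
        return build_index(np.load(emb_path)), "warm"
    return None, "cold"

# Demo with a stand-in index builder (the real service builds a FAISS index here).
build = lambda embs: {"size": len(embs)}
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "embeddings.npy")
    index, mode1 = try_load(path, build)                    # nothing persisted yet
    np.save(path, np.zeros((5, 4), dtype="float32"))        # first run persists
    index, mode2 = try_load(path, build)                    # warm start from disk
```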
## Folder Structure

```
├── .github
│   └── workflows
│       └── hf-space-deploy.yml   # GitHub Action → deploy to Hugging Face Space
├── src
│   ├── doc_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── utils.py
│   ├── embed_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── embedder.py
│   │   └── cache_manager.py
│   ├── search_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── indexer.py
│   ├── explain_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── explainer.py
│   ├── api_gateway
│   │   ├── __init__.py
│   │   └── app.py
│   └── ui
│       └── streamlit_app.py
├── data
│   └── docs
│       └── (150 .txt documents across 10 categories, loaded directly into the Space)
├── cache
│   ├── embed_meta.json
│   ├── embeddings.npy
│   ├── faiss_index.bin
│   └── faiss_meta.pkl
├── eval
│   ├── evaluate.py
│   └── generated_queries.json
├── start.sh
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md
```
## How to Run Embedding Generation
Embeddings are generated automatically during initialization.

Pipeline:
- doc_service → load + clean + hash
- embed_service → create or load cached embeddings
- search_service → build or load the FAISS index
- Return a summary
## How to Start the API
All services are launched using:

```shell
bash start.sh
```

This starts:
- 9001 → doc_service
- 9002 → embed_service
- 9003 → search_service
- 9004 → explain_service
- 8000 → api_gateway
- 7860 → Streamlit UI
## Architecture Overview

### High-level Flow
- The user asks a question in the Streamlit UI
- The UI sends the request to the API Gateway's `/search` endpoint
- The Gateway:
  - Embeds the query via the Embed Service
  - Searches FAISS via the Search Service
  - Fetches the full document text from the Doc Service
  - Gets an explanation from the Explain Service
- The response is returned to the UI with:
  - filename, score, preview, full text
  - keyword overlap, overlap ratio
  - top matching sentences
  - optional LLM explanation
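The gateway's fan-out can be sketched as a single orchestration function. The stubs below stand in for the HTTP calls the real gateway makes to the four services over the ports listed earlier; all function names, payload shapes, and values here are illustrative assumptions, not the repo's actual API.

```python
# Stub service calls standing in for the gateway's HTTP requests.
def embed_service(query):          # POST to embed_service in the real system
    return [1.0, 0.0]

def search_service(vector, k):     # FAISS top-k search in the real system
    return [{"doc_id": "doc_7", "score": 0.92}]

def doc_service(doc_id):           # full-text lookup in the real system
    return "Full text of " + doc_id

def explain_service(query, text):  # overlap stats + optional LLM explanation
    return {"keyword_overlap": ["text"], "explanation": "matched on 'text'"}

def search(query, k=3):
    """Gateway /search: orchestrate the four services into one response."""
    vector = embed_service(query)
    results = []
    for hit in search_service(vector, k):
        text = doc_service(hit["doc_id"])
        results.append({**hit,
                        "preview": text[:80],
                        "full_text": text,
                        "explanation": explain_service(query, text)})
    return results

results = search("what is in the text?")
```

Swapping each stub for an `httpx`/`requests` call to the corresponding service port recovers the real orchestration shape.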
## Design Choices

### 1️⃣ Microservices instead of a Monolith
- Real-world ML systems separate indexing, embedding, routing, and inference.
- Enables independent scaling, easier debugging, and service-level isolation.
### 2️⃣ MiniLM Embeddings
- Fast on CPU (optimized for lightweight inference)
- High semantic quality for short & long text
- Small model: ideal for search engines, mobile, and Spaces deployments
### 3️⃣ FAISS L2 on Normalized Embeddings
L2 distance is used instead of cosine because:
- FAISS `IndexFlatL2` is fast and well optimized
- For normalized vectors, L2 distance is mathematically equivalent to cosine distance
- This avoids the overhead of a separate cosine kernel
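The equivalence is easy to verify: for unit vectors, ||a − b||² = 2 − 2·cos(a, b), so sorting by L2 distance and sorting by cosine similarity give the same ranking. A quick NumPy check (384 dimensions, matching MiniLM's output size):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)   # 384-d, like MiniLM vectors
a /= np.linalg.norm(a)                               # normalize to unit length
b /= np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
l2_sq = np.sum((a - b) ** 2)
cosine = a @ b
assert abs(l2_sq - (2 - 2 * cosine)) < 1e-9
```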
### 4️⃣ Local Embedding Cache
- Reduces startup time from ~5 seconds to under 1 second
- Prevents re-embedding identical documents
- Allows FAISS persistence to work smoothly
- Speeds up startup and indexing
### 5️⃣ FAISS Persistence (Warm Start Optimization)
- Eliminates the need to rebuild the index on each startup
- Warm-loads instantly at startup
- Ideal for Spaces and Docker environments
- Acts as a lightweight vector database
### 6️⃣ LLM-Driven Explainability
- Generates human-friendly reasoning that makes search results more interpretable
- Explains why a document matched your query
- Combines:
- Top semantic-matching sentences
- Keyword overlap
  - Gemini's natural-language reasoning
### 7️⃣ Streamlit for Fast UI
- Instant reload during development
- Clean layout
- Easy to extend (evaluation panel, metrics, expanders)