---
title: Document Search Engine
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: "0.0.0"
app_file: start.sh
pinned: false
---

# Multi-Document Semantic Search Engine

A **production-inspired, multi-microservice semantic search system** built over 20+ text documents. Designed with:

- **Sentence-Transformers** (`all-MiniLM-L6-v2`)
- **Local embedding cache**
- **FAISS vector search + persistent storage**
- **LLM-driven explanations (Gemini 2.5 Flash)**
- **Google-Gemini-style Streamlit UI**
- **Real microservice architecture**
- **Full evaluation suite (Accuracy · MRR · nDCG)**

A complete end-to-end ML system demonstrating real-world architecture and search engineering.

---

# Features

## 🔹 Core Search

- Embedding-based semantic search over `.txt` documents
- FAISS `IndexFlatL2` on **normalized vectors** (equivalent to cosine-similarity ranking)
- Top-K ranking with similarity scores
- Keyword overlap and overlap ratio
- Top semantic sentences
- Full-text preview

---

## 🔹 Microservice Architecture (5 FastAPI Services + Streamlit UI)

Each component runs as an **independent service**, mirroring real production systems:

| Service | Responsibility |
|--------|----------------|
| **doc_service** | Load, clean, normalize, hash, and store documents |
| **embed_service** | MiniLM embedding generation + caching |
| **search_service** | FAISS index build, update, and vector search |
| **explain_service** | Keyword overlap, top sentences, LLM explanations |
| **api_gateway** | Orchestration: a clean, unified API for the UI |
| **streamlit_ui** | Gemini-style user interface |

This separation supports **scalability**, **fault isolation**, and **independent service upgrades**, just like real enterprise ML platforms.
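The hashing responsibility of `doc_service` combined with the embedding cache in `embed_service` can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation: the function names (`fingerprint`, `needs_reembedding`) and the in-memory `meta` dict standing in for `cache/embed_meta.json` are hypothetical.

```python
import hashlib

def fingerprint(filename: str, cleaned_text: str) -> dict:
    """Fingerprint a document by filename + MD5 of its cleaned text."""
    digest = hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()
    return {"filename": filename, "hash": digest}

def needs_reembedding(meta: dict, filename: str, cleaned_text: str) -> bool:
    """Re-embed only when the file is new or its cleaned text has changed."""
    entry = meta.get(filename)
    current = hashlib.md5(cleaned_text.encode("utf-8")).hexdigest()
    return entry is None or entry["hash"] != current

# Tiny in-memory stand-in for cache/embed_meta.json
meta = {"a.txt": {"hash": fingerprint("a.txt", "hello world")["hash"], "index": 0}}

print(needs_reembedding(meta, "a.txt", "hello world"))  # unchanged -> False
print(needs_reembedding(meta, "a.txt", "hello there"))  # changed   -> True
print(needs_reembedding(meta, "b.txt", "new doc"))      # new file  -> True
```

Keying on the content hash rather than the filename alone means edited documents are re-embedded while untouched ones hit the cache, which is what makes warm restarts cheap.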
--- ## πŸ”Ή Explanations Every search result includes: - **Keyword overlap** - **Semantic overlap ratio** - **Top relevant sentences (MiniLM sentence similarity)** - **LLM-generated explanation**: β€œWhy did this document match your query?” --- ## πŸ”Ή Evaluation Suite A built-in evaluation workflow providing: - **Accuracy** - **MRR (Mean Reciprocal Rank)** - **nDCG@K** - Correct vs Incorrect queries - Per-query detailed table --- # How Caching Works (MANDATORY SECTION) Caching happens inside **`embed_service/cache_manager.py`**. ### βœ” Zero repeated embeddings Each document is fingerprinted using: - **filename** - **MD5(cleaned_text)** If the hash matches a previously stored file: - cached embedding is loaded instantly - prevents costly re-embedding - improves startup & query latency ### Cache Files: - `cache/embed_meta.json` β†’ mapping of filename β†’ `{hash, index}` - `cache/embeddings.npy` β†’ matrix of all embeddings ### Benefits - Startup: **5–10 seconds β†’ <1 second** - Low compute cost - Ideal for Hugging Face Spaces - Guarantees reproducible results --- # FAISS Persistence (Warm Start Optimization) This project saves BOTH embeddings and FAISS index: - `cache/embeddings.npy` - `cache/embed_meta.json` - `faiss_index.bin` - `faiss_meta.pkl` On startup:search_service.indexer.try_load() If found β†’ loaded instantly. If not β†’ FAISS index is rebuilt from cached embeddings. ### Why this matters? 
- Makes FAISS behave like a **persistent vector database**
- Extremely important for **Docker**, **Spaces**, and **cold restarts**
- No delay rebuilding large indexes on restart

---

# Folder Structure

```
├── .github
│   └── workflows
│       └── hf-space-deploy.yml   # GitHub Action → deploy to Hugging Face Space
│
├── src
│   ├── doc_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── utils.py
│   │
│   ├── embed_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── embedder.py
│   │   └── cache_manager.py
│   │
│   ├── search_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── indexer.py
│   │
│   ├── explain_service
│   │   ├── __init__.py
│   │   ├── app.py
│   │   └── explainer.py
│   │
│   ├── api_gateway
│   │   ├── __init__.py
│   │   └── app.py
│   │
│   └── ui
│       └── streamlit_app.py
│
├── data
│   └── docs
│       └── (150 .txt documents from 10 categories, loaded directly into the HF Space)
│
├── cache
│   ├── embed_meta.json
│   ├── embeddings.npy
│   ├── faiss_index.bin
│   └── faiss_meta.pkl
│
├── eval
│   ├── evaluate.py
│   └── generated_queries.json
│
├── start.sh
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md
```

---

# How to Run Embedding Generation

Embeddings are generated automatically during initialization.

Pipeline:

1. **doc_service** → load + clean + hash
2. **embed_service** → create or load cached embeddings
3. **search_service** → build or load the FAISS index
4. Return a summary

---

# How to Start the API

All services are launched with:

```bash
bash start.sh
```

This starts:

- `9001` → doc_service
- `9002` → embed_service
- `9003` → search_service
- `9004` → explain_service
- `8000` → api_gateway
- `7860` → Streamlit UI

---

## Architecture Overview

### High-level Flow

1. User asks a question in the **Streamlit UI**
2. 
UI sends the request to the **API Gateway** `/search` endpoint
3. The gateway:
   - embeds the query via **embed_service**
   - searches FAISS via **search_service**
   - fetches the full document text from **doc_service**
   - gets an explanation from **explain_service**
4. The response is returned to the UI with:
   - filename, score, preview, full text
   - keyword overlap, overlap ratio
   - top matching sentences
   - an optional LLM explanation

---

## Design Choices

### 1️⃣ **Microservices instead of a Monolith**

- Real-world ML systems separate **indexing, embedding, routing, and inference**.
- Enables **independent scaling**, easier debugging, and service-level isolation.

---

### 2️⃣ **MiniLM Embeddings**

- **Fast on CPU** (optimized for lightweight inference)
- **High semantic quality** for short and long text
- **Small model**, ideal for search engines, mobile, and Spaces deployments

---

### 3️⃣ **FAISS L2 on Normalized Embeddings**

L2 distance is used instead of cosine because:

- **FAISS `IndexFlatL2` is fast** and well optimized
- For unit-normalized vectors, `L2² = 2 − 2·cosine`, so ranking by smallest L2 distance is **identical** to ranking by largest cosine similarity
- Avoids the overhead of a separate cosine kernel

---

### 4️⃣ **Local Embedding Cache**

- Reduces startup time from **~5 seconds to <1 second**
- Prevents **re-embedding identical documents**
- Allows FAISS persistence to work smoothly
- Speeds up startup and indexing

---

### 5️⃣ **FAISS Persistence (Warm Start Optimization)**

- Eliminates the need to rebuild the index on each startup
- Warm-loads instantly at startup
- Ideal for Spaces and Docker environments
- Acts as a lightweight vector database

---

### 6️⃣ **LLM-Driven Explainability**

- Generates **human-friendly reasoning**, making search results more interpretable
- Explains **why a document matched your query**
- Combines:
  - top semantic-matching sentences
  - keyword overlap
  - Gemini's natural-language reasoning

---

### 7️⃣ **Streamlit for Fast UI**

- Instant reload during development
- Clean layout
- Easy to extend (evaluation panel, metrics, expanders)
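The equivalence claimed in design choice 3 (L2 distance on unit-normalized vectors yields the same ranking as cosine similarity) can be checked numerically with plain NumPy, no FAISS required. The vectors below are random stand-ins, not real MiniLM embeddings; only the dimensionality (384) matches the model.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))   # fake "document embeddings", MiniLM-sized
query = rng.normal(size=(384,))

# Normalize to unit length, as done before indexing with IndexFlatL2
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cosine = docs @ query                          # cosine similarity (unit vectors)
l2_sq = ((docs - query) ** 2).sum(axis=1)      # squared L2 distance

# For unit vectors: ||d - q||^2 = 2 - 2 * cos(d, q)
assert np.allclose(l2_sq, 2 - 2 * cosine)

# Hence smallest-L2 order == largest-cosine order
assert (np.argsort(l2_sq) == np.argsort(-cosine)).all()
```

Because the relationship is a monotone transform, the scores FAISS returns differ from cosine values, but the Top-K result set and its order are the same.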