Upload folder using huggingface_hub
README.md
CHANGED
@@ -10,4 +10,111 @@ pinned: false
---


---
title: Multi-Document Semantic Search Engine
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
app_file: start.sh
pinned: false
---

# Multi-Document Semantic Search Engine (Gemini-Style UI)

A **microservice-based** semantic search engine over 20 Newsgroups-style text documents with:

- Sentence-Transformers embeddings (`all-MiniLM-L6-v2`)
- **Local caching** (no repeated embedding computation)
- **FAISS** vector index (L2 on normalized embeddings)
- **LLM-powered explanations** (Gemini 2.5 Flash, optional)
- **Streamlit UI** styled like **Google Gemini**
- Full **evaluation suite** (Accuracy, MRR, nDCG, per-query breakdown)

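To make the evaluation metrics listed above concrete, here is a minimal sketch of how accuracy, MRR, and nDCG could be computed per query and averaged. It is not the project's evaluation code: the function names, the binary relevance labels, and the reading of "Accuracy" as top-1 accuracy are all assumptions made for illustration.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(runs, k=5):
    """Average the metrics over runs, a list of (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query to evaluate")
    n = len(runs)
    return {
        # Assumption: "Accuracy" means the top-ranked document is relevant.
        "accuracy_at_1": sum(1 for r, rel in runs if r and r[0] in rel) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "ndcg": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Two hypothetical queries: one answered at rank 1, one at rank 2.
runs = [(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"y"})]
metrics = evaluate(runs, k=3)
```

A per-query breakdown falls out of the same helpers by reporting `reciprocal_rank` and `ndcg_at_k` for each pair instead of only the averages.
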
---

## Features

### Core Search

- Embedding-based semantic search over `.txt` docs
- FAISS `IndexFlatL2` on normalized vectors (ranks results the same way as cosine similarity)
- Top-K ranking with score display
- Keyword overlap, overlap ratio, top matching sentences

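The link between L2 distance and cosine similarity on normalized vectors can be checked directly: for unit vectors, the squared L2 distance equals `2 - 2 * cosine`, so sorting by ascending distance gives the same order as sorting by descending cosine. The sketch below uses plain Python lists and made-up three-dimensional vectors rather than FAISS or real MiniLM embeddings, purely to illustrate the identity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product; equals cosine similarity only when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for embeddings (illustrative values only).
query = normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": normalize([0.9, 2.1, 0.4]),
    "doc_b": normalize([-1.0, 0.2, 3.0]),
    "doc_c": normalize([2.0, 1.0, 1.0]),
}

# Smaller distance is better; larger cosine is better.
by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```

One practical consequence: a displayed score that comes from an L2 index is a distance, so lower means a closer match, unless it is converted with `1 - distance / 2`.
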
### Microservice Architecture (Your Big Idea)

Each logical component runs as a **separate FastAPI microservice**:

- `doc_service`: loads and preprocesses documents
- `embed_service`: generates and caches embeddings
- `search_service`: maintains the FAISS index and runs vector search
- `explain_service`: produces explanations (keywords + Gemini LLM)
- `api_gateway`: orchestrates everything behind a clean API
- `streamlit_ui`: user-facing Gemini-style search app

This mirrors **real-world production** architectures and is a strong talking point in interviews.

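As an illustration of how hashing cleaned text can prevent repeated embedding computation, here is a minimal sketch of a disk cache keyed by content hash. The class name, the JSON file layout, the cleaning rule, and the `fake_embed` stand-in are all hypothetical; the actual `embed_service` would call the Sentence-Transformers model and may store vectors differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def clean_text(text):
    """Lowercase and collapse whitespace; a stand-in for the real preprocessing."""
    return " ".join(text.lower().split())


def content_hash(text):
    """Stable key for a document: SHA-256 of its cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Stores one JSON file per distinct cleaned text (hypothetical layout)."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.embed_fn = embed_fn

    def get(self, text):
        cleaned = clean_text(text)
        path = self.cache_dir / f"{content_hash(cleaned)}.json"
        if path.exists():
            # Cache hit: reuse the stored vector instead of re-embedding.
            return json.loads(path.read_text(encoding="utf-8"))
        vector = self.embed_fn(cleaned)
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector


calls = []


def fake_embed(text):
    """Hypothetical embedder that records each call so cache hits are visible."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(tempfile.mkdtemp(), fake_embed)
first = cache.get("Hello   World")
second = cache.get("hello world")  # same cleaned text, so no second embed call
```

Keying on the hash of the cleaned text rather than on the filename means an edited document gets a fresh embedding while an unchanged one is never recomputed.
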
### Explanations

For each search result you get:

- Why this document was matched (LLM explanation)
- Which keywords overlapped (simple heuristic)
- Overlap ratio (0–1)
- Top matching sentences (semantic similarity)

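The keyword heuristic above could look roughly like the following sketch. The tokenizer, the tiny stopword list, and the definition of the overlap ratio as the share of query keywords found in the document are assumptions for illustration; the `explain_service` may define them differently.

```python
import re

# Deliberately small, illustrative stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "to", "and", "for", "what", "how"}


def keywords(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_overlap(query, document):
    """Return (shared keywords, ratio of query keywords found in the document)."""
    query_terms = keywords(query)
    shared = sorted(query_terms & keywords(document))
    ratio = len(shared) / len(query_terms) if query_terms else 0.0
    return shared, ratio


shared, ratio = keyword_overlap(
    "What is the orbit of the space shuttle?",
    "The shuttle reached orbit on Tuesday.",
)
```

Here two of the three query keywords (`orbit`, `shuttle`, but not `space`) appear in the document, so the ratio is 2/3.
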
---

## Architecture Overview

### High-level Flow

1. User asks a question in **Streamlit UI**
2. UI sends request → **API Gateway** `/search`
3. Gateway:
   - Embeds query via **Embed Service**
   - Searches FAISS via **Search Service**
   - Fetches full doc text from **Doc Service**
   - Gets explanation from **Explain Service**
4. Response returned to UI with:
   - filename, score, preview, full text
   - keyword overlap, overlap ratio
   - top matching sentences
   - optional LLM explanation

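The gateway steps above can be sketched as a single function that calls each service in turn. This is only an outline under stated assumptions: the `Services` container, the callable signatures, and the response field names are hypothetical, and plain Python callables stand in for what would be HTTP calls between FastAPI services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Services:
    """Hypothetical bundle of service clients; each would wrap an HTTP call."""
    embed: Callable[[str], List[float]]
    search: Callable[[List[float], int], List[Tuple[str, float]]]
    fetch_doc: Callable[[str], str]
    explain: Callable[[str, str], Dict[str, object]]


def handle_search(query, services, top_k=3, preview_chars=40):
    """Embed the query, search, then enrich each hit with text and explanation."""
    query_vector = services.embed(query)
    results = []
    for filename, score in services.search(query_vector, top_k):
        full_text = services.fetch_doc(filename)
        result = {
            "filename": filename,
            "score": score,
            "preview": full_text[:preview_chars],
            "full_text": full_text,
        }
        # Explanation fields: keyword overlap, ratio, top sentences, optional LLM text.
        result.update(services.explain(query, full_text))
        results.append(result)
    return results


# Stub services with made-up data, so the flow can run without a network.
docs = {
    "a.txt": "Shuttle launch delayed by weather.",
    "b.txt": "Hockey playoffs start tonight.",
}
stubs = Services(
    embed=lambda q: [1.0, 0.0],
    search=lambda vec, k: [("a.txt", 0.12), ("b.txt", 0.87)][:k],
    fetch_doc=docs.__getitem__,
    explain=lambda q, text: {
        "keyword_overlap": [],
        "overlap_ratio": 0.0,
        "top_sentences": [text],
        "llm_explanation": None,
    },
)
response = handle_search("shuttle launch", stubs, top_k=1)
```

Passing the service clients in as arguments keeps the orchestration logic testable without starting any of the other services.
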
### ASCII Diagram (Microservices Highlighted)

```text
       ┌──────────────────────────┐
       │       Streamlit UI       │
       │ (Gemini-style frontend)  │
       └────────────┬─────────────┘
                    │ HTTP /search
                    ▼
       ┌──────────────────────────┐
       │       API Gateway        │  ← central orchestrator
       └───────┬────────┬─────────┘
               │        │
     Load docs │        │ Explanations
               │        ▼
┌──────────────▼───┐   ┌─────────────────────┐
│   DOC SERVICE    │   │   EXPLAIN SERVICE   │
│ - read .txt      │   │ - keywords/overlap  │
│ - clean + hash   │   │ - top sentences     │
└───────────▲──────┘   │ - optional Gemini   │
            │          └─────────▲───────────┘
            │ Embeddings         │
┌───────────┴───────────┐        │
│     EMBED SERVICE     │        │
│ - MiniLM embeddings   │        │
│ - caching to disk     │        │
└───────────▲───────────┘        │
            │ vectors            │
┌───────────┴───────────┐        │
│    SEARCH SERVICE     │        │
│ - FAISS index (L2)    │        │
│ - Top-K search        │        │
└───────────────────────┘        │
                                 │
──────── All behind API GATEWAY + Streamlit UI ────────