Sathvik-kota committed
Commit 0714800 · verified · 1 Parent(s): 8e41bc0

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +108 -1
README.md CHANGED
@@ -10,4 +10,111 @@ pinned: false
 ---
 
 
- # Multi-document-Embedding-Search-Engine-with-Caching
+ ---
+ title: Multi-Document Semantic Search Engine
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ app_file: start.sh
+ pinned: false
+ ---
+
+ # 🔍 Multi-Document Semantic Search Engine (Gemini-Style UI)
+
+ A **microservice-based** semantic search engine over 20 Newsgroups-style text documents with:
+
+ - Sentence-Transformers embeddings (`all-MiniLM-L6-v2`)
+ - **Local caching** (no repeated embedding computation)
+ - **FAISS** vector index (L2 on normalized embeddings)
+ - **LLM-powered explanations** (Gemini 2.5 Flash, optional)
+ - **Streamlit UI** styled like **Google Gemini**
+ - Full **evaluation suite** (Accuracy, MRR, nDCG, per-query breakdown)
+
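As a rough illustration of the ranking metrics named in the feature list, the sketch below computes MRR and binary-relevance nDCG@k in plain Python. The function names, the binary-relevance assumption, and the input shape (a ranked list of document IDs plus a set of relevant IDs) are assumptions made for this example; the project's own evaluation code may define these metrics differently.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For instance, `reciprocal_rank(["a", "b", "c"], {"b"})` evaluates to `0.5`, because the first relevant document sits at rank 2.
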
+ ---
+
+ ## 🚀 Features
+
+ ### 🔹 Core Search
+
+ - Embedding-based semantic search over `.txt` docs
+ - FAISS `IndexFlatL2` on normalized vectors (same ranking as cosine similarity)
+ - Top-K ranking + score display
+ - Keyword overlap, overlap ratio, top matching sentences
+
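The link between L2 distance and cosine similarity on normalized vectors can be checked with a few lines of plain Python. For unit vectors, the squared L2 distance equals `2 - 2 * cosine`, so sorting by ascending distance gives the same order as sorting by descending cosine. The vectors and document names below are made up for illustration, and the helpers stand in for what FAISS does internally; this is a sketch of the identity, not the project's search code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings, normalized the way the index expects.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a.txt": normalize([1.0, 2.0, 0.4]),
    "b.txt": normalize([-1.0, 0.5, 2.0]),
    "c.txt": normalize([0.2, 0.1, 3.0]),
}

by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```

Both `by_l2` and `by_cosine` list the documents in the same order, which is why an L2 index over normalized embeddings can stand in for a cosine-similarity search.
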
+ ### 🔹 Microservice Architecture (Your Big Idea 💡)
+
+ Each logical component runs as a **separate FastAPI microservice**:
+
+ - `doc_service` – loads & preprocesses documents
+ - `embed_service` – generates + caches embeddings
+ - `search_service` – maintains FAISS index & vector search
+ - `explain_service` – gives explanations (keywords + Gemini LLM)
+ - `api_gateway` – orchestrates everything behind a clean API
+ - `streamlit_ui` – user-facing Gemini-style search app
+
+ This mimics **real-world production** architectures and is a strong talking point in interviews.
+
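One way the "generates + caches embeddings" behavior could work is a disk cache keyed by a hash of the document text, so unchanged documents are never re-embedded. The sketch below is an assumption-laden illustration: `EmbeddingCache`, `fake_embed`, the SHA-256 key, and the JSON file layout are all hypothetical, and `fake_embed` is a stand-in for a real Sentence-Transformers call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class EmbeddingCache:
    """Disk cache mapping a hash of the text to its embedding vector."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.embed_fn = embed_fn
        self.misses = 0  # number of times the embedding had to be computed

    def _path_for(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, text):
        path = self._path_for(text)
        if path.exists():
            # Cache hit: reuse the stored vector.
            return json.loads(path.read_text())
        # Cache miss: compute once, then persist for later calls.
        self.misses += 1
        vector = self.embed_fn(text)
        path.write_text(json.dumps(vector))
        return vector


def fake_embed(text):
    """Stand-in for a real embedding model (hypothetical)."""
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    cache = EmbeddingCache(tmp, fake_embed)
    first = cache.get("hello world")
    second = cache.get("hello world")
    misses_after_two_calls = cache.misses
```

The second `get` call reads the vector back from disk, so the embedding function runs only once for identical text.
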
+ ### 🔹 Explanations
+
+ For each search result you get:
+
+ - ✅ Why this document was matched (LLM explanation)
+ - ✅ Which keywords overlapped (simple heuristic)
+ - ✅ Overlap ratio (0–1)
+ - ✅ Top matching sentences (semantic similarity)
+
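The keyword heuristic above could look roughly like the following. The stopword list, the tokenizer, and the choice to divide by the number of query keywords are assumptions for this sketch; the README only states that the ratio falls between 0 and 1.

```python
import re

# Small illustrative stopword list (hypothetical, not the project's).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "for"}


def keywords(text):
    """Lowercase the text, split into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}


def keyword_overlap(query, document):
    """Return the shared keywords and the fraction of query keywords matched."""
    query_terms = keywords(query)
    shared = sorted(query_terms & keywords(document))
    ratio = len(shared) / len(query_terms) if query_terms else 0.0
    return shared, ratio
```

For the query `"graphics card drivers"` against the text `"The new graphics drivers fix a bug"`, this returns `["drivers", "graphics"]` with a ratio of 2/3.
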
67
+ ---
68
+
69
+ ## πŸ—οΈ Architecture Overview
70
+
71
+ ### High-level Flow
72
+
73
+ 1. User asks a question in **Streamlit UI**
74
+ 2. UI sends request β†’ **API Gateway** `/search`
75
+ 3. Gateway:
76
+ - Embeds query via **Embed Service**
77
+ - Searches FAISS via **Search Service**
78
+ - Fetches full doc text from **Doc Service**
79
+ - Gets explanation from **Explain Service**
80
+ 4. Response returned to UI with:
81
+ - filename, score, preview, full text
82
+ - keyword overlap, overlap ratio
83
+ - top matching sentences
84
+ - optional LLM explanation
85
+
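The gateway steps above can be sketched as a single orchestration function. In the real system each step would be an HTTP call to a separate FastAPI service; here the four services are passed in as plain callables so the flow can be read and run in isolation. `SearchResult`, `handle_search`, and all parameter names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SearchResult:
    filename: str
    score: float
    text: str
    explanation: str


def handle_search(
    query: str,
    top_k: int,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[Tuple[str, float]]],
    fetch_doc: Callable[[str], str],
    explain: Callable[[str, str], str],
) -> List[SearchResult]:
    """Run the four gateway steps in order and assemble the response."""
    query_vector = embed(query)           # step 1: Embed Service
    hits = search(query_vector, top_k)    # step 2: Search Service
    results = []
    for filename, score in hits:
        text = fetch_doc(filename)        # step 3: Doc Service
        explanation = explain(query, text)  # step 4: Explain Service
        results.append(SearchResult(filename, score, text, explanation))
    return results


# Stub services standing in for the real microservices.
documents = {"a.txt": "space shuttle launch", "b.txt": "hockey playoffs"}
results = handle_search(
    "shuttle",
    1,
    embed=lambda q: [float(len(q))],
    search=lambda vector, k: [("a.txt", 0.12), ("b.txt", 0.98)][:k],
    fetch_doc=documents.__getitem__,
    explain=lambda q, text: f"query term found in text: {q in text}",
)
```

Passing the services in as arguments keeps the orchestration logic testable without starting any servers, at the cost of hiding the network errors and timeouts a real gateway would need to handle.
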
+ ### ASCII Diagram (Microservices Highlighted)
+
+ ```text
+        ┌─────────────────────────┐
+        │      Streamlit UI       │
+        │ (Gemini-style frontend) │
+        └────────────┬────────────┘
+                     │ HTTP /search
+                     ▼
+        ┌─────────────────────────┐
+        │       API Gateway       │ ← central orchestrator
+        └───────┬────────┬────────┘
+                │        │
+       Load docs│        │Explanations
+                │        ▼
+ ┌──────────────▼───┐   ┌─────────────────────┐
+ │   DOC SERVICE    │   │   EXPLAIN SERVICE   │
+ │ - read .txt      │   │ - keywords/overlap  │
+ │ - clean + hash   │   │ - top sentences     │
+ └───────────▲──────┘   │ - optional Gemini   │
+             │          └─────────▲───────────┘
+             │ Embeddings         │
+ ┌───────────┴───────────┐        │
+ │     EMBED SERVICE     │        │
+ │ - MiniLM embeddings   │        │
+ │ - caching to disk     │        │
+ └───────────▲───────────┘        │
+             │ vectors            │
+ ┌───────────┴───────────┐        │
+ │    SEARCH SERVICE     │        │
+ │ - FAISS index (L2)    │        │
+ │ - Top-K search        │        │
+ └───────────────────────┘        │
+                                  │
+ ───────── All behind API GATEWAY + Streamlit UI ─────────