Monimoy committed on
Commit 614f1e2 · verified · 1 Parent(s): 4cb5165

Upload 5 files

Files changed (5)
  1. README.md +45 -5
  2. gitignore +45 -0
  3. main.py +464 -0
  4. quick_start.sh +52 -0
  5. requirements.txt +7 -0
README.md CHANGED
@@ -1,11 +1,51 @@
  ---
  title: Simple Search Engine
- emoji: 🌍
- colorFrom: yellow
- colorTo: gray
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
+ app_port: 7860
  pinned: false
- short_description: Simple Search Engine
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Simple Search Engine 🔍
+
+ A semantic document search engine built with Sentence Transformers (SBERT) and FastAPI.
+
+ ## Features
+
+ - **Semantic Search**: Uses the `all-MiniLM-L6-v2` model to match queries by meaning rather than by exact keywords
+ - **Fast & Efficient**: Built with FastAPI; chunk embeddings are computed once at startup
+ - **Beautiful UI**: Clean, modern interface with gradient design
+ - **Ranked Results**: Each query returns ranked chunks with their cosine similarity scores
+
+ ## How It Works
+
+ 1. Documents are split into chunks of 3 sentences each
+ 2. Each chunk is encoded with SBERT into a vector embedding
+ 3. The user query is encoded the same way and compared to every chunk using cosine similarity
+ 4. The 5 most similar chunks are returned with their similarity scores
+
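Review note: steps 3 and 4 of "How It Works" can be illustrated without the model. The sketch below is a minimal, standard-library version of cosine-similarity ranking; the function names and the toy 2-D vectors are assumptions made for illustration, not code or data from this commit (main.py itself uses scikit-learn's `cosine_similarity` on SBERT embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=5):
    """Return (index, score) pairs for the top_k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings" (hypothetical values, not real SBERT output)
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_chunks([1.0, 0.0], toy_chunks, top_k=2)
```

Here the query vector points in the same direction as chunk 0, so that chunk ranks first, followed by chunk 2, which shares half of its direction with the query.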
+ ## Technology Stack
+
+ - **Backend**: FastAPI
+ - **ML Model**: Sentence Transformers (all-MiniLM-L6-v2)
+ - **NLP**: NLTK for sentence tokenization
+ - **Similarity**: Scikit-learn for cosine similarity computation
+
+ ## Usage
+
+ Simply enter your search query in the search box and press Enter or click the Search button. The engine will return the top 5 most relevant document chunks with their similarity scores.
+
+ ## API Endpoints
+
+ - `GET /` - Web interface
+ - `POST /search` - Search endpoint (accepts JSON with `query` field)
+ - `GET /health` - Health check endpoint
+
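Review note: a client for the `POST /search` endpoint listed above might look like the following sketch. It only builds the request; the helper name `build_search_request` is hypothetical, and the base URL assumes the local port 8000 that quick_start.sh prints, which may differ from the deployed Space.

```python
import json
import urllib.request


def build_search_request(base_url, query):
    """Build a POST /search request carrying {"query": ...} as JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# base_url is an assumption: the local address suggested by quick_start.sh
request = build_search_request("http://localhost:8000", "machine learning AI")

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     results = json.loads(response.read())
#     # each item should carry rank, doc_id, similarity_score, and text
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a live instance.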
+ ## Example Queries
+
+ - "machine learning AI"
+ - "cloud infrastructure AWS"
+ - "financial reports revenue"
+ - "marketing SEO strategies"
gitignore ADDED
@@ -0,0 +1,45 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual Environment
+ venv/
+ env/
+ ENV/
+ .venv
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Model cache
+ .cache/
+ models/
+
+ # NLTK data
+ nltk_data/
main.py ADDED
@@ -0,0 +1,464 @@
+ # main.py - FastAPI Backend
+
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.responses import HTMLResponse
+ from pydantic import BaseModel
+ import nltk
+ from nltk.tokenize import sent_tokenize
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+ import numpy as np
+
+ # Download required NLTK data
+ nltk.download('punkt', quiet=True)
+ nltk.download('punkt_tab', quiet=True)
+
+ # Initialize FastAPI app
+ app = FastAPI(title="Simple Search Engine")
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=False,  # credentials should not be combined with a wildcard origin
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Define the document database
+ documents = {
+     "doc1": """
+     A new AI analytics tool has been released by TechCorp.
+     This tool uses advanced machine learning algorithms to process large datasets.
+     It can provide real-time insights and predictive analytics for businesses.
+     The tool integrates seamlessly with existing data infrastructure.
+     Companies can now make data-driven decisions faster than ever before.
+     The AI engine continuously learns from new data to improve accuracy.
+     """,
+
+     "doc2": """
+     The quarterly finance report shows strong revenue growth.
+     Operating expenses have decreased by 15% compared to last quarter.
+     Net profit margins have improved significantly across all divisions.
+     The company's cash flow remains healthy with substantial reserves.
+     Investment in new projects is expected to yield returns next year.
+     Shareholders can expect increased dividends this quarter.
+     """,
+
+     "doc3": """
+     Cloud infrastructure services from AWS and Azure are becoming essential.
+     Companies are migrating their legacy systems to the cloud for better scalability.
+     AWS offers a wide range of compute and storage options.
+     Azure provides excellent integration with Microsoft enterprise products.
+     Both platforms support hybrid cloud deployments for flexibility.
+     Security and compliance features are continuously being enhanced.
+     """,
+
+     "doc4": """
+     Our new marketing campaign focuses on SEO optimization strategies.
+     We are targeting high-value keywords to increase organic traffic.
+     Social media engagement has improved by 40% this month.
+     Content marketing efforts are driving more qualified leads.
+     The campaign includes email marketing and paid search ads.
+     We expect to see ROI improvements within the next quarter.
+     """,
+
+     "doc5": """
+     The AI tool leverages machine learning for predictive maintenance.
+     Machine learning models can detect patterns in equipment behavior.
+     This AI-powered solution reduces downtime and operational costs.
+     Deep learning techniques are applied to analyze sensor data.
+     The system continuously learns and adapts to new scenarios.
+     AI and machine learning are transforming industrial operations.
+     """
+ }
+
+ # Split each document into chunks of `sentences_per_chunk` consecutive sentences
+ def chunk_documents(documents, sentences_per_chunk=3):
+     chunks = []
+     chunk_metadata = []
+
+     for doc_id, text in documents.items():
+         sentences = sent_tokenize(text.strip())
+         for i in range(0, len(sentences), sentences_per_chunk):
+             chunk = ' '.join(sentences[i:i+sentences_per_chunk])
+             chunks.append(chunk)
+             chunk_metadata.append({
+                 'doc_id': doc_id,
+                 'chunk_index': i // sentences_per_chunk,
+                 'text': chunk
+             })
+
+     return chunks, chunk_metadata
+
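Review note: the grouping arithmetic in `chunk_documents` can be checked in isolation. The sketch below reproduces only the slicing step on an already-split list of sentences; `chunk_sentences` and the placeholder sentences are hypothetical, and sentence splitting itself (done by NLTK's `sent_tokenize` in main.py) is deliberately left out.

```python
def chunk_sentences(sentences, sentences_per_chunk=3):
    """Group a list of sentences into consecutive fixed-size chunks."""
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


# Seven placeholder sentences: two full chunks plus a one-sentence remainder
sentences = [f"Sentence {n}." for n in range(1, 8)]
chunks = chunk_sentences(sentences)
```

The last chunk is simply shorter when the sentence count is not a multiple of the chunk size. If each of the five six-sentence documents in this commit is split into six sentences as expected, the same arithmetic gives two chunks per document, ten in total.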
+ # Initialize model and process documents at startup
+ print("Initializing search engine...")
+ model = SentenceTransformer('all-MiniLM-L6-v2')
+ chunks, chunk_metadata = chunk_documents(documents)
+ chunk_embeddings = model.encode(chunks)
+ print(f"Search engine ready! {len(chunks)} chunks indexed.")
+
+ # Pydantic models
+ class SearchQuery(BaseModel):
+     query: str
+
+ class SearchResult(BaseModel):
+     rank: int
+     doc_id: str
+     similarity_score: float
+     text: str
+
+ # API Endpoints
+ @app.get("/")
+ async def read_root():
+     html_content = """
+     <!DOCTYPE html>
+     <html lang="en">
+     <head>
+         <meta charset="UTF-8">
+         <meta name="viewport" content="width=device-width, initial-scale=1.0">
+         <title>Simple Search Engine</title>
+         <style>
+             * {
+                 margin: 0;
+                 padding: 0;
+                 box-sizing: border-box;
+             }
+
+             body {
+                 font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+                 background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                 min-height: 100vh;
+                 padding: 20px;
+             }
+
+             .container {
+                 max-width: 900px;
+                 margin: 0 auto;
+             }
+
+             .header {
+                 text-align: center;
+                 color: white;
+                 margin-bottom: 40px;
+                 padding-top: 60px;
+             }
+
+             .header h1 {
+                 font-size: 3em;
+                 margin-bottom: 10px;
+                 text-shadow: 2px 2px 4px rgba(0,0,0,0.3);
+             }
+
+             .header p {
+                 font-size: 1.2em;
+                 opacity: 0.9;
+             }
+
+             .search-box {
+                 background: white;
+                 border-radius: 50px;
+                 padding: 10px 20px;
+                 box-shadow: 0 8px 30px rgba(0,0,0,0.3);
+                 display: flex;
+                 align-items: center;
+                 margin-bottom: 40px;
+             }
+
+             .search-box input {
+                 flex: 1;
+                 border: none;
+                 outline: none;
+                 font-size: 1.1em;
+                 padding: 10px;
+             }
+
+             .search-box button {
+                 background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                 border: none;
+                 color: white;
+                 padding: 12px 30px;
+                 border-radius: 25px;
+                 font-size: 1em;
+                 cursor: pointer;
+                 transition: transform 0.2s;
+                 font-weight: bold;
+             }
+
+             .search-box button:hover {
+                 transform: scale(1.05);
+             }
+
+             .search-box button:active {
+                 transform: scale(0.95);
+             }
+
+             .loading {
+                 text-align: center;
+                 color: white;
+                 font-size: 1.2em;
+                 margin: 20px 0;
+                 display: none;
+             }
+
+             .loading.show {
+                 display: block;
+             }
+
+             .results {
+                 display: none;
+             }
+
+             .results.show {
+                 display: block;
+             }
+
+             .result-card {
+                 background: white;
+                 border-radius: 15px;
+                 padding: 25px;
+                 margin-bottom: 20px;
+                 box-shadow: 0 4px 15px rgba(0,0,0,0.2);
+                 transition: transform 0.2s, box-shadow 0.2s;
+                 animation: slideIn 0.5s ease-out;
+             }
+
+             @keyframes slideIn {
+                 from {
+                     opacity: 0;
+                     transform: translateY(20px);
+                 }
+                 to {
+                     opacity: 1;
+                     transform: translateY(0);
+                 }
+             }
+
+             .result-card:hover {
+                 transform: translateY(-5px);
+                 box-shadow: 0 6px 25px rgba(0,0,0,0.3);
+             }
+
+             .result-header {
+                 display: flex;
+                 justify-content: space-between;
+                 align-items: center;
+                 margin-bottom: 15px;
+             }
+
+             .result-rank {
+                 background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                 color: white;
+                 padding: 5px 15px;
+                 border-radius: 20px;
+                 font-weight: bold;
+                 font-size: 0.9em;
+             }
+
+             .result-doc {
+                 color: #666;
+                 font-size: 0.9em;
+                 font-weight: 600;
+             }
+
+             .result-score {
+                 background: #e8f5e9;
+                 color: #2e7d32;
+                 padding: 5px 12px;
+                 border-radius: 15px;
+                 font-size: 0.85em;
+                 font-weight: bold;
+             }
+
+             .result-text {
+                 color: #333;
+                 line-height: 1.6;
+                 font-size: 1em;
+             }
+
+             .no-results {
+                 text-align: center;
+                 color: white;
+                 font-size: 1.2em;
+                 margin-top: 40px;
+                 display: none;
+             }
+
+             .no-results.show {
+                 display: block;
+             }
+
+             .stats {
+                 text-align: center;
+                 color: white;
+                 margin-bottom: 30px;
+                 font-size: 1.1em;
+                 opacity: 0.9;
+             }
+         </style>
+     </head>
+     <body>
+         <div class="container">
+             <div class="header">
+                 <h1>🔍 SimpleSearch</h1>
+                 <p>Your intelligent document search engine</p>
+             </div>
+
+             <div class="search-box">
+                 <input type="text" id="searchInput" placeholder="Search for documents..." />
+                 <button onclick="performSearch()">Search</button>
+             </div>
+
+             <div class="loading" id="loading">
+                 <p>🔄 Searching...</p>
+             </div>
+
+             <div class="stats" id="stats"></div>
+
+             <div class="results" id="results"></div>
+
+             <div class="no-results" id="noResults">
+                 <p>No results found. Try a different query!</p>
+             </div>
+         </div>
+
+         <script>
+             // Allow Enter key to trigger search
+             document.getElementById('searchInput').addEventListener('keypress', function(e) {
+                 if (e.key === 'Enter') {
+                     performSearch();
+                 }
+             });
+
+             async function performSearch() {
+                 const query = document.getElementById('searchInput').value.trim();
+
+                 if (!query) {
+                     alert('Please enter a search query!');
+                     return;
+                 }
+
+                 // Show loading, hide results
+                 document.getElementById('loading').classList.add('show');
+                 document.getElementById('results').classList.remove('show');
+                 document.getElementById('noResults').classList.remove('show');
+                 document.getElementById('stats').innerHTML = '';
+
+                 try {
+                     const response = await fetch('/search', {
+                         method: 'POST',
+                         headers: {
+                             'Content-Type': 'application/json',
+                         },
+                         body: JSON.stringify({ query: query })
+                     });
+
+                     if (!response.ok) {
+                         throw new Error('Search failed');
+                     }
+
+                     const data = await response.json();
+                     displayResults(data, query);
+
+                 } catch (error) {
+                     console.error('Error:', error);
+                     alert('Search failed. Please try again.');
+                 } finally {
+                     document.getElementById('loading').classList.remove('show');
+                 }
+             }
+
+             function displayResults(results, query) {
+                 const resultsDiv = document.getElementById('results');
+                 const noResultsDiv = document.getElementById('noResults');
+                 const statsDiv = document.getElementById('stats');
+
+                 if (results.length === 0) {
+                     noResultsDiv.classList.add('show');
+                     return;
+                 }
+
+                 statsDiv.textContent = `Found ${results.length} results for "${query}"`;  // textContent keeps the raw query from being parsed as HTML
+
+                 resultsDiv.innerHTML = '';
+
+                 results.forEach(result => {
+                     const card = document.createElement('div');
+                     card.className = 'result-card';
+                     card.style.animationDelay = `${(result.rank - 1) * 0.1}s`;
+
+                     card.innerHTML = `
+                         <div class="result-header">
+                             <div style="display: flex; gap: 10px; align-items: center;">
+                                 <span class="result-rank">Rank ${result.rank}</span>
+                                 <span class="result-doc">${result.doc_id.toUpperCase()}</span>
+                             </div>
+                             <span class="result-score">Score: ${result.similarity_score.toFixed(4)}</span>
+                         </div>
+                         <div class="result-text">${result.text}</div>
+                     `;
+
+                     resultsDiv.appendChild(card);
+                 });
+
+                 resultsDiv.classList.add('show');
+             }
+         </script>
+     </body>
+     </html>
+     """
+     return HTMLResponse(content=html_content)
+
+ @app.post("/search", response_model=list[SearchResult])
+ def search(search_query: SearchQuery):
+     """
+     Return the top 5 relevant chunks for a query. Declared as a plain `def` so FastAPI runs the blocking model.encode() call in its threadpool.
+     """
+     if not search_query.query.strip():
+         raise HTTPException(status_code=400, detail="Query cannot be empty")
+
+     try:
+         # Encode the query
+         query_embedding = model.encode([search_query.query])
+
+         # Calculate cosine similarity
+         similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]
+
+         # Create results
+         results = []
+         for idx, score in enumerate(similarities):
+             results.append({
+                 'chunk_index': idx,
+                 'doc_id': chunk_metadata[idx]['doc_id'],
+                 'similarity_score': float(score),
+                 'text': chunk_metadata[idx]['text']
+             })
+
+         # Sort by similarity score
+         results_sorted = sorted(results, key=lambda x: x['similarity_score'], reverse=True)
+
+         # Return top 5 results
+         top_results = []
+         for rank, result in enumerate(results_sorted[:5], 1):
+             top_results.append(SearchResult(
+                 rank=rank,
+                 doc_id=result['doc_id'],
+                 similarity_score=result['similarity_score'],
+                 text=result['text']
+             ))
+
+         return top_results
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
+
+ @app.get("/health")
+ async def health_check():
+     """Health check endpoint"""
+     return {"status": "healthy", "total_chunks": len(chunks)}
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)  # local port; the Space front matter declares app_port 7860
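Review note: the `/search` handler sorts every chunk score and then slices the first five. For the ten or so chunks in this commit that is perfectly adequate; for a larger index, a partial selection is one possible alternative. The sketch below uses the standard library's `heapq.nlargest`; the function name `top_ranked` and the scores are hypothetical, and this is not what main.py currently does.

```python
import heapq


def top_ranked(scores, k=5):
    """Return (rank, index, score) tuples for the k highest scores, rank starting at 1."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    return [(rank, idx, score) for rank, (idx, score) in enumerate(best, start=1)]


# Hypothetical similarity scores for six chunks
scores = [0.12, 0.87, 0.45, 0.91, 0.30, 0.66]
top3 = top_ranked(scores, k=3)
```

The returned index plays the same role as `idx` in the handler, so `chunk_metadata[idx]` could still be used to look up `doc_id` and `text`.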
quick_start.sh ADDED
@@ -0,0 +1,52 @@
+ #!/bin/bash
+
+ # Quick Start Script for Local Testing
+ # Run this before deploying to Hugging Face to test locally
+
+ echo "🚀 Starting Simple Search Engine Setup..."
+ echo ""
+
+ # Check if Python is installed
+ if ! command -v python3 &> /dev/null; then
+     echo "❌ Python 3 is not installed. Please install Python 3.9 or higher."
+     exit 1
+ fi
+
+ echo "✅ Python 3 found"
+
+ # Create virtual environment
+ echo "📦 Creating virtual environment..."
+ python3 -m venv venv
+
+ # Activate virtual environment
+ echo "🔧 Activating virtual environment..."
+ if [[ "$OSTYPE" == "msys" || "$OSTYPE" == "win32" ]]; then
+     # Windows
+     source venv/Scripts/activate
+ else
+     # Linux/Mac
+     source venv/bin/activate
+ fi
+
+ # Install requirements
+ echo "📥 Installing dependencies..."
+ pip install -r requirements.txt
+
+ # Download NLTK data
+ echo "📚 Downloading NLTK data..."
+ python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
+
+ echo ""
+ echo "✅ Setup complete!"
+ echo ""
+ echo "🌐 To start the server locally, run:"
+ echo " python3 main.py"
+ echo ""
+ echo " or"
+ echo ""
+ echo " uvicorn main:app --reload --host 0.0.0.0 --port 8000"
+ echo ""
+ echo "📱 Then open your browser to: http://localhost:8000"
+ echo ""
+ echo "🚀 Ready to deploy to Hugging Face? Follow the DEPLOYMENT_GUIDE.md"
+ echo ""
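Review note: the script checks that `python3` exists but not which version it is. main.py passes `list[SearchResult]` to the route decorator at import time, and built-in generics of that form need Python 3.9 or later. A version guard could be sketched as below; `MINIMUM_PYTHON` and `meets_minimum` are hypothetical names, and the script in this commit does not perform this check.

```python
import sys

# Assumption: 3.9 is the floor because main.py subscripts the built-in `list` at runtime
MINIMUM_PYTHON = (3, 9)


def meets_minimum(version_info, minimum=MINIMUM_PYTHON):
    """True when the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum(sys.version_info):
    print("Python 3.9 or higher is required.")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where "3.10" would sort before "3.9".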
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ fastapi
+ uvicorn
+ sentence-transformers
+ scikit-learn
+ nltk
+ numpy
+ pydantic