	---
	license: agpl-3.0
	sdk: gradio
	---
	# Vector Search Demo App

	This is a Gradio web application that demonstrates vector search capabilities using MongoDB Atlas and OpenAI embeddings.

	## Prerequisites

	1. MongoDB Atlas account with vector search enabled
	2. OpenAI API key
	3. Python 3.8+
	4. Sample movie data loaded in MongoDB Atlas (sample_mflix database)

	## Setup

	1. Clone this repository

	2. Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	3. Set up environment variables:
	```bash
	export OPENAI_API_KEY="your-openai-api-key"
	export ATLAS_URI="your-mongodb-atlas-connection-string"
	```

	4. Make sure your MongoDB Atlas deployment is set up as follows:
	- Database name: sample_mflix
	- Collection: embedded_movies
	- Vector search index: idx_plot_embedding
	- Index configuration:
	```json
	{
	  "fields": [
	    {
	      "type": "vector",
	      "path": "plot_embedding",
	      "numDimensions": 1536,
	      "similarity": "dotProduct"
	    }
	  ]
	}
	```
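
The index configuration above can also be created from Python instead of the Atlas UI. The following is a minimal sketch, assuming a recent PyMongo release that provides `SearchIndexModel` with a `type` argument and `Collection.create_search_index`. The helper names `build_index_definition` and `create_vector_index` are illustrative and are not taken from app.py.

```python
import os


def build_index_definition(path="plot_embedding", dims=1536, similarity="dotProduct"):
    """Return the vector search index definition from the Setup section as a dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dims,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(index_name="idx_plot_embedding"):
    """Create the index on sample_mflix.embedded_movies (needs network access)."""
    # Imported here so build_index_definition works without PyMongo installed.
    from pymongo import MongoClient
    from pymongo.operations import SearchIndexModel

    client = MongoClient(os.environ["ATLAS_URI"])
    collection = client["sample_mflix"]["embedded_movies"]
    model = SearchIndexModel(
        definition=build_index_definition(),
        name=index_name,
        type="vectorSearch",  # assumption: supported by the installed PyMongo version
    )
    return collection.create_search_index(model=model)
```

Calling `create_vector_index()` once after exporting `ATLAS_URI` should be enough. Atlas builds the index asynchronously, so it may take a short while before searches return results.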

	## Running the App

	Start the application:
	```bash
	python app.py
	```

	The app will be available at http://localhost:7860

	## Usage

	### Generating Embeddings
	1. Select your database and collection from the dropdowns
	2. Choose the field to generate embeddings for
	3. Specify the embedding field name (defaults to "embedding"); to search over these embeddings, the vector search index must have its path set to this field
	4. Set a document limit (0 for all documents)
	5. Click "Generate Embeddings" to start processing

	The app uses memory-efficient cursor-based batch processing that can handle large collections:
	- Documents are processed in batches (default 20 documents per batch)
	- Memory usage is optimized through cursor-based iteration
	- Real-time progress tracking shows completed/total documents
	- Supports processing of large collections (100,000+ documents)
	- Automatically resumes from where it left off if embeddings already exist
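
The batching and resume behavior described above can be sketched roughly as follows. This is not the code from app.py: it assumes that resuming is done by selecting only documents that lack the embedding field, and the helper names `missing_embedding_filter` and `iter_batches` are hypothetical. A plain generator stands in for a MongoDB cursor so the example runs without a database.

```python
from itertools import islice


def missing_embedding_filter(embedding_field="embedding"):
    """Query filter that selects only documents without an embedding yet."""
    return {embedding_field: {"$exists": False}}


def iter_batches(cursor, batch_size=20):
    """Yield lists of up to batch_size documents from any iterable or cursor."""
    iterator = iter(cursor)
    while True:
        # Pull at most batch_size items; only one batch is held in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for collection.find(missing_embedding_filter()): 45 fake documents.
fake_cursor = ({"_id": i, "plot": f"plot {i}"} for i in range(45))
batch_sizes = [len(batch) for batch in iter_batches(fake_cursor)]
print(batch_sizes)  # [20, 20, 5]
```

Because the filter skips documents that already have the field, rerunning the same loop after an interruption would pick up only the remaining documents.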

	### Searching
	1. Enter a natural language query in the text box (e.g., "humans fighting aliens")
	2. Click "Submit" to search
	3. View the results showing matching documents with their similarity scores
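
Behind a search like this, the query text is embedded and passed to Atlas in a `$vectorSearch` aggregation stage. The sketch below only builds such a pipeline and is an assumption about how the app could do it, not a copy of app.py. The function name `build_search_pipeline`, the `numCandidates` value of 100, and the projected `title` and `plot` fields are illustrative choices.

```python
def build_search_pipeline(query_vector, index="idx_plot_embedding",
                          path="plot_embedding", limit=5, num_candidates=100):
    """Build an aggregation pipeline for an Atlas $vectorSearch query."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,  # candidates considered before ranking
                "limit": limit,                   # number of results returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                # Expose the similarity score computed by Atlas.
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A zero vector stands in for the embedding of the query text.
pipeline = build_search_pipeline([0.0] * 1536)
```

With a real query embedding, the pipeline would be run as `collection.aggregate(pipeline)` on the `embedded_movies` collection.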

	## Example Queries

	- "humans fighting aliens"
	- "relationship drama between two good friends"
	- "comedy about family vacation"
	- "detective solving mysterious murder"

	## Performance Notes

	The application is optimized for handling large datasets:
	- Uses cursor-based batch processing to avoid memory issues
	- Processes documents in configurable batch sizes (default: 20)
	- Implements parallel processing with ThreadPoolExecutor
	- Provides real-time progress tracking
	- Automatically handles memory cleanup during processing
	- Supports resuming interrupted operations
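
As a rough illustration of the ThreadPoolExecutor point above, the sketch below embeds one batch of documents in parallel. It is a sketch under stated assumptions rather than the app's implementation: `fake_embed` is a placeholder for the real embeddings API call, and `embed_batch` and `max_workers=4` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embed(text):
    """Placeholder for an embeddings API call; returns a tiny dummy vector."""
    return [float(len(text)), 0.0]


def embed_batch(docs, field="plot", embed=fake_embed, max_workers=4):
    """Embed one batch of documents in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        vectors = list(pool.map(embed, (doc[field] for doc in docs)))
    return [(doc["_id"], vec) for doc, vec in zip(docs, vectors)]


docs = [{"_id": 1, "plot": "aliens"}, {"_id": 2, "plot": "a heist"}]
results = embed_batch(docs)
print(results)  # [(1, [6.0, 0.0]), (2, [7.0, 0.0])]
```

Threads suit this workload because each embedding request spends most of its time waiting on the network rather than using the CPU.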

	## Notes

	- The search uses OpenAI's text-embedding-ada-002 model to create embeddings; its vectors have 1536 dimensions, which matches numDimensions in the index configuration
	- Results are limited to the top 5 matches
	- Similarity scores range from 0 to 1, with higher scores indicating better matches
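
The 0 to 1 range follows from how Atlas Vector Search normalizes scores. For the dotProduct similarity, the Atlas documentation describes the score as (1 + dot product) / 2, which stays within 0 and 1 as long as the vectors have unit length. The sketch below reproduces that formula locally; treat it as an assumption to check against the current Atlas documentation. The function names are illustrative.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def atlas_dot_product_score(a, b):
    """Normalized score for unit-length vectors: (1 + dot product) / 2."""
    return (1 + dot(a, b)) / 2


same = atlas_dot_product_score([1.0, 0.0], [1.0, 0.0])        # 1.0
orthogonal = atlas_dot_product_score([1.0, 0.0], [0.0, 1.0])  # 0.5
opposite = atlas_dot_product_score([1.0, 0.0], [-1.0, 0.0])   # 0.0
```

Under this formula a score near 0.5 means the query and the document are roughly unrelated, so useful matches tend to score well above that.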