Jesse Johnson

New commit for backend deployment: 2025-09-25_13-24-03

c59d808 5 months ago

ChromaDB Refresh Feature Documentation

Overview

The ChromaDB refresh feature allows you to automatically delete and recreate your local vector database on application startup. This is useful when you add new recipe files or update existing content that needs to be re-indexed.

Configuration

Environment Variables

Add the following to your .env file:

# Set to true to delete and recreate DB on startup (useful for adding new recipes)
DB_REFRESH_ON_START=false

Default: false (disabled)
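Values in a .env file arrive as strings, so the flag has to be converted to a boolean somewhere. The sketch below shows one possible way to do that. The helper name `env_flag` and the set of accepted truthy spellings are assumptions for illustration; they are not taken from config/database.py.

```python
import os

# Hypothetical helper, not part of the project's config/database.py.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Return True when the variable holds a truthy string (case-insensitive)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default  # variable not set: fall back to the default (disabled)
    return raw.strip().lower() in TRUTHY


# Reads the real process environment; False unless the variable is set to a truthy value.
refresh_on_start = env_flag("DB_REFRESH_ON_START")
```

Treating anything outside the truthy set as false keeps a typo such as `DB_REFRESH_ON_START=ture` from triggering a destructive refresh.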

Environment Files Updated

✅ .env - Your local configuration
✅ .env.example - Template for new deployments
✅ config/database.py - Configuration class updated
✅ services/vector_store.py - Implementation added

How It Works

Normal Operation (DB_REFRESH_ON_START=false)

1. Check if DB_PERSIST_DIRECTORY exists
2. If exists with data → Load existing ChromaDB
3. If empty/missing → Create new ChromaDB from recipe files

Refresh Mode (DB_REFRESH_ON_START=true)

1. Check if DB_PERSIST_DIRECTORY exists
2. If exists → Delete entire directory 🚨
3. Create new ChromaDB from recipe files in ./data/recipes/
4. All data is re-indexed with current embedding model
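The two modes above can be sketched as a single decision function. This is a simplified illustration using only the standard library: the function name `plan_startup` and its "load"/"create" return values are assumptions, and the actual loading and indexing done in services/vector_store.py is left out.

```python
import shutil
from pathlib import Path


def plan_startup(persist_dir: Path, refresh_on_start: bool) -> str:
    """Return "load" or "create"; in refresh mode, delete persist_dir first."""
    if refresh_on_start and persist_dir.exists():
        shutil.rmtree(persist_dir)  # destructive: removes all stored vectors
    # A directory that is missing or empty means there is nothing to load.
    has_data = persist_dir.is_dir() and any(persist_dir.iterdir())
    return "load" if has_data else "create"
```

With refresh enabled the result is always "create", because the directory no longer exists by the time it is checked.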

Usage Examples

Adding New Recipes

# 1. Add new recipe files to ./data/recipes/
cp new_recipes.json ./data/recipes/

# 2. Enable refresh in .env
DB_REFRESH_ON_START=true

# 3. Start application (will recreate database)
uvicorn app:app --reload

# 4. Once startup completes, disable refresh in .env (IMPORTANT: otherwise every restart deletes and rebuilds the database)
DB_REFRESH_ON_START=false
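Steps 2 and 4 are manual edits to the .env file, not shell commands. If you prefer to script them, a small helper along these lines could flip the value. The function name `set_env_value` is hypothetical, and this sketch drops any inline comment on the line it replaces.

```python
from pathlib import Path


def set_env_value(env_path: Path, key: str, value: str) -> None:
    """Set key=value in a .env file, replacing the assignment or appending one."""
    lines = env_path.read_text(encoding="utf-8").splitlines() if env_path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        # Only uncommented assignments of exactly this key are replaced.
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    env_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Example (not executed here):
# set_env_value(Path(".env"), "DB_REFRESH_ON_START", "true")
```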

Changing Embedding Models

# 1. Change embedding provider in .env
EMBEDDING_PROVIDER=openai
OPENAI_EMBEDDING_MODEL=text-embedding-3-large

# 2. Enable refresh to rebuild vectors
DB_REFRESH_ON_START=true

# 3. Start application
uvicorn app:app --reload

# 4. Disable refresh in .env once the rebuild completes
DB_REFRESH_ON_START=false

Troubleshooting Vector Issues

# If ChromaDB is corrupted or having issues
DB_REFRESH_ON_START=true
# Restart the app to rebuild from scratch, then set DB_REFRESH_ON_START=false again

Important Warnings ⚠️

Data Loss Warning

Refresh DELETES ALL existing vector data
This operation CANNOT be undone
Always back up important data before a refresh
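One simple way to take that backup is to copy the persist directory to a timestamped folder before enabling refresh. The sketch below assumes the application is stopped while the copy runs; the function name `backup_persist_dir` and the backup location are illustrative choices, not part of the project.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_persist_dir(persist_dir: Path, backup_root: Path) -> Optional[Path]:
    """Copy persist_dir to a timestamped folder under backup_root and return its path."""
    if not persist_dir.is_dir():
        return None  # nothing to back up
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = backup_root / f"{persist_dir.name}_{stamp}"
    shutil.copytree(persist_dir, target)  # creates backup_root if it is missing
    return target


# Example (not executed here):
# backup_persist_dir(Path("./data/chromadb_persist"), Path("./data/backups"))
```

Restoring is the reverse copy: stop the application, remove the persist directory, and copy the backup folder back under the original name.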

Performance Impact

Re-indexing takes time (depends on recipe count)
Embedding API calls cost money (OpenAI, Google)
Application startup will be slower during refresh

Memory Usage

Large recipe datasets require more memory during indexing
Monitor system resources during refresh

Best Practices

✅ DO

Set DB_REFRESH_ON_START=false after refresh completes
Test refresh in development before production
Monitor logs during refresh process
Add new recipes in batches if possible

❌ DON'T

Leave refresh enabled in production
Refresh unnecessarily (wastes resources)
Interrupt refresh process (may corrupt data)
Forget to disable after refresh
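To make "don't leave refresh enabled in production" harder to get wrong, a startup guard could ignore the flag in production. This is only a sketch: the `app_env` value (for example read from an `APP_ENV` variable) and the function `refresh_allowed` are hypothetical and do not exist in the project as documented here.

```python
import logging

logger = logging.getLogger(__name__)


def refresh_allowed(refresh_on_start: bool, app_env: str) -> bool:
    """Return True only when refresh is requested outside production."""
    if refresh_on_start and app_env.strip().lower() == "production":
        # Hypothetical policy: never wipe the database in production.
        logger.warning("DB_REFRESH_ON_START=true ignored because app_env is production")
        return False
    return refresh_on_start
```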

Monitoring and Logs

The refresh process is fully logged:

🔄 DB_REFRESH_ON_START=true - Deleting existing ChromaDB at ./data/chromadb_persist
✅ Existing ChromaDB deleted successfully  
🆕 Creating new ChromaDB at ./data/chromadb_persist
✅ Created ChromaDB with 150 document chunks

Configuration Reference

Complete Environment Setup

# Vector Store Configuration
VECTOR_STORE_PROVIDER=chromadb
DB_PATH=./data/chromadb
DB_COLLECTION_NAME=recipes  
DB_PERSIST_DIRECTORY=./data/chromadb_persist

# Refresh Control
DB_REFRESH_ON_START=false  # Set to true only when needed

# Embedding Configuration  
EMBEDDING_PROVIDER=huggingface
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

Database Configuration Object

from config.database import DatabaseSettings

db_settings = DatabaseSettings()
config = db_settings.get_vector_store_config()

# Access refresh setting
refresh_enabled = config['refresh_on_start']  # boolean

Troubleshooting

Common Issues

Refresh not working:

Check .env file has DB_REFRESH_ON_START=true
Verify environment is loaded correctly
Check file permissions on persist directory

Application won't start after refresh:

Check recipe files exist in ./data/recipes/
Verify embedding provider credentials
Review application logs for specific errors

Partial refresh/corruption:

Delete persist directory manually
Set DB_REFRESH_ON_START=true and restart
Check disk space availability
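Several of the checks above (recipe files present, write permission, free disk space) can be automated with the standard library. The following is a minimal sketch; the function name `diagnose` and the 100 MB free-space threshold are arbitrary assumptions.

```python
import os
import shutil
from pathlib import Path


def diagnose(persist_dir: Path, recipes_dir: Path,
             min_free_bytes: int = 100 * 1024 * 1024) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not recipes_dir.is_dir() or not any(recipes_dir.iterdir()):
        problems.append(f"no recipe files found in {recipes_dir}")
    # Walk up to the nearest existing directory, since persist_dir may be missing.
    probe = persist_dir
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    if not os.access(probe, os.W_OK):
        problems.append(f"no write permission on {probe}")
    if shutil.disk_usage(probe).free < min_free_bytes:
        problems.append(f"less than {min_free_bytes} bytes free on {probe}")
    return problems


# Example (not executed here):
# print(diagnose(Path("./data/chromadb_persist"), Path("./data/recipes")))
```

It does not check embedding provider credentials, which still have to be verified separately.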

Emergency Recovery

If refresh fails or corrupts data:

# Manual cleanup
rm -rf ./data/chromadb_persist

# Force a rebuild in .env (optional: with the directory removed, a normal start also recreates the database)
DB_REFRESH_ON_START=true

# Restart application
uvicorn app:app --reload

Testing

Test the refresh functionality:

# Run refresh tests
python3 test_refresh.py

# Demo the feature
python3 demo_refresh.py

Implementation Details

Files Modified

config/database.py
- Added DB_REFRESH_ON_START environment variable
- Updated get_vector_store_config() method

services/vector_store.py
- Added shutil import for directory deletion
- Implemented refresh logic in _get_or_create_vector_store()
- Added comprehensive logging

Environment Files
- Updated .env and .env.example with new variable
- Added documentation comments

Code Changes

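# In services/vector_store.py (excerpt: refresh_on_start, persist_dir, and logger are defined elsewhere in the module)
import shutil

if refresh_on_start and persist_dir.exists():
    logger.info(f"🔄 DB_REFRESH_ON_START=true - Deleting existing ChromaDB at {persist_dir}")
    shutil.rmtree(persist_dir)  # recursively removes the whole persist directory
    logger.info("✅ Existing ChromaDB deleted successfully")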

This feature gives you an explicit, opt-in way to rebuild the vector database whenever recipe content or the embedding model changes, while leaving normal startups untouched as long as the flag stays false.