ChromaDB Refresh Feature Documentation
Overview
The ChromaDB refresh feature allows you to automatically delete and recreate your local vector database on application startup. This is useful when you add new recipe files or update existing content that needs to be re-indexed.
Configuration
Environment Variables
Add the following to your .env file:
# Set to true to delete and recreate DB on startup (useful for adding new recipes)
DB_REFRESH_ON_START=false
Default: false (disabled)
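How the application parses this variable is not shown here. The sketch below is one possible approach, assuming the value is read with os.getenv; the names env_flag and TRUE_VALUES are hypothetical. It compares the string explicitly because bool("false") is True in Python, so a naive cast would enable the refresh by accident.

```python
import os

# Spellings treated as "enabled"; anything else (including unset) is False.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Return True only when the variable holds a recognised truthy string."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


# Hypothetical usage: disabled unless .env explicitly turns it on.
refresh_on_start = env_flag("DB_REFRESH_ON_START")
```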
Environment Files Updated
- ✅ .env - Your local configuration
- ✅ .env.example - Template for new deployments
- ✅ config/database.py - Configuration class updated
- ✅ services/vector_store.py - Implementation added
How It Works
Normal Operation (DB_REFRESH_ON_START=false)
- Check if DB_PERSIST_DIRECTORY exists
- If exists with data → Load existing ChromaDB
- If empty/missing → Create new ChromaDB from recipe files
Refresh Mode (DB_REFRESH_ON_START=true)
- Check if DB_PERSIST_DIRECTORY exists
- If exists → Delete entire directory 🚨
- Create new ChromaDB from recipe files in ./data/recipes/
- All data is re-indexed with current embedding model
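The two modes can be sketched as a single decision function. This is an illustration only, not the project's implementation: prepare_persist_dir is a hypothetical helper, and it returns a label instead of calling ChromaDB so that the branching stays easy to follow.

```python
import shutil
from pathlib import Path


def prepare_persist_dir(persist_dir: Path, refresh_on_start: bool) -> str:
    """Return "load" if an existing store should be reused, else "create"."""
    if refresh_on_start and persist_dir.is_dir():
        # Refresh mode: remove the whole directory so no stale vectors survive.
        shutil.rmtree(persist_dir)

    if persist_dir.is_dir() and any(persist_dir.iterdir()):
        return "load"  # existing data found, reuse it

    # Empty or missing: make sure the directory exists, then rebuild.
    persist_dir.mkdir(parents=True, exist_ok=True)
    return "create"
```

A caller would load the existing collection on "load" and re-index the recipe files on "create".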
Usage Examples
Adding New Recipes
# 1. Add new recipe files to ./data/recipes/
cp new_recipes.json ./data/recipes/
# 2. Enable refresh in .env
DB_REFRESH_ON_START=true
# 3. Start application (will recreate database)
uvicorn app:app --reload
# 4. Disable refresh (IMPORTANT!)
DB_REFRESH_ON_START=false
Changing Embedding Models
# 1. Change embedding provider in .env
EMBEDDING_PROVIDER=openai
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
# 2. Enable refresh to rebuild vectors
DB_REFRESH_ON_START=true
# 3. Start application
uvicorn app:app --reload
# 4. Disable refresh
DB_REFRESH_ON_START=false
Troubleshooting Vector Issues
# If ChromaDB is corrupted or having issues
DB_REFRESH_ON_START=true
# Restart app to rebuild from scratch
Important Warnings ⚠️
Data Loss Warning
- Refresh DELETES ALL existing vector data
- This operation CANNOT be undone
- Always back up important data before refresh
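One way to take such a backup is to copy the persist directory to a timestamped folder before enabling the refresh. The sketch below assumes the application is stopped while it runs, so no files are mid-write; backup_persist_dir and the backup location are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_persist_dir(persist_dir: Path, backup_root: Path) -> Optional[Path]:
    """Copy persist_dir into a timestamped folder under backup_root."""
    if not persist_dir.is_dir():
        return None  # nothing to back up

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{persist_dir.name}-{stamp}"
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(persist_dir, target)
    return target
```

Restoring is the reverse copy: delete the damaged persist directory and copy the backup folder back to DB_PERSIST_DIRECTORY with refresh disabled.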
Performance Impact
- Re-indexing takes time (depends on recipe count)
- Embedding API calls cost money (OpenAI, Google)
- Application startup will be slower during refresh
Memory Usage
- Large recipe datasets require more memory during indexing
- Monitor system resources during refresh
Best Practices
DO
- Set DB_REFRESH_ON_START=false after refresh completes
- Test refresh in development before production
- Monitor logs during refresh process
- Add new recipes in batches if possible
DON'T
- Leave refresh enabled in production
- Refresh unnecessarily (wastes resources)
- Interrupt refresh process (may corrupt data)
- Forget to disable after refresh
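A startup guard can make it harder to leave refresh enabled in production by mistake. The sketch below is a suggestion rather than part of the feature: APP_ENV is a hypothetical variable that this project may not define, and refresh_allowed is a hypothetical helper.

```python
import logging
import os
from typing import Mapping

logger = logging.getLogger(__name__)


def refresh_allowed(env: Mapping[str, str]) -> bool:
    """Decide whether a destructive refresh should run in this environment."""
    wants_refresh = env.get("DB_REFRESH_ON_START", "false").strip().lower() == "true"
    # APP_ENV is an assumed convention, not a variable from this project.
    is_production = env.get("APP_ENV", "development").strip().lower() == "production"

    if wants_refresh and is_production:
        logger.warning("DB_REFRESH_ON_START=true ignored because APP_ENV=production")
        return False
    return wants_refresh


refresh_on_start = refresh_allowed(os.environ)
```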
Monitoring and Logs
The refresh process is fully logged:
DB_REFRESH_ON_START=true - Deleting existing ChromaDB at ./data/chromadb_persist
✅ Existing ChromaDB deleted successfully
Creating new ChromaDB at ./data/chromadb_persist
✅ Created ChromaDB with 150 document chunks
Configuration Reference
Complete Environment Setup
# Vector Store Configuration
VECTOR_STORE_PROVIDER=chromadb
DB_PATH=./data/chromadb
DB_COLLECTION_NAME=recipes
DB_PERSIST_DIRECTORY=./data/chromadb_persist
# Refresh Control
DB_REFRESH_ON_START=false # Set to true only when needed
# Embedding Configuration
EMBEDDING_PROVIDER=huggingface
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
Database Configuration Object
from config.database import DatabaseSettings
db_settings = DatabaseSettings()
config = db_settings.get_vector_store_config()
# Access refresh setting
refresh_enabled = config['refresh_on_start'] # boolean
Troubleshooting
Common Issues
Refresh not working:
- Check .env file has DB_REFRESH_ON_START=true
- Verify environment is loaded correctly
- Check file permissions on persist directory
Application won't start after refresh:
- Check recipe files exist in ./data/recipes/
- Verify embedding provider credentials
- Review application logs for specific errors
Partial refresh/corruption:
- Delete persist directory manually
- Set DB_REFRESH_ON_START=true and restart
- Check disk space availability
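Several of these checks can be gathered in one pass. The following is a rough diagnostic sketch, not a tool shipped with the project; diagnose_refresh is a hypothetical helper, and it only reports what it finds without changing anything.

```python
import os
import shutil
from pathlib import Path


def diagnose_refresh(persist_dir: Path, recipes_dir: Path) -> dict:
    """Collect the facts the troubleshooting checklist asks about."""
    # Permissions and disk space are probed on the nearest existing ancestor,
    # because persist_dir itself may already have been deleted.
    probe = persist_dir
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent

    recipe_count = 0
    if recipes_dir.is_dir():
        recipe_count = sum(1 for p in recipes_dir.iterdir() if p.is_file())

    return {
        "refresh_flag": os.getenv("DB_REFRESH_ON_START"),  # None if not loaded
        "persist_dir_exists": persist_dir.is_dir(),
        "writable": os.access(probe, os.W_OK),
        "free_mb": shutil.disk_usage(probe).free // (1024 * 1024),
        "recipe_files": recipe_count,
    }


if __name__ == "__main__":
    report = diagnose_refresh(Path("./data/chromadb_persist"), Path("./data/recipes"))
    for key, value in report.items():
        print(f"{key}: {value}")
```

A refresh_flag of None suggests the .env file was not loaded into the process environment; a recipe_files count of 0 means a rebuild would have nothing to index.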
Emergency Recovery
If refresh fails or corrupts data:
# Manual cleanup
rm -rf ./data/chromadb_persist
# Reset configuration
DB_REFRESH_ON_START=true
# Restart application
uvicorn app:app --reload
Testing
Test the refresh functionality:
# Run refresh tests
python3 test_refresh.py
# Demo the feature
python3 demo_refresh.py
Implementation Details
Files Modified
config/database.py
- Added DB_REFRESH_ON_START environment variable
- Updated get_vector_store_config() method
services/vector_store.py
- Added shutil import for directory deletion
- Implemented refresh logic in _get_or_create_vector_store()
- Added comprehensive logging
Environment Files
- Updated .env and .env.example with new variable
- Added documentation comments
Code Changes
# In vector_store.py
if refresh_on_start and persist_dir.exists():
    logger.info(f"DB_REFRESH_ON_START=true - Deleting existing ChromaDB at {persist_dir}")
    shutil.rmtree(persist_dir)
    logger.info("✅ Existing ChromaDB deleted successfully")
This feature gives you an explicit, opt-in way to rebuild the vector database from the recipe files: nothing is deleted unless DB_REFRESH_ON_START is set to true, and every step of the rebuild is logged.