# ChromaDB Refresh Feature Documentation

## Overview

The ChromaDB refresh feature allows you to automatically delete and recreate your local vector database on application startup. This is useful when you add new recipe files or update existing content that needs to be re-indexed.

## Configuration

### Environment Variables

Add the following to your `.env` file:

```bash
# Set to true to delete and recreate DB on startup (useful for adding new recipes)
DB_REFRESH_ON_START=false
```

**Default:** `false` (disabled)
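
Values read from `.env` arrive as strings, and in Python any non-empty string, including `"false"`, is truthy. This document does not show how `config/database.py` converts the value, so the following is only a minimal sketch of one common approach. The helper name `env_flag` and the set of accepted spellings are assumptions made for illustration.

```python
import os

# Spellings treated as "enabled" (an assumption; adjust to your own convention).
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Return True when the named environment variable holds a truthy string."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


refresh_on_start = env_flag("DB_REFRESH_ON_START")
```

With this approach an unset variable falls back to the documented default of `false`, and any value that is not a recognised "true" spelling is treated as disabled.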

### Environment Files Updated

- ✅ `.env` - Your local configuration
- ✅ `.env.example` - Template for new deployments
- ✅ `config/database.py` - Configuration class updated
- ✅ `services/vector_store.py` - Implementation added

## How It Works

### Normal Operation (DB_REFRESH_ON_START=false)
1. Check if `DB_PERSIST_DIRECTORY` exists
2. If exists with data → Load existing ChromaDB
3. If empty/missing → Create new ChromaDB from recipe files

### Refresh Mode (DB_REFRESH_ON_START=true)
1. Check if `DB_PERSIST_DIRECTORY` exists  
2. If exists → **Delete entire directory** 🚨
3. Create new ChromaDB from recipe files in `./data/recipes/`
4. All data is re-indexed with current embedding model
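
The two modes above reduce to a small decision on the state of the persist directory. The sketch below only illustrates that decision and is not the project's implementation; the function name `plan_startup` and its string return values are hypothetical.

```python
from pathlib import Path


def plan_startup(persist_dir: Path, refresh_on_start: bool) -> str:
    """Return 'load', 'create', or 'delete_and_create' for the given state."""
    if refresh_on_start:
        # Refresh mode: an existing directory is deleted, then rebuilt.
        return "delete_and_create" if persist_dir.exists() else "create"
    # Normal mode: reuse the directory only if it actually holds data.
    has_data = persist_dir.is_dir() and any(persist_dir.iterdir())
    return "load" if has_data else "create"


print(plan_startup(Path("./data/chromadb_persist"), refresh_on_start=False))
```

Keeping the decision separate from the deletion and indexing steps makes it easy to test without touching a real database.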

## Usage Examples

### Adding New Recipes

```bash
# 1. Add new recipe files to ./data/recipes/
cp new_recipes.json ./data/recipes/

# 2. Enable refresh in .env
DB_REFRESH_ON_START=true

# 3. Start application (will recreate database)
uvicorn app:app --reload

# 4. Once the rebuild completes, disable refresh in .env (IMPORTANT!)
DB_REFRESH_ON_START=false
```

### Changing Embedding Models

```bash
# 1. Change embedding provider in .env
EMBEDDING_PROVIDER=openai
OPENAI_EMBEDDING_MODEL=text-embedding-3-large

# 2. Enable refresh to rebuild vectors
DB_REFRESH_ON_START=true

# 3. Start application
uvicorn app:app --reload

# 4. Once the rebuild completes, disable refresh in .env
DB_REFRESH_ON_START=false
```

### Troubleshooting Vector Issues

```bash
# If ChromaDB is corrupted or having issues
DB_REFRESH_ON_START=true
# Restart app to rebuild from scratch
```

## Important Warnings ⚠️

### Data Loss Warning
- **Refresh DELETES ALL existing vector data**
- **This operation CANNOT be undone**
- Always back up important data before a refresh
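
One way to take such a backup is to copy the persist directory to a timestamped location before enabling refresh. This is a minimal sketch, assuming the application is stopped while the copy runs; `backup_persist_dir` and the backup location are hypothetical names, not part of the project.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_persist_dir(persist_dir: Path, backup_root: Path) -> Optional[Path]:
    """Copy persist_dir into backup_root under a timestamped name.

    Returns the backup path, or None when there is nothing to back up.
    """
    if not persist_dir.is_dir():
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{persist_dir.name}-{stamp}"
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(persist_dir, target)
    return target


backup = backup_persist_dir(Path("./data/chromadb_persist"), Path("./data/backups"))
print(f"Backup written to: {backup}")
```

To restore, stop the application, remove the current persist directory, and copy the backup back into place with refresh disabled.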

### Performance Impact
- Re-indexing takes time (depends on recipe count)
- Embedding API calls cost money (OpenAI, Google)
- Application startup will be slower during refresh

### Memory Usage
- Large recipe datasets require more memory during indexing
- Monitor system resources during refresh

## Best Practices

### ✅ DO
- Set `DB_REFRESH_ON_START=false` after refresh completes
- Test refresh in development before production
- Monitor logs during refresh process
- Add new recipes in batches if possible

### ❌ DON'T
- Leave refresh enabled in production
- Refresh unnecessarily (wastes resources)
- Interrupt refresh process (may corrupt data)
- Forget to disable after refresh
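
A startup check can make a forgotten flag harder to miss. The sketch below logs a warning when refresh is enabled in a production environment. It assumes a hypothetical `APP_ENV` variable that names the environment, which this project may not define, and `warn_if_refresh_in_production` is likewise an illustrative name.

```python
import logging
import os

logger = logging.getLogger(__name__)


def warn_if_refresh_in_production(refresh_on_start: bool, app_env: str) -> bool:
    """Log a warning and return True when refresh is enabled in production."""
    if refresh_on_start and app_env.strip().lower() == "production":
        logger.warning(
            "DB_REFRESH_ON_START is enabled in production; "
            "the vector database will be deleted on every startup."
        )
        return True
    return False


# APP_ENV is a hypothetical variable used only for this illustration.
app_env = os.environ.get("APP_ENV", "development")
refresh = os.environ.get("DB_REFRESH_ON_START", "false").strip().lower() == "true"
warn_if_refresh_in_production(refresh, app_env)
```

The check only warns and does not block the refresh, so a deliberate production rebuild still works.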

## Monitoring and Logs

The refresh process is fully logged:

```
🔄 DB_REFRESH_ON_START=true - Deleting existing ChromaDB at ./data/chromadb_persist
✅ Existing ChromaDB deleted successfully
🆕 Creating new ChromaDB at ./data/chromadb_persist
✅ Created ChromaDB with 150 document chunks
```

## Configuration Reference

### Complete Environment Setup

```bash
# Vector Store Configuration
VECTOR_STORE_PROVIDER=chromadb
DB_PATH=./data/chromadb
DB_COLLECTION_NAME=recipes  
DB_PERSIST_DIRECTORY=./data/chromadb_persist

# Refresh Control
DB_REFRESH_ON_START=false  # Set to true only when needed

# Embedding Configuration  
EMBEDDING_PROVIDER=huggingface
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```

### Database Configuration Object

```python
from config.database import DatabaseSettings

db_settings = DatabaseSettings()
config = db_settings.get_vector_store_config()

# Access refresh setting
refresh_enabled = config['refresh_on_start']  # boolean
```

## Troubleshooting

### Common Issues

**Refresh not working:**
- Check that the `.env` file contains `DB_REFRESH_ON_START=true`
- Verify environment is loaded correctly
- Check file permissions on persist directory

**Application won't start after refresh:**
- Check recipe files exist in `./data/recipes/`
- Verify embedding provider credentials
- Review application logs for specific errors

**Partial refresh/corruption:**
- Delete persist directory manually
- Set `DB_REFRESH_ON_START=true` and restart
- Check disk space availability
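
Several of the checks above (recipe files present, writable persist location, free disk space) can be run before enabling refresh. The following is a rough sketch using only the standard library; `preflight_problems` and the 100 MB threshold are assumptions chosen for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import List


def preflight_problems(
    recipes_dir: Path,
    persist_dir: Path,
    min_free_bytes: int = 100 * 1024 * 1024,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not recipes_dir.is_dir() or not any(recipes_dir.iterdir()):
        problems.append(f"no recipe files found in {recipes_dir}")

    # Walk up to the nearest existing ancestor of the persist directory.
    existing = persist_dir
    while not existing.exists():
        existing = existing.parent

    if not os.access(existing, os.W_OK):
        problems.append(f"{existing} is not writable")
    if shutil.disk_usage(existing).free < min_free_bytes:
        problems.append(f"less than {min_free_bytes} bytes free near {existing}")
    return problems


for problem in preflight_problems(Path("./data/recipes"), Path("./data/chromadb_persist")):
    print(f"Preflight problem: {problem}")
```

An empty result does not guarantee a successful refresh (embedding credentials are not checked here), but it rules out the file-system causes listed above.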

### Emergency Recovery

If refresh fails or corrupts data:

```bash
# Manual cleanup
rm -rf ./data/chromadb_persist

# Reset configuration  
DB_REFRESH_ON_START=true

# Restart application
uvicorn app:app --reload
```

## Testing

Test the refresh functionality:

```bash
# Run refresh tests
python3 test_refresh.py

# Demo the feature
python3 demo_refresh.py
```

## Implementation Details

### Files Modified

1. **`config/database.py`**
   - Added `DB_REFRESH_ON_START` environment variable
   - Updated `get_vector_store_config()` method

2. **`services/vector_store.py`**  
   - Added `shutil` import for directory deletion
   - Implemented refresh logic in `_get_or_create_vector_store()`
   - Added comprehensive logging

3. **Environment Files**
   - Updated `.env` and `.env.example` with new variable
   - Added documentation comments

### Code Changes

```python
# In vector_store.py
if refresh_on_start and persist_dir.exists():
    logger.info(f"🔄 DB_REFRESH_ON_START=true - Deleting existing ChromaDB at {persist_dir}")
    shutil.rmtree(persist_dir)
    logger.info("✅ Existing ChromaDB deleted successfully")
```
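
The excerpt above relies on names (`logger`, `persist_dir`, `refresh_on_start`) defined elsewhere in `vector_store.py`. For experimenting outside the application, a self-contained version might look like the sketch below. The function name `delete_if_refresh` and the error handling are illustrative additions, not the project's code.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def delete_if_refresh(persist_dir: Path, refresh_on_start: bool) -> bool:
    """Delete persist_dir when refresh is enabled; return True if deleted."""
    if not (refresh_on_start and persist_dir.exists()):
        return False
    logger.info(
        "DB_REFRESH_ON_START=true - Deleting existing ChromaDB at %s", persist_dir
    )
    try:
        shutil.rmtree(persist_dir)
    except OSError:
        # Log with traceback, then re-raise so startup fails visibly.
        logger.exception("Could not delete %s", persist_dir)
        raise
    logger.info("Existing ChromaDB deleted successfully")
    return True
```

Re-raising on failure avoids building a new index on top of a partially deleted directory.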

This feature gives you a simple, explicit switch for rebuilding the vector database from the recipe files whenever the content or the embedding model changes.