# Safe Direct Directory Copy Guide for ChromaDB Merge

## ChromaDB File Structure

Your current ChromaDB directory contains:

```
./chroma_db/
├── chroma.sqlite3              # Main database file (metadata index)
└── [Collection UUID folders]   # One folder per collection
    ├── 0808e537-cf80-4b64-a337-0f7ae74dc9d5/
    ├── 24b5ff30-002d-43ff-ad39-44c87a8ac6d0/
    ├── 40bcc7c3-167c-499e-b520-57cf8c723e28/
    ├── 57a7a07d-2746-4b75-85f1-844a334871ba/
    ├── 757cee3b-442d-4ebf-8fe7-ccd27a869786/
    ├── 8d4abc47-3860-4eab-bcda-42aa64156c63/
    ├── 98152a87-2e94-4077-be26-b9305747289f/
    ├── c91615ef-3bc8-4667-a51f-9400741c7591/
    ├── ec92fa49-44c7-4af8-8eb6-2d49a4ca6a82/
    └── f333d54b-ad48-4ede-9479-1aab4d56f332/
```

## Critical Files to Copy

### **1. Main Database File (MUST BACK UP)**
- **File:** `chroma.sqlite3`
- **Purpose:** Stores the collection registry (names and IDs), document IDs, document text and metadata, and the references that link each collection to its index folder
- **Size:** ~500 MB (in your case)
- **Status:** ⚠️ **CRITICAL - DO NOT SKIP** (always include it in the backup; do not copy the external one over it, see Phase 3)

### **2. Collection Folders (MUST COPY)**
- **Location:** UUID-named directories inside `./chroma_db/`
- **Files per collection:**
  - `data_level0.bin` - Actual vector embeddings data
  - `header.bin` - Index header information
  - `index_metadata.pickle` - Index metadata
  - `length.bin` - Length information for the HNSW link lists
  - `link_lists.bin` - HNSW graph links (similarity search structure)
- **Status:** ⚠️ **CRITICAL - DO NOT SKIP**

---

## Safe Copy Strategy

### **Phase 1: Backup (Protect Current Collections)**

**Step 1: Stop the application**

```powershell
# Stop Streamlit if running
# Close terminal or press Ctrl+C
```

**Step 2: Create backup of current collections**

```powershell
# Navigate to project directory
cd "d:\CapStoneProject\RAG Capstone Project"

# Create timestamped backup
$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
Copy-Item -Path ".\chroma_db" -Destination ".\chroma_db.backup_$timestamp" -Recurse

# Verify backup was created
Get-ChildItem -Path ".\chroma_db.backup_$timestamp" | Measure-Object | Select-Object Count
```

**Step 3: Verify
backup integrity**

```powershell
# Count UUID folders in the live directory and in the backup; the two numbers should match
# ($timestamp comes from Step 2, so run this in the same PowerShell session)
$pattern = "^[a-f0-9\-]{36}$"
(Get-ChildItem ".\chroma_db" -Directory | Where-Object {$_.Name -match $pattern} | Measure-Object).Count
(Get-ChildItem ".\chroma_db.backup_$timestamp" -Directory | Where-Object {$_.Name -match $pattern} | Measure-Object).Count
```

---

### **Phase 2: Copy External Collections**

**Step 4: Copy external collection folders ONLY**

**Option A: Copy Specific Collections (RECOMMENDED - Safest)**

If the external ChromaDB has UUID folders, identify which ones to copy:

```powershell
# Copy only external collection folders (not chroma.sqlite3)
$externalPath = "C:\path\to\external\chroma_db"
$targetPath = ".\chroma_db"

# Optional: UUID folder names to copy; leave empty to copy every UUID folder
$selectedIds = @()

# Get external collection folder UUIDs
$externalCollections = Get-ChildItem -Path $externalPath -Directory |
    Where-Object { $_.Name -match "^[a-f0-9\-]{36}$" } |
    Where-Object { $selectedIds.Count -eq 0 -or $selectedIds -contains $_.Name }

# Copy each one
foreach ($collection in $externalCollections) {
    $sourceFolder = Join-Path $externalPath $collection.Name
    $destFolder = Join-Path $targetPath $collection.Name

    # Check if already exists
    if (Test-Path $destFolder) {
        Write-Host "⚠️ Collection $($collection.Name) already exists - SKIPPING"
    } else {
        Write-Host "📋 Copying collection $($collection.Name)..."
        Copy-Item -Path $sourceFolder -Destination $destFolder -Recurse -Force
        Write-Host "✅ Copied successfully"
    }
}
```

**Option B: Copy Entire External ChromaDB (If confident)**

```powershell
# Copy all folders from external (NOT the sqlite3 file initially)
$externalPath = "C:\path\to\external\chroma_db"
$targetPath = ".\chroma_db"

# Copy all subdirectories
Get-ChildItem -Path $externalPath -Directory | ForEach-Object {
    if ($_.Name -match "^[a-f0-9\-]{36}$") {  # UUID format
        $destFolder = Join-Path $targetPath $_.Name
        if (-not (Test-Path $destFolder)) {
            Copy-Item -Path $_.FullName -Destination $destFolder -Recurse -Force
            Write-Host "✅ Copied $($_.Name)"
        } else {
            Write-Host "⚠️ Skipped $($_.Name) (already exists)"
        }
    }
}
```

---

### **Phase 3: Handle the SQLite Database**

**⚠️ CRITICAL: DO NOT simply copy chroma.sqlite3**

The `chroma.sqlite3` file contains the metadata that registers each collection and links it to its UUID folder.
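To see what that metadata looks like before touching the file, the database can be opened read-only with Python's built-in `sqlite3` module. The sketch below lists the tables in the file and, if a `collections` table with a `name` column is present, the registered collection names. The table and column names are assumptions based on commonly observed ChromaDB schemas and may differ between versions; the function name `inspect_chroma_db` is hypothetical.

```python
import sqlite3
from pathlib import Path


def inspect_chroma_db(db_path):
    """Return (table_names, collection_names) for a ChromaDB SQLite file.

    The file is opened read-only, so the inspection cannot modify it.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        names = []
        # Assumption: collections are registered in a `collections` table
        # with a `name` column; return an empty list if this version differs.
        if "collections" in tables:
            columns = [row[1] for row in conn.execute("PRAGMA table_info(collections)")]
            if "name" in columns:
                names = [row[0] for row in conn.execute(
                    "SELECT name FROM collections ORDER BY name")]
        return tables, names
    finally:
        conn.close()


if __name__ == "__main__":
    db_file = Path("./chroma_db/chroma.sqlite3")
    if db_file.exists():
        tables, names = inspect_chroma_db(db_file)
        print("Tables:", tables)
        print("Collections:", names)
```

Running it against both your own and the external `chroma.sqlite3` (with the application stopped) shows which collection names each file knows about, which helps spot name conflicts before merging.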
If you copy the external file over your own, you replace your collection registry with the external one, so you might lose existing collections or create conflicts.

**Step 5: Merge SQLite Databases (Choose ONE approach)**

#### **Option A: Recreate chroma.sqlite3 (VERIFY ON A COPY FIRST)**

```powershell
# 1. Delete the old chroma.sqlite3 (only after the Phase 1 backup has been verified)
Remove-Item -Path ".\chroma_db\chroma.sqlite3" -Force

# 2. Start your application - ChromaDB creates a new chroma.sqlite3 if none exists
# 3. Check whether all collections are listed (see the caution below)
# Restart app: streamlit run streamlit_app.py
```

⚠️ This option relies on ChromaDB detecting the collection folders and registering them in the new sqlite3 file. That behavior is not documented: collection names, document text, and metadata are stored in `chroma.sqlite3`, not in the UUID folders, so a newly created file is expected to start out empty and may list neither the old nor the copied collections. Try this on a copy of `./chroma_db` first, and restore from the backup if any collection is missing.

#### **Option B: Merge SQLite Files (ADVANCED)**

Only if you want to preserve both old and new collections' metadata:

```powershell
# This requires SQLite tools - install if needed
# choco install sqlite
# or: winget install sqlite

# 1. Backup both sqlite3 files
Copy-Item ".\chroma_db\chroma.sqlite3" -Destination ".\chroma_db\chroma.sqlite3.backup"
Copy-Item "C:\path\to\external\chroma_db\chroma.sqlite3" -Destination ".\chroma_db\chroma.sqlite3.external.backup"

# 2. Use SQLite merge (requires SQLite CLI knowledge)
# This is complex - recommended only if you're familiar with SQL
```

---

## Step-by-Step Safe Copy Process

### **Complete Workflow:**

```
1. STOP APPLICATION
   └─ Close Streamlit

2. BACKUP CURRENT STATE
   └─ Copy entire ./chroma_db to ./chroma_db.backup_YYYYMMDD_HHMMSS

3. IDENTIFY EXTERNAL COLLECTIONS
   └─ Determine which collection UUID folders to copy

4. COPY EXTERNAL COLLECTION FOLDERS
   └─ Copy only UUID folders (NOT chroma.sqlite3)
   └─ Verify no naming conflicts
   └─ Skip if a folder with the same UUID already exists

5. HANDLE chroma.sqlite3 (Phase 3)
   └─ Test on a copy of ./chroma_db first
   └─ Option A: delete ./chroma_db/chroma.sqlite3 and restart
   └─ OR Option B: merge the two SQLite files

6. START APPLICATION
   └─ streamlit run streamlit_app.py

7. VERIFY IN UI
   └─ Check "Existing Collections" dropdown
   └─ Should show original + new external collections
   └─ If any are missing, restore from backup

8. TEST COLLECTIONS
   └─ Load each collection
   └─ Run test queries
   └─ Verify retrieval works

9. CLEANUP (Optional)
   └─ Delete backup after verification
```

---

## Files to Copy Summary

| File/Folder | Copy? | Reason | Notes |
|------------|-------|--------|-------|
| `chroma.sqlite3` | ❌ NO | Overwrites your collection registry | Handle separately (Phase 3) |
| UUID folders | ✅ YES | Collection index data | Copy all new collections |
| Other files | ❓ MAYBE | System files | Only if present in external |

---

## What NOT to Copy

❌ **Do NOT copy:**
- `chroma.sqlite3` directly
- System/temporary files
- Old backup files from external ChromaDB
- Configuration files from external project

---

## Verification Checklist

After merge, verify:

- ✅ Streamlit starts without errors
- ✅ Old collections still appear in dropdown
- ✅ New collections appear in dropdown
- ✅ Can load any collection without error
- ✅ Can query and retrieve documents
- ✅ Retrieved documents have correct embeddings
- ✅ Evaluation runs without errors
- ✅ chroma.sqlite3 file exists and is up-to-date

---

## Troubleshooting

### **Problem: New collections don't appear**

**Solution:**

```powershell
# Only after confirming the backup exists: delete sqlite3 and restart
Remove-Item -Path ".\chroma_db\chroma.sqlite3" -Force
# Restart Streamlit
```

If the collections still do not appear, the copied folders are not registered in the new database. Restore from backup (see the next problem) and merge the SQLite files instead (Phase 3, Option B).

### **Problem: Old collections disappeared**

**Restore from backup:**

```powershell
$timestamp = "YYYYMMDD_HHMMSS"  # Use your backup timestamp
Remove-Item -Path ".\chroma_db" -Recurse -Force
# Copy rather than rename, so the backup stays available for another restore
Copy-Item -Path ".\chroma_db.backup_$timestamp" -Destination ".\chroma_db" -Recurse
```

### **Problem: Collection name conflicts**

**Resolution:**

```powershell
# Renaming the external collection folder before copying does not help:
# UUID folders are referenced internally, so renaming a folder
# requires updating chroma.sqlite3 (complex)
# BETTER: Use a different collection name
# In your project, import the external collection under a new name
```

### **Problem: File permission errors**

**Solution:**

```powershell
# Run PowerShell as Administrator
# Or check if files are locked by Streamlit process
# Restart PowerShell in admin mode:
Start-Process powershell -Verb RunAs
```

---

## Safe Copy Command (Ready to Use)

**For copying external collections safely:**

```powershell
# Set paths
$externalPath = "C:\path\to\external\chroma_db"  # Update this
$projectPath = "d:\CapStoneProject\RAG Capstone Project"
$targetPath = "$projectPath\chroma_db"

# Backup current
$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
Copy-Item -Path $targetPath -Destination "$projectPath\chroma_db.backup_$timestamp" -Recurse
Write-Host "✅ Backup created: chroma_db.backup_$timestamp"

# Copy external collections
$count = 0
Get-ChildItem -Path $externalPath -Directory |
    Where-Object {$_.Name -match "^[a-f0-9\-]{36}$"} |
    ForEach-Object {
        $destFolder = Join-Path $targetPath $_.Name
        if (-not (Test-Path $destFolder)) {
            Copy-Item -Path $_.FullName -Destination $destFolder -Recurse -Force
            $count++
            Write-Host "✅ Copied: $($_.Name)"
        } else {
            Write-Host "⏭️ Skipped: $($_.Name) (already exists)"
        }
    }

Write-Host ""
Write-Host "✅ Copy complete! Copied $count new collections"
Write-Host "Next: Handle chroma.sqlite3 as described in Phase 3, then restart the application"
```

---

## Summary

**To safely merge with Direct Directory Copy:**

1. ✅ **Backup** your current `./chroma_db`
2. ✅ **Copy** only external collection UUID folders
3. ❌ **DO NOT copy** `chroma.sqlite3`
4. ✅ **Recreate or merge** `chroma.sqlite3` as described in Phase 3, testing on a copy first
5. ✅ **Restart** application
6. ✅ **Verify** all collections appear

**Risk Level:** 🟢 **Low** for your existing data, provided the backup is verified before `chroma.sqlite3` is touched

Your current collections are **protected** because:
- You back up before starting
- You don't overwrite your sqlite3 with the external one
- You test the sqlite3 step on a copy before applying it
- You can restore from backup anytime
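
Since every protection listed above depends on the backup being complete, it can be worth checking the backup file by file rather than by folder count. The sketch below is one way to do that in Python using only the standard library. The function names `file_digest` and `compare_dirs` and the backup path are illustrative and not part of ChromaDB.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_dirs(source, backup):
    """Return relative paths that are missing from, or differ in, the backup."""
    source, backup = Path(source), Path(backup)
    problems = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src_file.relative_to(source)
        dst_file = backup / rel
        if not dst_file.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src_file) != file_digest(dst_file):
            problems.append(f"differs: {rel.as_posix()}")
    return problems


if __name__ == "__main__":
    # Hypothetical backup name; substitute your own timestamp.
    live = Path("./chroma_db")
    copy = Path("./chroma_db.backup_YYYYMMDD_HHMMSS")
    if live.is_dir() and copy.is_dir():
        issues = compare_dirs(live, copy)
        print("Backup matches" if not issues else "\n".join(issues))
```

Run the check while the application is stopped, so that no file changes between the copy and the comparison. An empty result means every file in `./chroma_db` has an identical counterpart in the backup.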