
ChromaDB Collection Recovery Guide

Problem

After deleting chroma.sqlite3:

  • ✅ Collection UUID folders still exist with all data files
  • ✅ chroma.sqlite3 was recreated automatically
  • ❌ Collections don't appear in the dropdown
  • ❌ ChromaDB can't see the collections

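Root Cause: The new chroma.sqlite3 is empty, so it has no record of which collections exist. ChromaDB does not scan for existing collection folders; it only knows about collections registered in chroma.sqlite3. In ChromaDB 0.4 and later, chroma.sqlite3 also stores the document text and metadata, while the UUID folders hold only the vector index files, so the folders alone do not contain everything needed to rebuild a collection.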
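The mismatch described above can be checked from Python using only the standard library. This is a minimal sketch, not part of ChromaDB's public API. It assumes the on-disk layout of ChromaDB 0.4.x, where chroma.sqlite3 contains a `segments` table with an `id` column and each UUID folder is named after a segment id. The function names `uuid_dirs`, `registered_segment_ids`, and `orphaned_segments` are hypothetical; if your version uses a different schema, the query needs adjusting.

```python
import sqlite3
import uuid
from pathlib import Path


def is_uuid(name):
    """True if `name` is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(name)) == name.lower()
    except ValueError:
        return False


def uuid_dirs(db_dir):
    """Names of sub-folders of db_dir that look like UUIDs."""
    return {p.name for p in Path(db_dir).iterdir() if p.is_dir() and is_uuid(p.name)}


def registered_segment_ids(sqlite_path):
    """Segment ids recorded in chroma.sqlite3 (empty set if file or table is missing)."""
    sqlite_path = Path(sqlite_path)
    if not sqlite_path.exists():
        return set()
    # Open read-only so the diagnostic cannot modify the database.
    conn = sqlite3.connect(sqlite_path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        # ASSUMPTION: `segments` table with an `id` column (ChromaDB 0.4.x layout).
        rows = conn.execute("SELECT id FROM segments").fetchall()
    except sqlite3.OperationalError:
        rows = []
    finally:
        conn.close()
    return {row[0] for row in rows}


def orphaned_segments(db_dir):
    """UUID folders on disk that chroma.sqlite3 does not reference."""
    db_dir = Path(db_dir)
    return sorted(uuid_dirs(db_dir) - registered_segment_ids(db_dir / "chroma.sqlite3"))


if __name__ == "__main__":
    for name in orphaned_segments("./chroma_db"):
        print("orphaned:", name)
```

If every UUID folder is reported as orphaned, the situation matches the problem described in this guide: the folders exist, but the recreated chroma.sqlite3 does not reference them.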


Solution 1: Restore from Backup (EASIEST) ⭐

If you created a backup before deleting sqlite3:

Step 1: Locate Backup

# List backups
Get-ChildItem ".\chroma_db.backup_*" -Directory

# Find the most recent one
Get-ChildItem ".\chroma_db.backup_*" -Directory | Sort-Object LastWriteTime -Descending | Select-Object -First 1

Step 2: Restore Backup

# Stop Streamlit (Ctrl+C)

# Resolve the most recent backup first, so nothing is deleted if none exists
$latestBackup = Get-ChildItem ".\chroma_db.backup_*" -Directory | Sort-Object LastWriteTime -Descending | Select-Object -First 1
if (-not $latestBackup) { throw "No chroma_db.backup_* folder found" }

# Remove current chroma_db
Remove-Item -Path ".\chroma_db" -Recurse -Force

# Restore from backup
Copy-Item -Path $latestBackup.FullName -Destination ".\chroma_db" -Recurse

# Restart Streamlit
streamlit run streamlit_app.py
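For environments without PowerShell, the same restore can be sketched in Python with the standard library. The helper names `latest_backup` and `restore_latest_backup` are hypothetical, and the sketch assumes the backups sit next to chroma_db and follow the chroma_db.backup_* naming used in this guide. Stop Streamlit before running it.

```python
import shutil
from pathlib import Path


def latest_backup(project_dir, pattern="chroma_db.backup_*"):
    """Return the most recently modified backup folder, or None if there is none."""
    backups = [p for p in Path(project_dir).glob(pattern) if p.is_dir()]
    return max(backups, key=lambda p: p.stat().st_mtime, default=None)


def restore_latest_backup(project_dir, target_name="chroma_db"):
    """Replace <project_dir>/chroma_db with a copy of the newest backup."""
    project_dir = Path(project_dir)
    backup = latest_backup(project_dir)
    if backup is None:
        # Fail before deleting anything.
        raise FileNotFoundError("no chroma_db.backup_* folder found")
    target = project_dir / target_name
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(backup, target)
    return backup


if __name__ == "__main__":
    print("restored from", restore_latest_backup("."))
```

The backup is located before the current folder is removed, so a missing backup raises an error instead of leaving the project with no chroma_db at all.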

Step 3: Verify

Collections should now appear in the dropdown ✅


Solution 2: Manually Rebuild SQLite Index (COMPLEX)

This requires directly using ChromaDB's internal APIs. Not recommended unless you're comfortable with Python.

Why it's complex:

  • ChromaDB uses internal data structures
  • Need to parse collection folder structure
  • No public API to bulk import without re-embedding
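A first step in any manual attempt is to see what is physically on disk. The sketch below only lists each sub-folder of chroma_db with its file names and sizes; it does not interpret the files or rebuild anything. The function names `summarize_folder` and `summarize_chroma_dir` are hypothetical, and no particular set of file names is assumed.

```python
from pathlib import Path


def summarize_folder(folder):
    """Map each file name directly inside `folder` to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(folder).iterdir()) if p.is_file()}


def summarize_chroma_dir(db_dir):
    """Summarize every sub-folder of db_dir, keyed by folder name."""
    return {d.name: summarize_folder(d) for d in sorted(Path(db_dir).iterdir()) if d.is_dir()}


if __name__ == "__main__":
    for folder, files in summarize_chroma_dir("./chroma_db").items():
        print(folder)
        for name, size in files.items():
            print(f"  {name}: {size} bytes")
```

Empty or very small folders suggest there is little to recover, which is a useful signal before investing time in a manual rebuild.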

Solution 3: Accept the Current State and Move Forward

Since the collections are lost from sqlite3's index:

Option A: Re-create Collections from Scratch

  1. Delete ./chroma_db completely
  2. Use Streamlit UI to create new collections
  3. This is clean and ensures everything is consistent

Option B: Try ChromaDB Reset

# Stop Streamlit (Ctrl+C)

# Delete chroma_db completely
Remove-Item -Path ".\chroma_db" -Recurse -Force

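# Optional: delete the user-level Streamlit folder
# Note: this folder also holds config.toml and credentials.toml, not only cached data
Remove-Item -Path "$env:USERPROFILE\.streamlit" -Recurse -Force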

# Restart
streamlit run streamlit_app.py

# Create new collections using UI

Solution 4: Check Backup Directory

Step 1: List All Backups

cd "d:\CapStoneProject\RAG Capstone Project"
Get-ChildItem -Filter "chroma_db.backup_*" -Directory | Select-Object Name, LastWriteTime

Step 2: Check If Backup Has Collections

# List collections in a specific backup
$backupPath = ".\chroma_db.backup_20251220_083000"
Get-ChildItem -Path $backupPath -Directory | Where-Object {$_.Name -match "^[a-f0-9\-]{36}$"} | Measure-Object

Step 3: Restore That Backup

# Stop Streamlit
# Remove current
Remove-Item -Path ".\chroma_db" -Recurse -Force
# Restore backup
Copy-Item -Path ".\chroma_db.backup_20251220_083000" -Destination ".\chroma_db" -Recurse
# Restart Streamlit

Why This Happens

ChromaDB Architecture:

chroma.sqlite3 (Metadata Index)
├── Collection 1 metadata
├── Collection 2 metadata
└── Collection 3 metadata
     ↓ (references)
./chroma_db/
├── UUID-folder-1/ (actual data files)
├── UUID-folder-2/ (actual data files)
└── UUID-folder-3/ (actual data files)

When you delete chroma.sqlite3:

  • ✅ UUID folders remain (the vector index files are untouched)
  • ❌ Index is gone (relationships are broken)
  • ❌ ChromaDB rebuilds an empty chroma.sqlite3
  • ❌ The new file has no reference to the UUID folders

Prevention for Next Time

Don't Just Delete sqlite3

Instead, let ChromaDB handle cleanup properly:

# WRONG - causes this issue:
Remove-Item -Path ".\chroma_db\chroma.sqlite3" -Force

# RIGHT - use ChromaDB API:
# (See below)

Use Proper Reset Method

Create a reset_chromadb.py script:

import chromadb
from chromadb.config import Settings

def reset_chromadb(keep_data=False):
    """Properly reset ChromaDB by deleting collections through the client API."""
    client = chromadb.PersistentClient(
        path="./chroma_db",
        settings=Settings(
            anonymized_telemetry=False,
            allow_reset=True
        )
    )

    if keep_data:
        print("⚠️  Manual data recovery needed - see docs/CHROMADB_RECOVERY.md")
    else:
        print("🔄 Resetting ChromaDB (will delete all collections)...")
        try:
            # Delete all collections properly
            for collection in client.list_collections():
                # Some ChromaDB releases return plain names here instead of Collection objects
                name = collection if isinstance(collection, str) else collection.name
                client.delete_collection(name)
            print("✅ ChromaDB reset successfully")
        except Exception as e:
            print(f"❌ Error: {e}")

if __name__ == "__main__":
    reset_chromadb()
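Before running a reset like the one above, or any other change to chroma_db, a timestamped copy makes Solution 1 available later. The following is a minimal sketch using only the standard library. The function name `backup_chroma_db` and its `now` parameter are hypothetical; the folder name follows the chroma_db.backup_YYYYMMDD_HHMMSS pattern used elsewhere in this guide. Stop Streamlit first so chroma.sqlite3 is not copied mid-write.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_chroma_db(db_dir="./chroma_db", now=None):
    """Copy db_dir to a sibling folder named <name>.backup_YYYYMMDD_HHMMSS."""
    src = Path(db_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    # `now` can be injected for testing; defaults to the current local time.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    dest = src.with_name(f"{src.name}.backup_{stamp}")
    shutil.copytree(src, dest)
    return dest


if __name__ == "__main__":
    print("backup written to", backup_chroma_db())
```

Because the copy includes both chroma.sqlite3 and the UUID folders, restoring it brings back the index and the data it references together.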

Immediate Action Plan

Choose one:

Option 1 (Fastest): If you have a backup

# Restore backup
# Restart app

Option 2 (Clean restart): If no backup or backup damaged

# Delete entire chroma_db
# Restart Streamlit
# Create new collections using UI

Option 3 (Keep trying): For debugging

# Try Solution 2 (complex recovery)
# Run recover_collections.py for diagnostics

Files Provided

  1. recover_collections.py - Diagnostic script (tells you what's recoverable)
  2. This guide - Recovery procedures

Bottom Line

The safest approach: Use a backup or start fresh with new collections.

To proceed:

  1. Do you have a chroma_db.backup_* folder? If yes, restore it
  2. If not, delete ./chroma_db and recreate the collections
  3. Always back up chroma_db before making changes to it

Let me know which option you want to pursue! 🛠️