
Merging External ChromaDB Collections

Overview

This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance.

Prerequisites

  1. Source ChromaDB: The external collection must be accessible
  2. Target ChromaDB: Your project's ChromaDB (located at ./chroma_db by default)
  3. Matching Embedding Model: Both collections should use the same embedding model, because vectors produced by different models are not comparable even when their dimensions match
  4. ChromaDB Version Compatibility: Ensure both are using compatible ChromaDB versions

Step-by-Step Merge Process

Step 1: Identify Collection Information

From the External ChromaDB:

- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?

From Your Project:

- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
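Most of the fields above can be read from the database itself. The following is a minimal sketch, assuming the `chromadb` Python package with its `PersistentClient` API (ChromaDB 0.4 or later). `summarize_collection` and `list_summaries` are hypothetical helper names, and details such as the embedding model or chunk size only show up if whoever built the collection stored them in its metadata.

```python
def summarize_collection(collection):
    """Return name, record count, metadata and embedding dimension."""
    sample = collection.get(limit=1, include=["embeddings"])
    embeddings = sample.get("embeddings")
    # Embeddings may come back as a list or a NumPy array, so avoid truthiness checks.
    has_sample = embeddings is not None and len(embeddings) > 0
    return {
        "name": collection.name,
        "count": collection.count(),
        "metadata": dict(collection.metadata or {}),
        "embedding_dim": len(embeddings[0]) if has_sample else None,
    }


def list_summaries(db_path):
    """Summarize every collection stored under db_path."""
    import chromadb  # imported lazily so summarize_collection works without it

    client = chromadb.PersistentClient(path=db_path)
    summaries = []
    for entry in client.list_collections():
        # Some ChromaDB releases return names here, others Collection objects.
        name = entry if isinstance(entry, str) else entry.name
        summaries.append(summarize_collection(client.get_collection(name)))
    return summaries


# Example (paths are placeholders):
# for row in list_summaries("/path/to/external/chroma_db"):
#     print(row)
```

Running `list_summaries` against both the external path and `./chroma_db` fills in the two "?" entries above. An empty result can also mean the path is wrong, since the client may create a fresh database at a path that does not exist yet.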

Step 2: Verify Embedding Model Compatibility

Check if the external collection's embedding model is available in your project:

From config.py, the available embedding models are:

- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001

If NOT in the list:

  • Add the external embedding model to config.py embedding_models list
  • Or re-embed all documents with a compatible model (more complex)
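The check in this step is easy to script. A minimal sketch follows; `AVAILABLE_MODELS` is copied by hand from the list above (in the project it would presumably come from the `embedding_models` list in config.py), and `check_embedding_model` is a hypothetical helper name.

```python
AVAILABLE_MODELS = [
    "sentence-transformers/all-mpnet-base-v2",
    "emilyalsentzer/Bio_ClinicalBERT",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/multilingual-MiniLM-L12-v2",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "allenai/specter",
    "gemini-embedding-001",
]


def check_embedding_model(model_name, available=AVAILABLE_MODELS):
    """Return (ok, message) for an external collection's embedding model."""
    if model_name in available:
        return True, f"{model_name} is already configured"
    # Catch near-misses such as a missing organization prefix or different casing.
    short = model_name.split("/")[-1].lower()
    close = [m for m in available if m.split("/")[-1].lower() == short]
    if close:
        return False, f"{model_name} not found; did you mean {close[0]}?"
    return False, f"{model_name} not found; add it to config.py or re-embed"


# Example:
# ok, message = check_embedding_model("all-mpnet-base-v2")
# print(ok, message)
```

A name match is necessary but not sufficient: it is still worth confirming that the stored vectors have the dimension the model produces (768 for all-mpnet-base-v2).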

Step 3: Prepare the External Collection Data

Option A: Direct Copy of ChromaDB Directory (Fastest)

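1. Locate the external ChromaDB directory
2. Copy its contents into your ./chroma_db directory
3. Restart the app so ChromaDB loads the copied data

Note: this is only safe when the target directory is empty or is being replaced wholesale. In ChromaDB 0.4 and later, a persistent directory keeps its collection index in a single chroma.sqlite3 file, so copying one database over another overwrites that file and the target's existing collections are lost.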

Directory structure (illustrative; the exact layout depends on the ChromaDB version):
  ./chroma_db/
    ├── 0/
    │   ├── data/
    │   │   ├── documents.parquet
    │   │   ├── embeddings.parquet
    │   │   └── metadatas.parquet
    │   └── chroma.sqlite3

Option B: Export and Re-import (Recommended)

Extract all documents, embeddings, and metadata from the external collection, then import them into your collection.


Implementation Approaches

Approach 1: Manual Directory Merge

Steps:

  1. Stop the project (stop Streamlit app)
  2. Back up your current ./chroma_db directory
  3. Copy external collection files to ./chroma_db
  4. Restart the project
  5. Verify collection appears in "Existing Collections" dropdown

Pros: Fast, preserves embeddings
Cons: Risk of conflicts if the same collection name exists; copying over a non-empty ./chroma_db can overwrite its existing collections


Approach 2: Programmatic Merge (Recommended)

High-level process:

1. Connect to external ChromaDB
   ├─ Load external collection
   ├─ Extract all documents, embeddings, and metadata

2. Prepare target ChromaDB
   ├─ Create/get target collection in your project
   ├─ Match embedding model and metadata

3. Transfer documents
   ├─ Batch transfer documents to target collection
   ├─ Verify all documents transferred
   ├─ Handle duplicates (if any)

4. Verify merge
   ├─ Count documents match
   ├─ Test retrieval works
   ├─ Validate embeddings are correct
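The transfer step of this process could look roughly like the sketch below. It assumes `source` and `target` are ChromaDB `Collection` objects and relies only on their `get` (with `limit`, `offset`, `include`) and `add` methods. `iter_pages` and `merge_collection` are hypothetical helper names, and the duplicate rule (skip any id that already exists in the target) is one possible policy, not the only one.

```python
def iter_pages(collection, batch_size=500):
    """Yield pages of ids, documents, embeddings and metadata."""
    offset = 0
    while True:
        page = collection.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "embeddings", "metadatas"],
        )
        if len(page["ids"]) == 0:
            break
        yield page
        offset += len(page["ids"])


def merge_collection(source, target, batch_size=500):
    """Copy records from source into target, skipping ids already present.

    Returns a (copied, skipped) tuple.
    """
    copied = skipped = 0
    for page in iter_pages(source, batch_size):
        # Ask the target which of this page's ids it already holds.
        existing = set(target.get(ids=page["ids"], include=[])["ids"])
        keep = [i for i, rid in enumerate(page["ids"]) if rid not in existing]
        skipped += len(page["ids"]) - len(keep)
        if not keep:
            continue
        target.add(
            ids=[page["ids"][i] for i in keep],
            documents=[page["documents"][i] for i in keep],
            # Convert to plain floats in case embeddings arrive as NumPy rows.
            embeddings=[[float(x) for x in page["embeddings"][i]] for i in keep],
            metadatas=[page["metadatas"][i] for i in keep],
        )
        copied += len(keep)
    return copied, skipped


# Example (names and paths are placeholders):
# import chromadb
# src = chromadb.PersistentClient(path="/path/to/external/chroma_db")
# dst = chromadb.PersistentClient(path="./chroma_db")
# source = src.get_collection("medical_docs_dense_mpnet")
# target = dst.get_or_create_collection("medical_docs_dense_mpnet_imported")
# print(merge_collection(source, target))
```

Because stored embeddings are passed through unchanged, no embedding model is loaded during the copy. The target collection should be created with the same distance metric as the source, otherwise retrieval scores will not be comparable.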

Approach 3: Using ChromaDB Export/Import

Steps:

  1. Export from external ChromaDB:

    - Get all collections
    - For each collection:
      * Get collection metadata
      * Export all documents + embeddings + metadata
      * Save to JSON/Parquet files
    
  2. Import to your ChromaDB:

    - Create new collection with same metadata
    - Add documents + embeddings + metadata in batches
    - Verify document count and samples
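A file-based export and import along these lines might be sketched as follows. The sketch writes JSON Lines (one JSON record per line) rather than a single JSON document so that large collections never have to fit in memory at once; that format choice, and the helper names `export_collection` and `import_collection`, are assumptions of this example. The collection objects are assumed to expose ChromaDB's `get` and `add` methods.

```python
import json


def export_collection(collection, out_path, batch_size=500):
    """Write every record of a collection to a JSON Lines file."""
    written = offset = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        while True:
            page = collection.get(
                limit=batch_size,
                offset=offset,
                include=["documents", "embeddings", "metadatas"],
            )
            if len(page["ids"]) == 0:
                break
            for i, rid in enumerate(page["ids"]):
                record = {
                    "id": rid,
                    "document": page["documents"][i],
                    # Plain floats keep the record JSON-serializable.
                    "embedding": [float(x) for x in page["embeddings"][i]],
                    "metadata": page["metadatas"][i],
                }
                fh.write(json.dumps(record) + "\n")
            written += len(page["ids"])
            offset += len(page["ids"])
    return written


def import_collection(collection, in_path, batch_size=500):
    """Load a JSON Lines export into a collection in batches."""

    def flush(batch):
        collection.add(
            ids=[r["id"] for r in batch],
            documents=[r["document"] for r in batch],
            embeddings=[r["embedding"] for r in batch],
            metadatas=[r["metadata"] for r in batch],
        )
        return len(batch)

    loaded, batch = 0, []
    with open(in_path, encoding="utf-8") as fh:
        for line in fh:
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                loaded += flush(batch)
                batch = []
    if batch:
        loaded += flush(batch)
    return loaded
```

The intermediate file is also what makes this route useful across incompatible ChromaDB versions: the export can run in one Python environment and the import in another.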
    

Handling Potential Issues

Issue 1: Different Embedding Models

Problem: External collection uses embedding model not in your project

Solution:

  • Option A: Add model to config.py and ensure it's installed
  • Option B: Re-embed with a compatible model (requires space and time)
  • Option C: Use Gemini API for embeddings if configured
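Option B can be sketched as a copy loop that discards the stored vectors and computes new ones. In this sketch `reembed_collection` is a hypothetical helper and `embed_fn` is any callable that maps a list of strings to a list of vectors; with the sentence-transformers package, for instance, `lambda texts: model.encode(texts).tolist()` on a loaded `SentenceTransformer` would fit. It only works if the external collection stored the document text alongside the embeddings.

```python
def reembed_collection(source, target, embed_fn, batch_size=200):
    """Copy documents and metadata, computing fresh embeddings with embed_fn."""
    offset = moved = 0
    while True:
        page = source.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "metadatas"],  # old embeddings are not needed
        )
        ids = page["ids"]
        if len(ids) == 0:
            break
        if any(doc is None for doc in page["documents"]):
            raise ValueError("some records have no stored text to re-embed")
        vectors = embed_fn(page["documents"])
        target.add(
            ids=ids,
            documents=page["documents"],
            embeddings=vectors,
            metadatas=page["metadatas"],
        )
        offset += len(ids)
        moved += len(ids)
    return moved
```

Re-embedding 5000 chunks is usually a matter of minutes on a CPU with a small model, but larger collections or API-based models such as Gemini add time and cost, which is why adding the original model to config.py is listed first.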

Issue 2: Duplicate Collection Names

Problem: External collection has same name as existing collection

Solution:

  • Rename the external collection before importing
  • Or merge into existing collection (combines data)
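For the rename route, a small helper can pick a free name automatically. This is a sketch only: `unique_collection_name` is a hypothetical function, and the `_imported` suffix is simply the convention used in this guide's example data flow.

```python
def unique_collection_name(name, existing, suffix="_imported"):
    """Return name unchanged if it is free, otherwise a suffixed variant."""
    taken = set(existing)
    if name not in taken:
        return name
    candidate = f"{name}{suffix}"
    counter = 2
    # Keep numbering until a name is found that the target does not use yet.
    while candidate in taken:
        candidate = f"{name}{suffix}_{counter}"
        counter += 1
    return candidate


# Example:
# unique_collection_name("medical_docs_dense_mpnet", ["medical_docs_dense_mpnet"])
# gives "medical_docs_dense_mpnet_imported"
```

ChromaDB also enforces its own rules on collection names (length and allowed characters), which this helper does not check.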

Issue 3: Different ChromaDB Versions

Problem: External ChromaDB version incompatible with project

Solution:

  • Export to common format (JSON/CSV)
  • Re-import with compatible ChromaDB version
  • Update ChromaDB: pip install --upgrade chromadb
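Before choosing between these options it helps to know which versions are actually involved. A minimal sketch using only the standard library follows; `installed_version` and `same_minor_release` are hypothetical helpers, and treating a matching major.minor pair as "probably compatible" is a rough heuristic of this example, not a guarantee made by ChromaDB.

```python
from importlib import metadata


def installed_version(package="chromadb"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_minor_release(version_a, version_b):
    """Compare the first two components, e.g. '0.4.22' and '0.4.24' match."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


# Example: run in both environments and compare the printed values.
# print(installed_version())
```

If the two environments report versions from different release lines, the export and re-import route is the safer choice.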

Issue 4: Metadata Mismatch

Problem: External collection metadata schema different from project

Solution:

  • Map external metadata to project metadata structure
  • Add missing fields (chunking_strategy, chunk_size, etc.)
  • Preserve original metadata for reference
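The mapping described above might look like the sketch below. The field names in `PROJECT_DEFAULTS` are taken from the items listed in Step 1 and are an assumption about the project's schema; `map_metadata` and the `orig_` prefix are hypothetical. Original values are kept as flat prefixed keys because ChromaDB metadata values must be simple scalars (strings, numbers, booleans), not nested dictionaries.

```python
PROJECT_DEFAULTS = {
    "chunking_strategy": "unknown",
    "chunk_size": 0,
    "chunk_overlap": 0,
    "embedding_model": "unknown",
}


def map_metadata(external, rename=None, defaults=PROJECT_DEFAULTS):
    """Map one metadata dict onto the project's schema.

    rename maps external key names to project key names.
    """
    rename = rename or {}
    external = external or {}
    mapped = {}
    for key, value in external.items():
        if key in rename:
            mapped[rename[key]] = value
            mapped[f"orig_{key}"] = value  # keep the original for reference
        else:
            mapped[key] = value
    # Fill in any project field the external data did not provide.
    for key, value in defaults.items():
        mapped.setdefault(key, value)
    return mapped


# Example:
# map_metadata({"size": 512, "source": "a.pdf"}, rename={"size": "chunk_size"})
```

Applying this to each record's metadata during the batch transfer keeps the imported data queryable with the same filters as the project's own collections.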

Verification Checklist

After merging, verify:

  • ✅ Collection appears in "Existing Collections" dropdown in Streamlit
  • ✅ Can load collection without errors
  • ✅ Document count matches expected total
  • ✅ Can query and retrieve documents (test with sample question)
  • ✅ Retrieved documents have correct embeddings
  • ✅ Metadata is preserved correctly
  • ✅ Evaluation metrics run without errors on merged collection
  • ✅ Both original and imported documents retrieve with correct distances
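The count and embedding items on this checklist can be automated. The sketch below assumes `source` and `target` are ChromaDB `Collection` objects and uses only their `count` and `get` methods; `verify_merge`, the sample size, and the tolerance are choices made for this example. The Streamlit dropdown and the evaluation metrics still need a manual check.

```python
def verify_merge(source, target, sample_size=20, tolerance=1e-6):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    src_count = source.count()
    tgt_count = target.count()
    # The target may legitimately hold more records if it existed before the merge.
    if tgt_count < src_count:
        problems.append(f"target has {tgt_count} records, source has {src_count}")

    sample = source.get(limit=sample_size, include=["documents", "embeddings"])
    ids = list(sample["ids"])
    if not ids:
        return problems

    found = target.get(ids=ids, include=["documents", "embeddings"])
    # Results are matched by id because the return order is not guaranteed.
    position = {rid: j for j, rid in enumerate(found["ids"])}
    for i, rid in enumerate(ids):
        if rid not in position:
            problems.append(f"missing id {rid}")
            continue
        j = position[rid]
        if sample["documents"][i] != found["documents"][j]:
            problems.append(f"document text differs for {rid}")
        a, b = sample["embeddings"][i], found["embeddings"][j]
        if len(a) != len(b) or any(abs(x - y) > tolerance for x, y in zip(a, b)):
            problems.append(f"embedding differs for {rid}")
    return problems
```

A retrieval check can follow the same pattern: querying the target with a stored embedding from the source should return that record's id as the nearest hit, unless the collection contains exact duplicates.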

Quick Reference: Manual Merge Steps

If the external collection is already in ChromaDB format:

  1. Back up your current ChromaDB directory:

    cp -r ./chroma_db ./chroma_db.backup
    
  2. Find external ChromaDB location:

    /path/to/external/chroma_db
    
  3. Copy collection files:

    Copy everything from /path/to/external/chroma_db to ./chroma_db
    
  4. Restart Streamlit:

    streamlit run streamlit_app.py
    
  5. Check Collections dropdown:

    • External collection should now appear
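For setups where `cp` is unavailable, or where several backups should coexist, the backup step can also be done from Python. This is a sketch using only the standard library; `backup_directory` and the timestamped naming scheme are choices made for this example.

```python
import shutil
import time
from pathlib import Path


def backup_directory(db_path="./chroma_db"):
    """Copy db_path to a timestamped sibling directory and return its path."""
    source = Path(db_path)
    if not source.is_dir():
        raise FileNotFoundError(f"{source} is not a directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = source.with_name(f"{source.name}.backup-{stamp}")
    # copytree refuses to overwrite an existing destination, which protects older backups.
    shutil.copytree(source, destination)
    return destination


# Example:
# print(backup_directory("./chroma_db"))
```

As with the manual steps, the Streamlit app should be stopped first so that the database files are not being written while they are copied.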

Recommended Merge Approach for Your Project

Best Practice: Programmatic Approach

  1. List all external collections → identify which to merge
  2. For each external collection:
    • Export metadata (embedding model, chunking strategy, etc.)
    • Get all documents and embeddings
    • Create target collection in your project with matching metadata
    • Batch insert documents in groups of 100-1000
  3. Validate: Test retrieval on merged collection
  4. Archive: Keep backup of external ChromaDB

Why This Approach?

  • ✅ Safe (no direct file manipulation)
  • ✅ Controllable (can inspect data during transfer)
  • ✅ Traceable (logs what was merged)
  • ✅ Flexible (can transform data if needed)
  • ✅ Recoverable (original external collection untouched)

Example Data Flow

External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
    ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
    ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓

Summary

Step  | Action                   | Time      | Complexity
1     | Identify collection info | 5 min     | Low
2     | Verify embedding model   | 5 min     | Low
3     | Backup current data      | 5 min     | Low
4     | Perform merge            | 10-30 min | Medium
5     | Verify merge success     | 10 min    | Medium
Total | Complete merge           | 35-55 min | Medium

Next Steps

Gather the following before writing a merge script:

  1. External ChromaDB path: Where is the external ChromaDB located?
  2. Collection name: What's the external collection called?
  3. Embedding model: Which embedding model does it use?
  4. Document count: Approximately how many documents?
  5. Metadata: What metadata is stored (chunking strategy, chunk size, etc.)?

With these details, the merge script and the guidance above can be tailored to the exact scenario.