Merging External ChromaDB Collections
Overview
This guide explains how to merge a ChromaDB collection created outside your project into the ChromaDB instance used by your RAG Capstone project.
Prerequisites
- Source ChromaDB: The external collection must be accessible
- Target ChromaDB: Your project's ChromaDB (located at ./chroma_db by default)
- Matching Embedding Model: Both collections should use the same embedding model for consistency
- ChromaDB Version Compatibility: Ensure both are using compatible ChromaDB versions
Step-by-Step Merge Process
Step 1: Identify Collection Information
From the External ChromaDB:
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
From Your Project:
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
Step 2: Verify Embedding Model Compatibility
Check if the external collection's embedding model is available in your project:
From config.py, the available embedding models are:
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
If NOT in the list:
- Add the external embedding model to the embedding_models list in config.py
- Or re-embed all documents with a compatible model (more complex)
Step 3: Prepare the External Collection Data
Option A: Direct Copy of ChromaDB Directory (Fastest)
1. Locate external ChromaDB directory structure
2. Copy the external collection files to your ./chroma_db directory
3. ChromaDB will recognize and load them
Directory structure:

    ./chroma_db/
    └── 0/
        ├── data/
        │   ├── documents.parquet
        │   ├── embeddings.parquet
        │   └── metadatas.parquet
        └── chroma.sqlite3

(This parquet-based layout matches older ChromaDB releases; recent versions keep chroma.sqlite3 at the top level with one UUID-named directory per collection segment, which is why direct copies across versions can fail.)
Option B: Export and Re-import (Recommended)
Extract all documents and metadata from the external collection, then import them into your collection.
Implementation Approaches
Approach 1: Manual Directory Merge
Steps:
1. Stop the project (stop the Streamlit app)
2. Back up your current ./chroma_db directory
3. Copy the external collection files into ./chroma_db
4. Restart the project
5. Verify the collection appears in the "Existing Collections" dropdown

Pros: Fast, preserves embeddings
Cons: Risk of conflicts if a collection with the same name already exists
Approach 2: Programmatic Merge (Recommended)
High-level process:
1. Connect to external ChromaDB
   ├── Load external collection
   └── Extract all documents, embeddings, and metadata
2. Prepare target ChromaDB
   ├── Create/get target collection in your project
   └── Match embedding model and metadata
3. Transfer documents
   ├── Batch transfer documents to target collection
   ├── Verify all documents transferred
   └── Handle duplicates (if any)
4. Verify merge
   ├── Document counts match
   ├── Test that retrieval works
   └── Validate embeddings are correct
Approach 3: Using ChromaDB Export/Import
Steps:
Export from external ChromaDB:
- Get all collections
- For each collection:
  - Get collection metadata
  - Export all documents + embeddings + metadata
  - Save to JSON/Parquet files

Import to your ChromaDB:
- Create a new collection with the same metadata
- Add documents + embeddings + metadata in batches
- Verify document count and samples
Handling Potential Issues
Issue 1: Different Embedding Models
Problem: External collection uses embedding model not in your project
Solution:
- Option A: Add the model to config.py and ensure it's installed
- Option B: Re-embed with a compatible model (requires space and time)
- Option C: Use Gemini API for embeddings if configured
Issue 2: Duplicate Collection Names
Problem: External collection has same name as existing collection
Solution:
- Rename the external collection before importing
- Or merge into existing collection (combines data)
Issue 3: Different ChromaDB Versions
Problem: External ChromaDB version incompatible with project
Solution:
- Export to common format (JSON/CSV)
- Re-import with compatible ChromaDB version
- Update ChromaDB:
pip install --upgrade chromadb
Issue 4: Metadata Mismatch
Problem: External collection metadata schema different from project
Solution:
- Map external metadata to project metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve original metadata for reference
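A sketch of this mapping step. The required fields mirror the examples used in this guide (chunking_strategy, chunk_size, chunk_overlap); adjust them to your project's actual schema. Note that chromadb metadata values must be scalars (str, int, float, bool), so nested structures need flattening first:

```python
# Hypothetical project schema with safe defaults for missing fields.
REQUIRED_DEFAULTS = {"chunking_strategy": "unknown", "chunk_size": 0, "chunk_overlap": 0}

def normalize_metadata(external: dict) -> dict:
    """Map external per-document metadata onto the project schema,
    preserving every original field under an `orig_` prefix."""
    meta = {f"orig_{k}": v for k, v in external.items()}
    for key, default in REQUIRED_DEFAULTS.items():
        meta[key] = external.get(key, default)
    return meta
```

Apply it to each record's metadata during the batch transfer, before calling add().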
Verification Checklist
After merging, verify:
- ✅ Collection appears in the "Existing Collections" dropdown in Streamlit
- ✅ Collection loads without errors
- ✅ Document count matches the expected total
- ✅ Documents can be queried and retrieved (test with a sample question)
- ✅ Retrieved documents have correct embeddings
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on the merged collection
- ✅ Both original and imported documents retrieve with correct distances
Quick Reference: Manual Merge Steps
If external collection is already in ChromaDB format:
1. Back up your current collection: `cp -r ./chroma_db ./chroma_db.backup`
2. Find the external ChromaDB location: `/path/to/external/chroma_db`
3. Copy the collection files: copy everything from `/path/to/external/chroma_db` to `./chroma_db`
4. Restart Streamlit: `streamlit run streamlit_app.py`
5. Check the Collections dropdown: the external collection should now appear
Recommended Merge Approach for Your Project
Best Practice: Programmatic Approach
- List all external collections β identify which to merge
- For each external collection:
- Export metadata (embedding model, chunking strategy, etc.)
- Get all documents and embeddings
- Create target collection in your project with matching metadata
- Batch insert documents in groups of 100-1000
- Validate: Test retrieval on merged collection
- Archive: Keep backup of external ChromaDB
Why This Approach?
- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)
Example Data Flow
    External ChromaDB
    └── Collection: "medical_docs_dense_mpnet"
        ├── 5000 documents
        ├── Embeddings: 768-dim (all-mpnet-base-v2)
        └── Metadata: chunking_strategy, chunk_size, etc.
            │
            │  [Extract documents, embeddings, metadata]
            ▼
    Your Project's ChromaDB
    └── New Collection: "medical_docs_dense_mpnet_imported"
        ├── Add 5000 documents in batches
        ├── Add corresponding embeddings
        ├── Add matching metadata
        └── Verify count: 5000 documents ✓
            │
            ▼
    Test & Validate
    ├── Query retrieval works ✓
    ├── Evaluation metrics compute ✓
    └── Merged collection ready for use ✓
Summary
| Step | Action | Time | Complexity |
|---|---|---|---|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| Total | Complete merge | 35-55 min | Medium |
Next Steps
Please provide:
- External ChromaDB path: Where is the external ChromaDB located?
- Collection name: What's the external collection called?
- Embedding model: Which embedding model does it use?
- Document count: Approximately how many documents?
- Metadata: What metadata is stored (chunking strategy, chunk size, etc.)?
Once you provide these details, I can create a specific merge script or detailed guidance tailored to your exact scenario.