# Merging External ChromaDB Collections

## Overview

This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance.

## Prerequisites

1. **Source ChromaDB**: The external collection must be accessible
2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default)
3. **Matching Embedding Model**: Vectors produced by different models are not comparable, so documents merged into one collection must share the same embedding model and dimension
4. **ChromaDB Version Compatibility**: Ensure both stores were written by compatible ChromaDB versions

---

## Step-by-Step Merge Process

### **Step 1: Identify Collection Information**

**From the External ChromaDB:**

```
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
```

**From Your Project:**

```
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
```

### **Step 2: Verify Embedding Model Compatibility**

**Check whether the external collection's embedding model is available in your project.**

From `config.py`, the available embedding models are:

```
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
```

**If NOT in the list:**

- Add the external embedding model to the `embedding_models` list in `config.py`
- Or re-embed all documents with a compatible model (more complex)

### **Step 3: Prepare the External Collection Data**

**Option A: Direct Copy of ChromaDB Directory** (Fastest, but replaces rather than merges)

```
1. Locate the external ChromaDB directory
2. Copy the whole directory to ./chroma_db (the target must be empty or backed up)
3. ChromaDB loads the copied collections on the next start

Typical layout of a persistent store (ChromaDB 0.4 and later; exact files vary by version):
./chroma_db/
├── chroma.sqlite3
└── <segment-uuid>/
    ├── data_level0.bin
    ├── header.bin
    ├── length.bin
    └── link_lists.bin
```

A persistent store keeps its collection registry in a single `chroma.sqlite3` file, so copying one store on top of another overwrites the target's registry instead of combining the two. Use this option only when `./chroma_db` holds no collections you need to keep.

**Option B: Export and Re-import** (Recommended)

Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project.

---

## Implementation Approaches

### **Approach 1: Manual Directory Merge**

**Steps:**

1. Stop the project (stop the Streamlit app)
2. Back up your current `./chroma_db` directory
3. Copy the external store's files to `./chroma_db`
4. Restart the project
5. Verify the collection appears in the "Existing Collections" dropdown

**Pros:** Fast, preserves embeddings

**Cons:** Overwrites the collections already in `./chroma_db`; it cannot combine two stores

---

### **Approach 2: Programmatic Merge (Recommended)**

**High-level process:**

```
1. Connect to external ChromaDB
   ├─ Load external collection
   └─ Extract all documents, embeddings, and metadata

2. Prepare target ChromaDB
   ├─ Create/get target collection in your project
   └─ Match embedding model and metadata

3. Transfer documents
   ├─ Batch transfer documents to target collection
   ├─ Verify all documents transferred
   └─ Handle duplicates (if any)

4. Verify merge
   ├─ Count documents match
   ├─ Test retrieval works
   └─ Validate embeddings are correct
```

---

### **Approach 3: Using ChromaDB Export/Import**

**Steps:**

1. **Export from external ChromaDB:**

   ```
   - Get all collections
   - For each collection:
     * Get collection metadata
     * Export all documents + embeddings + metadata
     * Save to JSON/Parquet files
   ```

2. **Import to your ChromaDB:**

   ```
   - Create new collection with same metadata
   - Add documents + embeddings + metadata in batches
   - Verify document count and samples
   ```

---

## Handling Potential Issues

### **Issue 1: Different Embedding Models**

**Problem:** The external collection uses an embedding model that is not in your project

**Solution:**

- Option A: Add the model to `config.py` and ensure it is installed
- Option B: Re-embed the documents with a compatible model (requires space and time)
- Option C: Re-embed the documents through the Gemini API (`gemini-embedding-001`) if it is configured

### **Issue 2: Duplicate Collection Names**

**Problem:** The external collection has the same name as an existing collection

**Solution:**

- Rename the external collection before importing
- Or merge into the existing collection (combines the data; both must use the same embedding model)

### **Issue 3: Different ChromaDB Versions**

**Problem:** The external ChromaDB version is incompatible with the project's

**Solution:**

- Export to a common format (JSON/CSV)
- Re-import with a compatible ChromaDB version
- Update ChromaDB: `pip install --upgrade chromadb`

### **Issue 4: Metadata Mismatch**

**Problem:** The external collection's metadata schema differs from the project's

**Solution:**

- Map external metadata to the project's metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve the original metadata for reference

---

## Verification Checklist

After merging, verify:

- ✅ Collection appears in the "Existing Collections" dropdown in Streamlit
- ✅ Collection loads without errors
- ✅ Document count matches the expected total
- ✅ Queries retrieve documents (test with a sample question)
- ✅ Stored embeddings have the expected dimension (e.g., 768 for all-mpnet-base-v2)
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on the merged collection
- ✅ Both original and imported documents are retrieved with plausible distances

---

## Quick Reference: Manual Merge Steps

If the external collection is already in ChromaDB format and `./chroma_db` holds no collections you need to keep:

1. **Back up your current collection:**

   ```
   cp -r ./chroma_db ./chroma_db.backup
   ```

2. **Find external ChromaDB location:**

   ```
   /path/to/external/chroma_db
   ```

3. **Copy collection files:**

   ```
   # Overwrites chroma.sqlite3 in the target; restore from the backup if needed
   cp -r /path/to/external/chroma_db/. ./chroma_db/
   ```

4. **Restart Streamlit:**

   ```
   streamlit run streamlit_app.py
   ```

5. **Check Collections dropdown:**
   - External collection should now appear

---

## Recommended Merge Approach for Your Project

### **Best Practice: Programmatic Approach**

1. **List all external collections** → identify which to merge
2. **For each external collection:**
   - Export metadata (embedding model, chunking strategy, etc.)
   - Get all documents and embeddings
   - Create the target collection in your project with matching metadata
   - Batch insert documents in groups of 100-1000
3. **Validate:** Test retrieval on the merged collection
4. **Archive:** Keep a backup of the external ChromaDB

### **Why This Approach?**

- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)

---

## Example Data Flow

```
External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
        ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
        ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓
```

---

## Summary

| Step | Action | Time | Complexity |
|------|--------|------|------------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Back up current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| **Total** | Complete merge | **35-55 min** | **Medium** |

---

## Next Steps

Please provide:

1. **External ChromaDB path:** Where is the external ChromaDB located?
2. **Collection name:** What is the external collection called?
3. **Embedding model:** Which embedding model does it use?
4. **Document count:** Approximately how many documents?
5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)?

Once you provide these details, I can create a specific merge script or detailed guidance tailored to your exact scenario.
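
---

## Appendix: Programmatic Merge Sketch

Until those details are known, the programmatic approach recommended above can be sketched generically. The following is a minimal sketch, not the project's actual merge script: the helper names (`open_collection`, `iter_batches`, `merge_collection`, `run_merge`), the `_imported` suffix, and the batch size of 500 are all assumptions made for illustration. It relies only on `chromadb.PersistentClient`, `get_collection`, `get_or_create_collection`, and the collection methods `count`, `get`, and `add`. It also assumes that paging with `limit`/`offset` returns records in a stable order, and that opening a collection without naming its embedding function is acceptable because the stored vectors are copied as-is rather than recomputed.

```python
from typing import Any, Dict, Iterator, Optional


def open_collection(path: str, name: str, create: bool = False,
                    metadata: Optional[Dict[str, Any]] = None):
    """Open (or create) a collection in a persistent ChromaDB store."""
    import chromadb  # imported lazily so the helpers below work without it

    client = chromadb.PersistentClient(path=path)
    if create:
        # An empty metadata dict is replaced by None before the call.
        return client.get_or_create_collection(name=name, metadata=metadata or None)
    return client.get_collection(name=name)


def iter_batches(source, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield the source collection's records page by page."""
    total = source.count()
    for offset in range(0, total, batch_size):
        yield source.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "embeddings", "metadatas"],
        )


def merge_collection(source, target, batch_size: int = 500,
                     id_prefix: str = "") -> int:
    """Copy every record from source into target; return the number copied.

    id_prefix (hypothetical convenience) is prepended to each ID to avoid
    clashes when the target already contains records.
    """
    copied = 0
    for batch in iter_batches(source, batch_size):
        ids = [f"{id_prefix}{record_id}" for record_id in batch["ids"]]
        if not ids:
            continue
        # Stored embeddings are passed through, so nothing is re-embedded.
        target.add(
            ids=ids,
            documents=batch["documents"],
            embeddings=batch["embeddings"],
            metadatas=batch["metadatas"],
        )
        copied += len(ids)
    return copied


def run_merge(external_path: str, name: str, target_path: str = "./chroma_db",
              suffix: str = "_imported") -> int:
    """Merge one external collection into the project store and check counts."""
    source = open_collection(external_path, name)
    target = open_collection(target_path, name + suffix, create=True,
                             metadata=source.metadata)
    before = target.count()
    copied = merge_collection(source, target)
    # The target should have grown by exactly the size of the source.
    if target.count() - before != source.count():
        raise RuntimeError("document count mismatch after merge")
    return copied
```

Under these assumptions, `run_merge("/path/to/external/chroma_db", "medical_docs_dense_mpnet")` would copy the example collection into `./chroma_db` as `medical_docs_dense_mpnet_imported`. Because the external store is only read, it stays untouched, which is what makes this route recoverable. Back up `./chroma_db` first all the same.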