| # Merging External ChromaDB Collections | |
| ## Overview | |
| This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance. | |
| ## Prerequisites | |
| 1. **Source ChromaDB**: The external collection's persistence directory must be readable from the machine that runs the merge | |
| 2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default) | |
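| 3. **Matching Embedding Model**: Your project must query the imported collection with the same embedding model that produced its stored vectors, and data should only be combined into one collection when both sides use that same model | |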
| 4. **ChromaDB Version Compatibility**: Ensure both are using compatible ChromaDB versions | |
| --- | |
| ## Step-by-Step Merge Process | |
| ### **Step 1: Identify Collection Information** | |
| **From the External ChromaDB:** | |
| ``` | |
| - Source directory path: /path/to/external/chroma_db | |
| - Collection name: (e.g., "medical_docs_dense_mpnet") | |
| - Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2") | |
| - Chunking strategy: (e.g., "dense", "sparse", "hybrid") | |
| - Chunk size: (e.g., 512) | |
| - Chunk overlap: (e.g., 50) | |
| - Total documents/chunks: ? | |
| ``` | |
| **From Your Project:** | |
| ``` | |
| - Target directory: ./chroma_db (default, or configured in settings) | |
| - Existing collections: ? | |
| - Available embedding models: (check config.py) | |
| ``` | |
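The inventory above can be gathered with a few lines of Python. This is a minimal sketch, assuming both databases were written with `chromadb.PersistentClient`; the helper names `open_client`, `summarize_collections`, and `print_inventory` are illustrative and not part of the project.

```python
def open_client(path):
    """Open a persistent ChromaDB client for the directory at `path`."""
    # Imported lazily so summarize_collections can be reused without chromadb installed.
    import chromadb
    return chromadb.PersistentClient(path=path)


def summarize_collections(client):
    """Return {collection_name: chunk_count} for every collection in `client`."""
    summary = {}
    for entry in client.list_collections():
        # Depending on the ChromaDB version, entries are Collection objects or plain names.
        name = entry if isinstance(entry, str) else entry.name
        summary[name] = client.get_collection(name).count()
    return summary


def print_inventory(label, path):
    """Print one line per collection found in the ChromaDB directory at `path`."""
    for name, count in sorted(summarize_collections(open_client(path)).items()):
        print(f"{label}: {name} ({count} chunks)")
```

Calling `print_inventory("external", "/path/to/external/chroma_db")` and then `print_inventory("project", "./chroma_db")` fills in the two question marks. Note that `PersistentClient` creates the directory if it does not exist, so an empty listing may simply mean the path is wrong.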
| ### **Step 2: Verify Embedding Model Compatibility** | |
| **Check if the external collection's embedding model is available in your project:** | |
| From `config.py`, the available embedding models are: | |
| ``` | |
| - sentence-transformers/all-mpnet-base-v2 | |
| - emilyalsentzer/Bio_ClinicalBERT | |
| - microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract | |
| - sentence-transformers/all-MiniLM-L6-v2 | |
| - sentence-transformers/multilingual-MiniLM-L12-v2 | |
| - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | |
| - allenai/specter | |
| - gemini-embedding-001 | |
| ``` | |
| **If NOT in the list:** | |
| - Add the external embedding model to the `embedding_models` list in `config.py` | |
| - Or re-embed all documents with a model that is already listed (slower, because every chunk must be embedded again) | |
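A quick sanity check is to compare the dimensionality of the stored vectors with what the claimed model produces. The sketch below assumes you pass in a ChromaDB collection object; `stored_embedding_dim` and `check_compatibility` are hypothetical helper names, and the list of available models is whatever your `config.py` defines.

```python
def stored_embedding_dim(collection):
    """Return the dimensionality of the collection's vectors, or None if it is empty."""
    result = collection.get(limit=1, include=["embeddings"])
    embeddings = result["embeddings"]
    # Newer ChromaDB versions return a NumPy array here, so avoid truth-testing it.
    if embeddings is None or len(embeddings) == 0:
        return None
    return len(embeddings[0])


def check_compatibility(collection, model_name, available_models, expected_dim):
    """Collect human-readable problems; an empty list means no mismatch was detected."""
    problems = []
    if model_name not in available_models:
        problems.append(f"{model_name} is not listed in config.py")
    dim = stored_embedding_dim(collection)
    if dim is not None and dim != expected_dim:
        problems.append(f"stored vectors are {dim}-dim, expected {expected_dim}-dim")
    return problems
```

A matching dimension is necessary but not sufficient: `all-mpnet-base-v2` and `Bio_ClinicalBERT` both produce 768-dimensional vectors, so the model name recorded for the external collection still has to be trusted or confirmed separately.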
| ### **Step 3: Prepare the External Collection Data** | |
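| **Option A: Direct Copy of ChromaDB Directory** (Fastest, but only into an empty target) | |
| ``` | |
| 1. Locate the external ChromaDB directory | |
| 2. Copy the whole directory (chroma.sqlite3 plus its index folders) as one unit | |
| 3. Use the copy as ./chroma_db, or point your project at it | |
| Note: copying into a ./chroma_db that already holds collections overwrites | |
| its chroma.sqlite3, which replaces those collections instead of merging them. | |
| Typical directory structure (recent ChromaDB releases; older releases differ): | |
| ./chroma_db/ | |
| ├── chroma.sqlite3 | |
| └── <segment-uuid>/ | |
|     ├── data_level0.bin | |
|     ├── header.bin | |
|     ├── length.bin | |
|     └── link_lists.bin | |
| ``` | |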
| **Option B: Export and Re-import** (Recommended) | |
| Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project. | |
| --- | |
| ## Implementation Approaches | |
| ### **Approach 1: Manual Directory Merge** | |
| **Steps:** | |
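| 1. Stop the project (stop the Streamlit app) | |
| 2. Back up your current `./chroma_db` directory | |
| 3. Replace `./chroma_db` with a copy of the external ChromaDB directory (copy the whole directory, not individual files) | |
| 4. Restart the project | |
| 5. Verify the collection appears in the "Existing Collections" dropdown | |
| **Pros:** Fast, preserves embeddings | |
| **Cons:** Replaces the target database rather than merging into it, so collections that exist only in your old `./chroma_db` are lost unless you restore the backup; use Approach 2 to combine collections from two directories | |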
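Because the manual approach touches the database files directly, the backup step is worth scripting. The following is a small sketch using only the standard library; `backup_directory` is a hypothetical helper, and the timestamped naming scheme is an assumption rather than a project convention.

```python
import shutil
import time
from pathlib import Path


def backup_directory(source="./chroma_db", backup_root="."):
    """Copy `source` to a timestamped sibling directory and return the new path."""
    src = Path(source)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}.backup-{stamp}"
    # copytree refuses to write into an existing directory, so backups are never overwritten.
    shutil.copytree(src, dest)
    return dest
```

Run it only after the Streamlit app has been stopped, so the SQLite file is not copied in the middle of a write.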
| --- | |
| ### **Approach 2: Programmatic Merge (Recommended)** | |
| **High-level process:** | |
| ``` | |
| 1. Connect to external ChromaDB | |
|    ├─ Load external collection | |
|    └─ Extract all documents, embeddings, and metadata | |
| 2. Prepare target ChromaDB | |
|    ├─ Create/get target collection in your project | |
|    └─ Match embedding model and metadata | |
| 3. Transfer documents | |
|    ├─ Batch transfer documents to target collection | |
|    ├─ Verify all documents transferred | |
|    └─ Handle duplicates (if any) | |
| 4. Verify merge | |
|    ├─ Count documents match | |
|    ├─ Test retrieval works | |
|    └─ Validate embeddings are correct | |
| ``` | |
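The process above could be sketched as follows. This is one possible implementation, not the project's own script: it assumes both databases are opened with `chromadb.PersistentClient`, pages through the source with `Collection.get(limit=..., offset=...)`, and reuses the stored embeddings so nothing is re-embedded. The function names are illustrative.

```python
def iter_batches(source, batch_size=500):
    """Yield the source collection's records page by page."""
    offset = 0
    while True:
        page = source.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "embeddings", "metadatas"],
        )
        if len(page["ids"]) == 0:
            break
        yield page
        offset += len(page["ids"])


def copy_collection(source, target, batch_size=500):
    """Copy every record from `source` into `target`, reusing the stored embeddings."""
    copied = 0
    for page in iter_batches(source, batch_size):
        target.add(
            ids=page["ids"],
            documents=page["documents"],
            embeddings=page["embeddings"],
            metadatas=page["metadatas"],
        )
        copied += len(page["ids"])
    return copied


def merge(external_path, project_path, source_name, target_name):
    """Open both databases and copy one collection across; returns the record count."""
    import chromadb  # lazy import: only needed when talking to real databases
    source = chromadb.PersistentClient(path=external_path).get_collection(source_name)
    target = chromadb.PersistentClient(path=project_path).get_or_create_collection(
        target_name, metadata=source.metadata or None
    )
    return copy_collection(source, target, batch_size=500)
```

`add` does not overwrite records whose IDs already exist in the target; if the imported version should win on ID collisions, swapping it for `upsert` is the usual alternative.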
| --- | |
| ### **Approach 3: Using ChromaDB Export/Import** | |
| **Steps:** | |
| 1. **Export from external ChromaDB:** | |
| ``` | |
| - Get all collections | |
| - For each collection: | |
| * Get collection metadata | |
| * Export all documents + embeddings + metadata | |
| * Save to JSON/Parquet files | |
| ``` | |
| 2. **Import to your ChromaDB:** | |
| ``` | |
| - Create new collection with same metadata | |
| - Add documents + embeddings + metadata in batches | |
| - Verify document count and samples | |
| ``` | |
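ChromaDB has no built-in dump format that this guide relies on, so the export file has to be written by hand. The sketch below uses JSON Lines (one record per line) as an assumed interchange format; `export_collection` and `import_collection` are hypothetical helpers operating on ChromaDB collection objects.

```python
import json


def export_collection(collection, path, batch_size=500):
    """Write one JSON object per record to `path`; return the number of records written."""
    written, offset = 0, 0
    with open(path, "w", encoding="utf-8") as fh:
        while True:
            page = collection.get(limit=batch_size, offset=offset,
                                  include=["documents", "embeddings", "metadatas"])
            ids = page["ids"]
            if len(ids) == 0:
                break
            for i, record_id in enumerate(ids):
                fh.write(json.dumps({
                    "id": record_id,
                    "document": page["documents"][i],
                    # Cast to plain floats: newer ChromaDB versions return NumPy values.
                    "embedding": [float(x) for x in page["embeddings"][i]],
                    "metadata": page["metadatas"][i],
                }) + "\n")
            written += len(ids)
            offset += len(ids)
    return written


def import_collection(collection, path, batch_size=500):
    """Read a file produced by export_collection and add its records in batches."""
    batch, total = [], 0

    def flush():
        collection.add(
            ids=[r["id"] for r in batch],
            documents=[r["document"] for r in batch],
            embeddings=[r["embedding"] for r in batch],
            metadatas=[r["metadata"] for r in batch],
        )

    with open(path, encoding="utf-8") as fh:
        for line in fh:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                flush()
                total += len(batch)
                batch.clear()
    if batch:
        flush()
        total += len(batch)
    return total
```

Because the file is plain text, the export can run under the external ChromaDB version and the import under the project's version, which is what makes this approach useful when the two are incompatible.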
| --- | |
| ## Handling Potential Issues | |
| ### **Issue 1: Different Embedding Models** | |
| **Problem:** The external collection uses an embedding model that is not in your project | |
| **Solution:** | |
| - Option A: Add the model to `config.py` and ensure it's installed | |
| - Option B: Re-embed the documents with a compatible model (requires space and time) | |
| - Option C: Re-embed the documents through the Gemini API (`gemini-embedding-001`) if it is configured | |
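Re-embedding means keeping the IDs, documents, and metadata but discarding the stored vectors. A minimal sketch, assuming `embed_fn` is any callable you supply that maps a list of strings to a list of vectors; `reembed_collection` is an illustrative name.

```python
def reembed_collection(source, target, embed_fn, batch_size=100):
    """Copy documents and metadata from `source`, computing fresh vectors with `embed_fn`."""
    offset, total = 0, 0
    while True:
        page = source.get(limit=batch_size, offset=offset,
                          include=["documents", "metadatas"])
        ids = page["ids"]
        if len(ids) == 0:
            break
        target.add(
            ids=ids,
            documents=page["documents"],
            # The stored vectors are ignored; embed_fn produces the replacements.
            embeddings=embed_fn(page["documents"]),
            metadatas=page["metadatas"],
        )
        offset += len(ids)
        total += len(ids)
    return total
```

With the `sentence-transformers` package, `embed_fn` could be `lambda texts: model.encode(texts).tolist()` for a loaded `SentenceTransformer` model. The target should be a new collection, since mixing old and new vectors in one collection makes distances meaningless.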
| ### **Issue 2: Duplicate Collection Names** | |
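| **Problem:** The external collection has the same name as an existing collection | |
| **Solution:** | |
| - Import the external collection under a different name | |
| - Or merge it into the existing collection (combines data); do this only if both use the same embedding model, and check that record IDs do not collide | |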
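Choosing the new name can be automated. The sketch below appends a suffix until the name is free; the `_imported` suffix is an assumed convention and `resolve_collection_name` is a hypothetical helper.

```python
def resolve_collection_name(desired, existing, suffix="_imported"):
    """Return `desired` if it is free, otherwise the first unused suffixed variant."""
    existing = set(existing)
    if desired not in existing:
        return desired
    candidate = desired + suffix
    counter = 2
    while candidate in existing:
        candidate = f"{desired}{suffix}_{counter}"
        counter += 1
    return candidate
```

For example, `resolve_collection_name("medical_docs_dense_mpnet", ["medical_docs_dense_mpnet"])` returns `"medical_docs_dense_mpnet_imported"`. ChromaDB also restricts collection name length and characters, which this helper does not check.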
| ### **Issue 3: Different ChromaDB Versions** | |
| **Problem:** The external ChromaDB directory was written by a version that is incompatible with the one your project uses | |
| **Solution:** | |
| - Export to common format (JSON/CSV) | |
| - Re-import with compatible ChromaDB version | |
| - Update ChromaDB: `pip install --upgrade chromadb` | |
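Before upgrading anything, it helps to record which version each environment actually has. A small standard-library sketch; `installed_version` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="chromadb"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

Run `print(installed_version())` once in the environment that created the external collection and once in the project's environment, then compare the two values.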
| ### **Issue 4: Metadata Mismatch** | |
| **Problem:** The external collection's metadata schema differs from the one your project expects | |
| **Solution:** | |
| - Map external metadata fields to the project's metadata structure | |
| - Add missing fields (chunking_strategy, chunk_size, etc.) | |
| - Preserve the original metadata fields for reference | |
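The three steps above can be applied per record before it is added to the target. This is a sketch under assumptions: the key mapping and default values are placeholders you would define for your own schema, and renamed fields are preserved under an `ext_` prefix because ChromaDB metadata values must be flat strings, numbers, or booleans.

```python
def map_metadata(external, key_map, defaults, keep_prefix="ext_"):
    """Translate one external metadata dict into the project's schema."""
    mapped = dict(defaults)  # start from the project's required fields
    for key, value in (external or {}).items():
        if key in key_map:
            mapped[key_map[key]] = value        # value under the project's field name
            mapped[keep_prefix + key] = value   # original field kept for reference
        else:
            mapped[key] = value                 # unknown fields pass through unchanged
    return mapped
```

Applied to a page of records, this becomes `[map_metadata(m, key_map, defaults) for m in page["metadatas"]]`, which also replaces any missing (`None`) metadata with the defaults.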
| --- | |
| ## Verification Checklist | |
| After merging, verify: | |
| - ✅ Collection appears in "Existing Collections" dropdown in Streamlit | |
| - ✅ Can load collection without errors | |
| - ✅ Document count matches expected total | |
| - ✅ Can query and retrieve documents (test with sample question) | |
| - ✅ Retrieved documents have correct embeddings | |
| - ✅ Metadata is preserved correctly | |
| - ✅ Evaluation metrics run without errors on merged collection | |
| - ✅ Both original and imported documents retrieve with correct distances | |
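Part of this checklist can be automated. The sketch below compares counts and spot-checks a few records by querying the target with each record's own stored vector; it assumes the target is a fresh collection holding only the imported data, and `verify_merge` is a hypothetical helper.

```python
def verify_merge(source, target, sample_size=5):
    """Return a list of detected problems; an empty list means the checks passed."""
    problems = []
    if source.count() != target.count():
        problems.append(f"count mismatch: source={source.count()} target={target.count()}")
    sample = source.get(limit=sample_size, include=["documents", "embeddings"])
    for record_id, document, embedding in zip(
        sample["ids"], sample["documents"], sample["embeddings"]
    ):
        copy = target.get(ids=[record_id], include=["documents"])
        if len(copy["ids"]) == 0:
            problems.append(f"{record_id} is missing from the target")
            continue
        if copy["documents"][0] != document:
            problems.append(f"{record_id} has different text in the target")
        # Querying with a record's own vector should return that record's text first.
        hit = target.query(
            query_embeddings=[[float(x) for x in embedding]],
            n_results=1,
            include=["documents"],
        )
        if hit["documents"][0][0] != document:
            problems.append(f"{record_id} is not its own nearest neighbour")
    return problems
```

The nearest-neighbour check compares text rather than IDs so that duplicate chunks do not raise false alarms. The Streamlit dropdown and the evaluation metrics still need to be checked by hand.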
| --- | |
| ## Quick Reference: Manual Merge Steps | |
| If the external collection is already in ChromaDB format and your `./chroma_db` holds nothing you need to keep: | |
| 1. **Back up your current ChromaDB directory:** | |
| ``` | |
| cp -r ./chroma_db ./chroma_db.backup | |
| ``` | |
| 2. **Find external ChromaDB location:** | |
| ``` | |
| /path/to/external/chroma_db | |
| ``` | |
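| 3. **Replace `./chroma_db` with the external directory (this overwrites your existing collections; it does not merge them):** | |
| ``` | |
| rm -rf ./chroma_db && cp -r /path/to/external/chroma_db ./chroma_db | |
| ``` | |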
| 4. **Restart Streamlit:** | |
| ``` | |
| streamlit run streamlit_app.py | |
| ``` | |
| 5. **Check Collections dropdown:** | |
| - External collection should now appear | |
| --- | |
| ## Recommended Merge Approach for Your Project | |
| ### **Best Practice: Programmatic Approach** | |
| 1. **List all external collections** → identify which to merge | |
| 2. **For each external collection:** | |
| - Export metadata (embedding model, chunking strategy, etc.) | |
| - Get all documents and embeddings | |
| - Create target collection in your project with matching metadata | |
| - Batch insert documents in groups of 100-1000 | |
| 3. **Validate:** Test retrieval on merged collection | |
| 4. **Archive:** Keep backup of external ChromaDB | |
| ### **Why This Approach?** | |
| - ✅ Safe (no direct file manipulation) | |
| - ✅ Controllable (can inspect data during transfer) | |
| - ✅ Traceable (logs what was merged) | |
| - ✅ Flexible (can transform data if needed) | |
| - ✅ Recoverable (original external collection untouched) | |
| --- | |
| ## Example Data Flow | |
| ``` | |
| External ChromaDB | |
| ├── Collection: "medical_docs_dense_mpnet" | |
| │   ├── 5000 documents | |
| │   ├── Embeddings: 768-dim (all-mpnet-base-v2) | |
| │   └── Metadata: chunking_strategy, chunk_size, etc. | |
| │ | |
| └── [Extract documents, embeddings, metadata] | |
|         ↓ | |
| Your Project's ChromaDB | |
| └── New Collection: "medical_docs_dense_mpnet_imported" | |
|     ├── Add 5000 documents in batches | |
|     ├── Add corresponding embeddings | |
|     ├── Add matching metadata | |
|     └── Verify count: 5000 documents ✓ | |
|         ↓ | |
| Test & Validate | |
| ├── Query retrieval works ✓ | |
| ├── Evaluation metrics compute ✓ | |
| └── Merged collection ready for use ✓ | |
| ``` | |
| --- | |
| ## Summary | |
| | Step | Action | Time | Complexity | | |
| |------|--------|------|-----------| | |
| | 1 | Identify collection info | 5 min | Low | | |
| | 2 | Verify embedding model | 5 min | Low | | |
| | 3 | Backup current data | 5 min | Low | | |
| | 4 | Perform merge | 10-30 min | Medium | | |
| | 5 | Verify merge success | 10 min | Medium | | |
| | **Total** | Complete merge | **35-55 min** | **Medium** | | |
| --- | |
| ## Next Steps | |
| Please provide: | |
| 1. **External ChromaDB path:** Where is the external ChromaDB located? | |
| 2. **Collection name:** What's the external collection called? | |
| 3. **Embedding model:** Which embedding model does it use? | |
| 4. **Document count:** Approximately how many documents? | |
| 5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)? | |
| Once you provide these details, I can create a specific merge script or detailed guidance tailored to your exact scenario. | |