# Merging External ChromaDB Collections
## Overview
This guide explains how to merge a ChromaDB collection created outside this project into the RAG Capstone project's own ChromaDB instance.
## Prerequisites
1. **Source ChromaDB**: The external collection must be accessible
2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default)
3. **Matching Embedding Model**: Both collections should use the same embedding model for consistency
4. **ChromaDB Version Compatibility**: Ensure both are using compatible ChromaDB versions
---
## Step-by-Step Merge Process
### **Step 1: Identify Collection Information**
**From the External ChromaDB:**
```
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
```
**From Your Project:**
```
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
```
### **Step 2: Verify Embedding Model Compatibility**
**Check if the external collection's embedding model is available in your project:**
From `config.py`, the available embedding models are:
```
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
```
**If NOT in the list:**
- Add the external embedding model to `config.py` embedding_models list
- Or re-embed all documents with a compatible model (more complex)
### **Step 3: Prepare the External Collection Data**
**Option A: Direct Copy of ChromaDB Directory** (Fastest)
```
1. Locate the external ChromaDB directory
2. Copy the external store's files into your ./chroma_db directory
3. ChromaDB will load them on the next start

Typical directory structure (ChromaDB >= 0.4):
./chroma_db/
├── chroma.sqlite3        <- collections, documents, and metadata
└── <segment-uuid>/       <- vector index files, one directory per segment
    ├── data_level0.bin
    ├── header.bin
    ├── length.bin
    └── link_lists.bin
```
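The copy-and-backup steps above can be scripted with the standard library. In this sketch the source and target paths are placeholders created under a temp directory so the snippet runs as-is; in practice, point `external` and `target` at the real stores. Because each store keeps its own `chroma.sqlite3`, this wholesale copy is only safe when the target directory is empty or you intend to replace it:

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# Stand-ins for the real directories.
external = root / "external_chroma_db"
target = root / "chroma_db"
external.mkdir()
(external / "chroma.sqlite3").write_bytes(b"")  # placeholder store file

# 1. Back up the current target store, if one exists.
if target.exists():
    shutil.copytree(target, root / "chroma_db.backup")

# 2. Copy the external store into place (replaces the target wholesale).
shutil.copytree(external, target, dirs_exist_ok=True)

print(sorted(p.name for p in target.iterdir()))  # -> ['chroma.sqlite3']
```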
**Option B: Export and Re-import** (Recommended)
Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project.
---
## Implementation Approaches
### **Approach 1: Manual Directory Merge**
**Steps:**
1. Stop the project (stop Streamlit app)
2. Back up your current `./chroma_db` directory
3. Copy external collection files to `./chroma_db`
4. Restart the project
5. Verify collection appears in "Existing Collections" dropdown
**Pros:** Fast, preserves embeddings
**Cons:** Each store has its own `chroma.sqlite3`, so a raw file copy replaces the target database rather than merging into it; it is only safe when `./chroma_db` is empty or being replaced entirely, and collection-name conflicts are possible
---
### **Approach 2: Programmatic Merge (Recommended)**
**High-level process:**
```
1. Connect to external ChromaDB
├─ Load external collection
├─ Extract all documents, embeddings, and metadata
2. Prepare target ChromaDB
├─ Create/get target collection in your project
├─ Match embedding model and metadata
3. Transfer documents
├─ Batch transfer documents to target collection
├─ Verify all documents transferred
├─ Handle duplicates (if any)
4. Verify merge
├─ Count documents match
├─ Test retrieval works
├─ Validate embeddings are correct
```
---
### **Approach 3: Using ChromaDB Export/Import**
**Steps:**
1. **Export from external ChromaDB:**
```
- Get all collections
- For each collection:
* Get collection metadata
* Export all documents + embeddings + metadata
* Save to JSON/Parquet files
```
2. **Import to your ChromaDB:**
```
- Create new collection with same metadata
- Add documents + embeddings + metadata in batches
- Verify document count and samples
```
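The export file format is up to you; a plain JSON dump of the dict shape that `collection.get(include=[...])` returns (`ids`, `documents`, `embeddings`, `metadatas`) is version-agnostic and survives ChromaDB upgrades. A sketch of the round trip, with the payload hardcoded so the snippet is self-contained:

```python
import json
import tempfile
from pathlib import Path

# Shape returned by collection.get(include=["documents", "embeddings", "metadatas"]),
# plus the collection's own name and metadata.
export = {
    "collection": "medical_docs_dense_mpnet",
    "metadata": {"embedding_model": "sentence-transformers/all-mpnet-base-v2"},
    "ids": ["doc-1"],
    "documents": ["aspirin reduces fever"],
    "embeddings": [[0.1, 0.2, 0.3]],
    "metadatas": [{"chunk_size": 512}],
}

path = Path(tempfile.mkdtemp()) / "export.json"
path.write_text(json.dumps(export))

# On the import side, feed these fields straight into collection.add(...).
restored = json.loads(path.read_text())
print(restored["collection"], len(restored["ids"]))
```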
---
## Handling Potential Issues
### **Issue 1: Different Embedding Models**
**Problem:** External collection uses embedding model not in your project
**Solution:**
- Option A: Add model to `config.py` and ensure it's installed
- Option B: Re-embed with a compatible model (requires space and time)
- Option C: Use Gemini API for embeddings if configured
### **Issue 2: Duplicate Collection Names**
**Problem:** External collection has same name as existing collection
**Solution:**
- Rename the external collection before importing
- Or merge into existing collection (combines data)
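When merging into an existing collection, document IDs can collide as well as collection names. One simple convention (an assumption, not something ChromaDB or this project enforces) is to namespace imported IDs with a source prefix and skip exact duplicates; returning positions lets the caller slice the matching documents and embeddings too:

```python
def namespace_ids(ids, existing_ids, prefix="ext"):
    """Return (position, new_id) pairs for imports that don't collide."""
    existing = set(existing_ids)
    return [
        (i, f"{prefix}:{doc_id}")
        for i, doc_id in enumerate(ids)
        if f"{prefix}:{doc_id}" not in existing
    ]

keep = namespace_ids(["doc-1", "doc-2"], existing_ids=["ext:doc-2"])
print(keep)  # [(0, 'ext:doc-1')] -- doc-2 already exists in the target, so it is skipped
```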
### **Issue 3: Different ChromaDB Versions**
**Problem:** External ChromaDB version incompatible with project
**Solution:**
- Export to common format (JSON/CSV)
- Re-import with compatible ChromaDB version
- Update ChromaDB: `pip install --upgrade chromadb`
### **Issue 4: Metadata Mismatch**
**Problem:** External collection metadata schema different from project
**Solution:**
- Map external metadata to project metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve original metadata for reference
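A sketch of that mapping. The target keys (`chunking_strategy`, `chunk_size`, `chunk_overlap`) are the project fields named earlier; the external key names in `FIELD_MAP` are hypothetical and would need to match the real external schema. Unmapped originals are kept under a `source_` prefix for reference:

```python
# Hypothetical external keys -> project keys.
FIELD_MAP = {"strategy": "chunking_strategy", "size": "chunk_size", "overlap": "chunk_overlap"}
DEFAULTS = {"chunking_strategy": "dense", "chunk_size": 512, "chunk_overlap": 50}

def map_metadata(external_meta):
    """Map one external metadata dict onto the project schema."""
    mapped = dict(DEFAULTS)
    for key, value in external_meta.items():
        if key in FIELD_MAP:
            mapped[FIELD_MAP[key]] = value
        else:
            mapped[f"source_{key}"] = value  # preserve unmapped originals
    return mapped

print(map_metadata({"strategy": "hybrid", "author": "ext"}))
# -> {'chunking_strategy': 'hybrid', 'chunk_size': 512, 'chunk_overlap': 50, 'source_author': 'ext'}
```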
---
## Verification Checklist
After merging, verify:
- ✅ Collection appears in "Existing Collections" dropdown in Streamlit
- ✅ Can load collection without errors
- ✅ Document count matches expected total
- ✅ Can query and retrieve documents (test with sample question)
- ✅ Retrieved documents have correct embeddings
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on merged collection
- ✅ Both original and imported documents retrieve with correct distances
---
## Quick Reference: Manual Merge Steps
If external collection is already in ChromaDB format:
1. **Backup your current collection:**
```
cp -r ./chroma_db ./chroma_db.backup
```
2. **Find external ChromaDB location:**
```
/path/to/external/chroma_db
```
3. **Copy collection files:**
```
cp -r /path/to/external/chroma_db/* ./chroma_db/
```
(Only safe if `./chroma_db` is empty; otherwise the external `chroma.sqlite3` collides with your own.)
4. **Restart Streamlit:**
```
streamlit run streamlit_app.py
```
5. **Check Collections dropdown:**
- External collection should now appear
---
## Recommended Merge Approach for Your Project
### **Best Practice: Programmatic Approach**
1. **List all external collections** → identify which to merge
2. **For each external collection:**
- Export metadata (embedding model, chunking strategy, etc.)
- Get all documents and embeddings
- Create target collection in your project with matching metadata
- Batch insert documents in groups of 100-1000
3. **Validate:** Test retrieval on merged collection
4. **Archive:** Keep backup of external ChromaDB
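The batch insert in step 2 can be driven by a small generator; batch size is the only knob, and the 100-1000 range above is the guideline:

```python
def batches(items, batch_size=500):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f"doc-{i}" for i in range(1200)]
sizes = [len(b) for b in batches(ids, batch_size=500)]
print(sizes)  # [500, 500, 200]
```

Call `target.add(...)` once per slice, slicing `ids`, `documents`, `embeddings`, and `metadatas` in lockstep.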
### **Why This Approach?**
- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)
---
## Example Data Flow
```
External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
        ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
        ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓
```
---
## Summary
| Step | Action | Time | Complexity |
|------|--------|------|-----------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| **Total** | Complete merge | **35-55 min** | **Medium** |
---
## Next Steps
Before writing a merge script, gather the following about the external collection:
1. **External ChromaDB path:** Where the external ChromaDB is located
2. **Collection name:** What the external collection is called
3. **Embedding model:** Which embedding model it uses
4. **Document count:** Approximately how many documents it holds
5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)
With these details in hand, adapt the programmatic approach above into a merge script for your exact scenario.