# Merging External ChromaDB Collections

## Overview

This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance.

## Prerequisites

1. **Source ChromaDB**: The external collection must be accessible
2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default)
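3. **Matching Embedding Model**: Queries must be embedded with the same model that produced the collection's vectors, so that model must be available in your project; merging into an existing collection additionally requires that both collections use the same model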
4. **ChromaDB Version Compatibility**: Ensure both are using compatible ChromaDB versions

---

## Step-by-Step Merge Process

### **Step 1: Identify Collection Information**

**From the External ChromaDB:**
```
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
```

**From Your Project:**
```
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
```
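The checklist above can be filled in programmatically. The following is a minimal sketch, assuming the external data was written with `chromadb.PersistentClient`; the helper name `describe_collections` is hypothetical and not part of the project.

```python
def describe_collections(client):
    """Return name, item count, and metadata for every collection in a client."""
    summary = []
    for entry in client.list_collections():
        # Depending on the chromadb version, list_collections() yields
        # Collection objects or plain names; handle both.
        name = getattr(entry, "name", entry)
        collection = client.get_collection(name=name)
        summary.append(
            {
                "name": name,
                "count": collection.count(),
                "metadata": collection.metadata or {},
            }
        )
    return summary
```

For example, `describe_collections(chromadb.PersistentClient(path="/path/to/external/chroma_db"))` lists the external collections, and the same call with `path="./chroma_db"` lists your project's. Note that `PersistentClient` creates an empty database if the path does not exist, so an empty result may simply mean the path is wrong.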

### **Step 2: Verify Embedding Model Compatibility**

**Check if the external collection's embedding model is available in your project:**

From `config.py`, the available embedding models are:
```
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
```

**If NOT in the list:**
- Add the external embedding model to the `embedding_models` list in `config.py`
- Or re-embed all documents with a model that is already in the list (more complex)
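A small check along these lines can catch a mismatch before any data is copied. This is a sketch only: `AVAILABLE_MODELS` is a hypothetical copy of the list above (the real source of truth is `config.py`), and both function names are illustrative.

```python
# Hypothetical copy of the model list shown above; in the project the
# authoritative list lives in config.py.
AVAILABLE_MODELS = [
    "sentence-transformers/all-mpnet-base-v2",
    "emilyalsentzer/Bio_ClinicalBERT",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/multilingual-MiniLM-L12-v2",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "allenai/specter",
    "gemini-embedding-001",
]


def is_model_available(model_name, available=AVAILABLE_MODELS):
    """Return True if the model name appears in the list (exact match)."""
    return model_name in available


def embedding_dimension(collection):
    """Return the length of the first stored vector, or None if empty."""
    sample = collection.get(limit=1, include=["embeddings"])
    embeddings = sample["embeddings"]
    if embeddings is None or len(embeddings) == 0:
        return None
    return len(embeddings[0])
```

Comparing `embedding_dimension(...)` for the external collection with the dimension your chosen model produces (768 for `all-mpnet-base-v2`) is a quick sanity check; a matching dimension does not prove that the same model was used.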

### **Step 3: Prepare the External Collection Data**

**Option A: Direct Copy of ChromaDB Directory** (Fastest, but replaces rather than merges)
```
1. Locate the external ChromaDB directory
2. Copy the whole directory as a unit; do not mix its files into an
   existing ./chroma_db
3. Use the copy as ./chroma_db only if you have no existing collections
   to keep, or open it as a separate persistent directory

Typical directory structure (ChromaDB 0.4 and later; older releases
used a different, Parquet-based layout):
  ./chroma_db/
    ├── chroma.sqlite3
    └── <uuid>/
        └── HNSW index files (*.bin)
```

**Option B: Export and Re-import** (Recommended)

Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project.

---

## Implementation Approaches

### **Approach 1: Manual Directory Merge**

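**Steps:**
1. Stop the project (stop the Streamlit app)
2. Back up your current `./chroma_db` directory
3. Replace `./chroma_db` with a copy of the external ChromaDB directory
4. Restart the project
5. Verify the collection appears in the "Existing Collections" dropdown

**Pros:** Fast, preserves embeddings

**Cons:** Not a true merge. Recent ChromaDB versions keep the catalog of all collections in a single `chroma.sqlite3` file, so copying the external directory over yours replaces your existing collections instead of adding to them. Use this only when the project has no collections worth keeping.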
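The backup step can be scripted so that each backup gets its own timestamped directory. This is a minimal sketch using only the standard library; the function name and the `.backup-<timestamp>` naming are assumptions, not project conventions.

```python
import shutil
import time
from pathlib import Path


def backup_chroma_dir(db_dir="./chroma_db"):
    """Copy the ChromaDB directory to a timestamped sibling and return its path."""
    src = Path(db_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.backup-{stamp}")
    # copytree refuses to overwrite an existing destination, which
    # protects earlier backups from being clobbered.
    shutil.copytree(src, dst)
    return dst
```

Run it while the Streamlit app is stopped, so the SQLite file is not copied in the middle of a write.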

---

### **Approach 2: Programmatic Merge (Recommended)**

**High-level process:**

```
1. Connect to external ChromaDB
   ├─ Load external collection
   ├─ Extract all documents, embeddings, and metadata
   
2. Prepare target ChromaDB
   ├─ Create/get target collection in your project
   ├─ Match embedding model and metadata
   
3. Transfer documents
   ├─ Batch transfer documents to target collection
   ├─ Verify all documents transferred
   ├─ Handle duplicates (if any)
   
4. Verify merge
   ├─ Count documents match
   ├─ Test retrieval works
   ├─ Validate embeddings are correct
```
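The transfer step of this process might look like the sketch below. It assumes both arguments are ChromaDB `Collection` objects, that every item has a document, and that paging with `limit`/`offset` returns items in a stable order; `copy_collection` is a hypothetical helper name.

```python
def copy_collection(source, target, batch_size=500):
    """Copy ids, documents, metadata, and embeddings in batches.

    Returns the number of items written to the target.
    """
    total = source.count()
    copied = 0
    for offset in range(0, total, batch_size):
        batch = source.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "metadatas", "embeddings"],
        )
        if len(batch["ids"]) == 0:
            break
        # upsert() overwrites items whose id already exists in the target,
        # so re-running an interrupted copy does not create duplicates.
        target.upsert(
            ids=batch["ids"],
            documents=batch["documents"],
            metadatas=batch["metadatas"],
            embeddings=batch["embeddings"],
        )
        copied += len(batch["ids"])
    return copied
```

The source would typically come from `chromadb.PersistentClient(path="/path/to/external/chroma_db").get_collection(name=...)` and the target from `get_or_create_collection(name=...)` on a client opened at `./chroma_db`. Because the stored embeddings are passed through unchanged, nothing is re-embedded during the copy.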

---

### **Approach 3: Using ChromaDB Export/Import**

**Steps:**

1. **Export from external ChromaDB:**
   ```
   - Get all collections
   - For each collection:
     * Get collection metadata
     * Export all documents + embeddings + metadata
     * Save to JSON/Parquet files
   ```

2. **Import to your ChromaDB:**
   ```
   - Create new collection with same metadata
   - Add documents + embeddings + metadata in batches
   - Verify document count and samples
   ```
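As an illustration of the export and import steps, here is a sketch that round-trips one collection through a JSON file. ChromaDB has no built-in export command that this relies on; the file layout and both function names are assumptions, and the whole collection is held in memory, so very large collections would need paging or a more compact format such as Parquet.

```python
import json


def export_collection(collection, path):
    """Write one collection's contents to a JSON file; return the item count."""
    data = collection.get(include=["documents", "metadatas", "embeddings"])
    payload = {
        "name": collection.name,
        "metadata": collection.metadata,
        "ids": list(data["ids"]),
        "documents": data["documents"],
        "metadatas": data["metadatas"],
        # Convert every vector to plain floats so json can serialize it
        # (some chromadb versions return NumPy arrays).
        "embeddings": [[float(x) for x in vec] for vec in data["embeddings"]],
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return len(payload["ids"])


def import_collection(collection, path, batch_size=500):
    """Add the contents of an exported JSON file to a collection in batches."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    ids = payload["ids"]
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=payload["documents"][start:end],
            metadatas=payload["metadatas"][start:end],
            embeddings=payload["embeddings"][start:end],
        )
    return len(ids)
```

The intermediate file is what makes this approach useful when the two environments cannot load each other's databases directly: the export runs under the external ChromaDB version and the import under the project's.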

---

## Handling Potential Issues

### **Issue 1: Different Embedding Models**

**Problem:** The external collection uses an embedding model that is not available in your project

**Solution:**
- Option A: Add the model to `config.py` and ensure it is installed
- Option B: Re-embed the documents with a model the project already supports (requires space and time)
- Option C: Re-embed the documents with the Gemini API (`gemini-embedding-001`) if it is configured

### **Issue 2: Duplicate Collection Names**

**Problem:** External collection has same name as existing collection

**Solution:**
- Rename the external collection before importing
- Or merge into existing collection (combines data)
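One way to avoid a name clash is to derive a fresh name before creating the target collection. A minimal sketch follows; the `_imported` suffix mirrors the example name used later in this guide, and the function name is hypothetical.

```python
def unique_collection_name(desired, existing_names, suffix="_imported"):
    """Return `desired`, or a suffixed variant that is not already taken."""
    existing = set(existing_names)
    if desired not in existing:
        return desired
    candidate = f"{desired}{suffix}"
    counter = 2
    # Keep appending a counter until the name is free.
    while candidate in existing:
        candidate = f"{desired}{suffix}_{counter}"
        counter += 1
    return candidate
```

For instance, `unique_collection_name("medical_docs_dense_mpnet", ["medical_docs_dense_mpnet"])` returns `"medical_docs_dense_mpnet_imported"`. ChromaDB also restricts the length and characters of collection names, so a long generated name may still need shortening.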

### **Issue 3: Different ChromaDB Versions**

**Problem:** External ChromaDB version incompatible with project

**Solution:**
- Export to common format (JSON/CSV)
- Re-import with compatible ChromaDB version
- Update ChromaDB: `pip install --upgrade chromadb`
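To find out whether the versions differ in the first place, the installed version can be read in each environment. The sketch below uses only the standard library (Python 3.8+); the function name is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="chromadb"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

Running `print(installed_version())` in both the external environment and the project environment shows at a glance whether an export/re-import or an upgrade is needed.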

### **Issue 4: Metadata Mismatch**

**Problem:** External collection metadata schema different from project

**Solution:**
- Map external metadata to project metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve original metadata for reference
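The mapping described above could be sketched as follows. The key names in the example (`size`, `chunk_size`, `original_metadata`) are illustrative assumptions; the actual field names depend on both the external collection and the project's schema.

```python
import json


def normalize_metadata(external, defaults, rename=None):
    """Map one item's external metadata onto the project's structure.

    defaults: project fields to fill in when the external item lacks them.
    rename:   optional {external_key: project_key} mapping.
    """
    external = external or {}
    rename = rename or {}
    result = dict(defaults)  # start from the project's defaults
    for key, value in external.items():
        result[rename.get(key, key)] = value
    # ChromaDB metadata values must be scalars, so the original record
    # is preserved as a JSON string rather than a nested dict.
    result["original_metadata"] = json.dumps(external, sort_keys=True)
    return result
```

Calling `normalize_metadata({"size": 512}, {"chunking_strategy": "dense"}, rename={"size": "chunk_size"})` yields a record with `chunk_size`, `chunking_strategy`, and the untouched original under `original_metadata`. Applying it to each item's metadata before the insert keeps imported and native documents consistent.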

---

## Verification Checklist

After merging, verify:

- ✅ Collection appears in "Existing Collections" dropdown in Streamlit
- ✅ Can load collection without errors
- ✅ Document count matches expected total
- ✅ Can query and retrieve documents (test with sample question)
- ✅ Stored embeddings have the expected dimension (e.g., 768 for `all-mpnet-base-v2`)
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on merged collection
- ✅ Both original and imported documents are retrieved with plausible distances (a document queried with its own embedding should come back with a distance near 0)
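Part of this checklist can be automated. The sketch below covers the count, sample, and retrieval checks for a source and target ChromaDB `Collection`; the Streamlit and evaluation checks still need to be done by hand. The function name and the choice of three neighbours are assumptions.

```python
def verify_merge(source, target, sample_size=5):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    src_count, dst_count = source.count(), target.count()
    if dst_count < src_count:
        problems.append(f"target has {dst_count} items, source has {src_count}")

    sample = source.get(limit=sample_size, include=["documents", "embeddings"])
    for item_id, document, embedding in zip(
        sample["ids"], sample["documents"], sample["embeddings"]
    ):
        found = target.get(ids=[item_id], include=["documents"])
        if len(found["ids"]) == 0:
            problems.append(f"id {item_id!r} is missing from target")
            continue
        if found["documents"][0] != document:
            problems.append(f"id {item_id!r} has a different document text")
        # Querying with an item's own vector should return that item among
        # the nearest neighbours. The index is approximate and duplicate
        # texts tie, so look at the top few hits rather than only the first.
        hits = target.query(
            query_embeddings=[[float(x) for x in embedding]],
            n_results=min(3, dst_count),
        )
        if item_id not in hits["ids"][0]:
            problems.append(f"id {item_id!r} not retrieved by its own embedding")
    return problems
```

An empty return value suggests the transfer worked for the sampled items; it is not proof that every item was copied correctly, so a larger `sample_size` is worth the extra time on important collections.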

---

## Quick Reference: Manual Merge Steps

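If the external collection is already in ChromaDB format and your project has no collections you need to keep (these steps replace `./chroma_db`; they do not merge into it):

1. **Back up your current ChromaDB directory:**
   ```
   cp -r ./chroma_db ./chroma_db.backup
   ```

2. **Find external ChromaDB location:**
   ```
   /path/to/external/chroma_db
   ```

3. **Replace your directory with a copy of the external one:**
   ```
   rm -rf ./chroma_db
   cp -r /path/to/external/chroma_db ./chroma_db
   ```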

4. **Restart Streamlit:**
   ```
   streamlit run streamlit_app.py
   ```

5. **Check Collections dropdown:**
   - External collection should now appear

---

## Recommended Merge Approach for Your Project

### **Best Practice: Programmatic Approach**

1. **List all external collections** → identify which to merge
2. **For each external collection:**
   - Export metadata (embedding model, chunking strategy, etc.)
   - Get all documents and embeddings
   - Create target collection in your project with matching metadata
   - Batch insert documents in groups of 100-1000
3. **Validate:** Test retrieval on merged collection
4. **Archive:** Keep backup of external ChromaDB

### **Why This Approach?**
- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)

---

## Example Data Flow

```
External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
    ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
    ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓
```

---

## Summary

| Step | Action | Time | Complexity |
|------|--------|------|-----------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| **Total** | Complete merge | **35-55 min** | **Medium** |

---

## Next Steps

Please provide:
1. **External ChromaDB path:** Where is the external ChromaDB located?
2. **Collection name:** What's the external collection called?
3. **Embedding model:** Which embedding model does it use?
4. **Document count:** Approximately how many documents?
5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)?

Once you provide these details, I can create a specific merge script or detailed guidance tailored to your exact scenario.