# Merging External ChromaDB Collections
## Overview
This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance.
## Prerequisites
1. **Source ChromaDB**: The external collection must be accessible
2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default)
3. **Matching Embedding Model**: Both collections must use the same embedding model, because vectors produced by different models are not comparable at query time
4. **ChromaDB Version Compatibility**: Ensure both are using compatible ChromaDB versions
---
## Step-by-Step Merge Process
### **Step 1: Identify Collection Information**
**From the External ChromaDB:**
```
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
```
**From Your Project:**
```
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
```
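The checklist above can be filled in partly by inspecting the external store itself. The following is a minimal sketch, assuming the external data was written with ChromaDB's on-disk `PersistentClient`; the helper names `summarize_collections` and `open_client` are illustrative and not part of the project. Depending on the ChromaDB release, `list_collections()` returns either collection objects or plain names, so the sketch accepts both.

```python
from typing import Any


def summarize_collections(client: Any) -> list:
    """Return name, item count, and metadata for every collection in a client."""
    summaries = []
    for entry in client.list_collections():
        # Some ChromaDB releases yield Collection objects, others plain names.
        name = entry if isinstance(entry, str) else entry.name
        collection = client.get_collection(name=name)
        summaries.append(
            {
                "name": name,
                "count": collection.count(),
                "metadata": collection.metadata or {},
            }
        )
    return summaries


def open_client(path: str) -> Any:
    """Open an on-disk ChromaDB directory (chromadb is imported lazily)."""
    import chromadb

    return chromadb.PersistentClient(path=path)
```

Calling `summarize_collections(open_client("/path/to/external/chroma_db"))` and then the same for `./chroma_db` would give the collection names and counts for both sides. The embedding model and chunking settings only show up if whoever created the collection stored them in its metadata.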
### **Step 2: Verify Embedding Model Compatibility**
**Check if the external collection's embedding model is available in your project:**
From `config.py`, the available embedding models are:
```
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
```
**If NOT in the list:**
- Add the external embedding model to `config.py` embedding_models list
- Or re-embed all documents with a compatible model (more complex)
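The compatibility check can be sketched as a membership test plus a look at the stored vector size. This is only an illustration: `EMBEDDING_MODELS` is a stand-in for the list in `config.py` (copied from the list above), and `is_supported` and `stored_dimension` are hypothetical helper names. A matching dimension does not prove the same model was used, but a mismatch rules it out.

```python
from typing import Any, Optional, Sequence

# Stand-in for the project's config.py list (assumption: copied from this guide).
EMBEDDING_MODELS = [
    "sentence-transformers/all-mpnet-base-v2",
    "emilyalsentzer/Bio_ClinicalBERT",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/multilingual-MiniLM-L12-v2",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "allenai/specter",
    "gemini-embedding-001",
]


def is_supported(model_name: str, available: Sequence[str] = EMBEDDING_MODELS) -> bool:
    """True if the external collection's model is already known to the project."""
    return model_name in available


def stored_dimension(collection: Any) -> Optional[int]:
    """Return the length of one stored embedding, or None for an empty collection."""
    sample = collection.get(limit=1, include=["embeddings"])
    embeddings = sample.get("embeddings")
    # Embeddings may be a list or a NumPy array, so avoid plain truthiness checks.
    if embeddings is None or len(embeddings) == 0:
        return None
    return len(embeddings[0])
```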
### **Step 3: Prepare the External Collection Data**
**Option A: Direct Copy of ChromaDB Directory** (Fastest)
```
1. Locate external ChromaDB directory structure
2. Copy the external collection files to your ./chroma_db directory
3. ChromaDB will recognize and load them
Directory structure (illustrative; the actual layout depends on the ChromaDB version):
./chroma_db/
├── 0/
│   ├── data/
│   │   ├── documents.parquet
│   │   ├── embeddings.parquet
│   │   └── metadatas.parquet
│   └── chroma.sqlite3
```
**Option B: Export and Re-import** (Recommended)
Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project.
---
## Implementation Approaches
### **Approach 1: Manual Directory Merge**
**Steps:**
1. Stop the project (stop Streamlit app)
2. Back up your current `./chroma_db` directory
3. Copy external collection files to `./chroma_db`
4. Restart the project
5. Verify collection appears in "Existing Collections" dropdown
**Pros:** Fast, preserves embeddings
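**Cons:** Risk of conflicts if the same collection name exists. Recent ChromaDB versions also keep all collections in a single `chroma.sqlite3` file, so copying into a non-empty `./chroma_db` can overwrite your existing collections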
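The backup and copy steps can be sketched with the standard library alone. The function names `backup_directory` and `copy_into_empty_target` are illustrative. The second helper deliberately refuses to copy into a directory that already contains files, on the assumption that a plain file copy cannot combine two stores.

```python
import shutil
import time
from pathlib import Path


def backup_directory(db_dir: str) -> Path:
    """Copy db_dir to a timestamped sibling directory and return the backup path."""
    source = Path(db_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.backup-{stamp}")
    shutil.copytree(source, backup)
    return backup


def copy_into_empty_target(external_dir: str, target_dir: str) -> None:
    """Copy an external ChromaDB directory, refusing to overwrite existing data."""
    target = Path(target_dir)
    if target.exists() and any(target.iterdir()):
        raise FileExistsError(f"{target} is not empty; use a programmatic merge instead")
    # dirs_exist_ok lets the copy proceed when the empty target directory exists.
    shutil.copytree(external_dir, target, dirs_exist_ok=True)
```

Run both with the Streamlit app stopped, so that no process holds the database files open during the copy.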
---
### **Approach 2: Programmatic Merge (Recommended)**
**High-level process:**
```
1. Connect to external ChromaDB
   ├─ Load external collection
   └─ Extract all documents, embeddings, and metadata
2. Prepare target ChromaDB
   ├─ Create/get target collection in your project
   └─ Match embedding model and metadata
3. Transfer documents
   ├─ Batch transfer documents to target collection
   ├─ Verify all documents transferred
   └─ Handle duplicates (if any)
4. Verify merge
   ├─ Count documents match
   ├─ Test retrieval works
   └─ Validate embeddings are correct
```
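The process above could look roughly like the sketch below. It assumes both stores are opened with `chromadb.PersistentClient` and that paging through `get()` with `limit` and `offset` returns items in a stable order; `copy_collection` and `open_collections` are illustrative names. Because the stored embeddings are passed along explicitly, nothing is re-embedded during the transfer. Entries without metadata may need extra handling, depending on the ChromaDB version.

```python
from typing import Any, Tuple


def copy_collection(source: Any, target: Any, batch_size: int = 500) -> int:
    """Copy ids, documents, embeddings, and metadata from source to target in batches."""
    copied = 0
    for offset in range(0, source.count(), batch_size):
        batch = source.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "embeddings", "metadatas"],
        )
        if not batch["ids"]:
            break
        # upsert() replaces entries whose id already exists, so a rerun does not
        # create duplicates (but it also overwrites same-id entries in the target).
        target.upsert(
            ids=batch["ids"],
            documents=batch["documents"],
            embeddings=batch["embeddings"],
            metadatas=batch["metadatas"],
        )
        copied += len(batch["ids"])
    return copied


def open_collections(
    external_path: str, project_path: str, source_name: str, target_name: str
) -> Tuple[Any, Any]:
    """Open the source collection and create (or fetch) the target collection."""
    import chromadb  # imported lazily; requires the chromadb package

    source = chromadb.PersistentClient(path=external_path).get_collection(name=source_name)
    # Reusing the source metadata keeps settings stored there by the creator.
    target = chromadb.PersistentClient(path=project_path).get_or_create_collection(
        name=target_name, metadata=source.metadata or None
    )
    return source, target
```

A run might open both sides with `open_collections("/path/to/external/chroma_db", "./chroma_db", "medical_docs_dense_mpnet", "medical_docs_dense_mpnet_imported")` and pass the pair to `copy_collection`. The returned number can then be compared with `source.count()`.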
---
### **Approach 3: Using ChromaDB Export/Import**
**Steps:**
1. **Export from external ChromaDB:**
```
- Get all collections
- For each collection:
  * Get collection metadata
  * Export all documents + embeddings + metadata
  * Save to JSON/Parquet files
```
2. **Import to your ChromaDB:**
```
- Create new collection with same metadata
- Add documents + embeddings + metadata in batches
- Verify document count and samples
```
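One way to realize the export/import route is a JSON Lines file with one record per chunk. The sketch below is an assumption-laden illustration rather than a built-in ChromaDB feature: the file format and the names `export_collection` and `import_collection` are made up for this guide, and the import reads the whole file into memory, which is fine for a few thousand chunks but not for very large stores.

```python
import json
from typing import Any


def export_collection(collection: Any, path: str, batch_size: int = 500) -> int:
    """Write every item of a collection to a JSON Lines file; return the item count."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for offset in range(0, collection.count(), batch_size):
            batch = collection.get(
                limit=batch_size,
                offset=offset,
                include=["documents", "embeddings", "metadatas"],
            )
            for i, item_id in enumerate(batch["ids"]):
                record = {
                    "id": item_id,
                    "document": batch["documents"][i],
                    # Embeddings may be NumPy arrays; convert to plain floats for JSON.
                    "embedding": [float(x) for x in batch["embeddings"][i]],
                    "metadata": batch["metadatas"][i],
                }
                handle.write(json.dumps(record) + "\n")
                written += 1
    return written


def import_collection(collection: Any, path: str, batch_size: int = 500) -> int:
    """Add the records of a JSON Lines export to a collection in batches."""
    with open(path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        collection.add(
            ids=[r["id"] for r in chunk],
            documents=[r["document"] for r in chunk],
            embeddings=[r["embedding"] for r in chunk],
            metadatas=[r["metadata"] for r in chunk],
        )
    return len(records)
```

Because the file contains only plain JSON, the export can run in an environment with the external store's ChromaDB version and the import in the project's environment.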
---
## Handling Potential Issues
### **Issue 1: Different Embedding Models**
**Problem:** External collection uses embedding model not in your project
**Solution:**
- Option A: Add model to `config.py` and ensure it's installed
- Option B: Re-embed with a compatible model (requires space and time)
- Option C: Use Gemini API for embeddings if configured
### **Issue 2: Duplicate Collection Names**
**Problem:** External collection has same name as existing collection
**Solution:**
- Rename the external collection before importing
- Or merge into existing collection (combines data)
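Both options can be sketched in a few lines. `unique_collection_name` picks a free name for the renamed import, and `prefix_ids` is for merging into an existing collection, where ids from the two sources might otherwise collide. Both helper names and the `_imported` suffix are illustrative choices, not project conventions.

```python
from typing import Iterable, List


def unique_collection_name(name: str, existing: Iterable[str], suffix: str = "_imported") -> str:
    """Return name unchanged if it is free, otherwise a suffixed variant that is."""
    taken = set(existing)
    if name not in taken:
        return name
    candidate = f"{name}{suffix}"
    counter = 2
    while candidate in taken:
        candidate = f"{name}{suffix}_{counter}"
        counter += 1
    return candidate


def prefix_ids(ids: Iterable[str], prefix: str) -> List[str]:
    """Namespace ids so imported entries cannot overwrite existing ones."""
    return [f"{prefix}{item_id}" for item_id in ids]
```

For example, `unique_collection_name("medical_docs_dense_mpnet", ["medical_docs_dense_mpnet"])` returns `"medical_docs_dense_mpnet_imported"`.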
### **Issue 3: Different ChromaDB Versions**
**Problem:** External ChromaDB version incompatible with project
**Solution:**
- Export to common format (JSON/CSV)
- Re-import with compatible ChromaDB version
- Update ChromaDB: `pip install --upgrade chromadb`
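Before choosing between a direct copy and an export, it can help to know which ChromaDB version each environment has installed. The sketch below uses only the standard library; the function names are illustrative, and comparing only the major and minor parts is a rough heuristic rather than a documented compatibility rule.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str = "chromadb") -> Optional[str]:
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_minor(version_string: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version such as '0.4.22'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])
```

Running `installed_version()` in both environments and comparing `major_minor` of the results gives a first hint. If they differ, the export and re-import route is the safer choice.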
### **Issue 4: Metadata Mismatch**
**Problem:** External collection metadata schema different from project
**Solution:**
- Map external metadata to project metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve original metadata for reference
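A metadata mapping along these lines could be applied to each entry before it is inserted. This is a sketch: `map_metadata`, the `original_metadata` key, and the external key names in the example are hypothetical, while `chunking_strategy` and `chunk_size` are the project fields mentioned above. The original metadata is stored as a JSON string on the assumption that metadata values have to be simple scalars.

```python
import json
from typing import Any, Dict, Optional


def map_metadata(
    external: Optional[Dict[str, Any]],
    rename: Dict[str, str],
    defaults: Dict[str, Any],
) -> Dict[str, Any]:
    """Translate one external metadata dict into the project's schema."""
    external = external or {}
    # Start from the defaults so fields missing in the source are still present.
    mapped = dict(defaults)
    for key, value in external.items():
        mapped[rename.get(key, key)] = value
    if external:
        # Keep the untouched original for reference, serialized to a string.
        mapped["original_metadata"] = json.dumps(external, sort_keys=True)
    return mapped
```

For instance, `map_metadata({"strategy": "dense"}, {"strategy": "chunking_strategy"}, {"chunk_size": 512})` yields a dict with `chunking_strategy` set to `"dense"`, `chunk_size` set to `512`, and the original stored under `original_metadata`.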
---
## Verification Checklist
After merging, verify:
- ✅ Collection appears in "Existing Collections" dropdown in Streamlit
- ✅ Can load collection without errors
- ✅ Document count matches expected total
- ✅ Can query and retrieve documents (test with sample question)
- ✅ Retrieved documents have correct embeddings
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on merged collection
- ✅ Both original and imported documents retrieve with correct distances
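The count and sample checks in this list can be automated with a small helper. The sketch below assumes the external collection was copied into a fresh target collection with unchanged ids (when merging into an existing collection the counts will differ by design); `verify_merge` is an illustrative name. The Streamlit dropdown and the evaluation metrics still need to be checked by hand.

```python
from typing import Any, List


def verify_merge(source: Any, target: Any, sample_size: int = 5) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    source_count, target_count = source.count(), target.count()
    if source_count != target_count:
        problems.append(f"count mismatch: source={source_count}, target={target_count}")

    # Spot-check a few entries by id rather than relying on result order.
    sample = source.get(limit=sample_size, include=["documents"])
    if sample["ids"]:
        copied = target.get(ids=sample["ids"], include=["documents"])
        found = dict(zip(copied["ids"], copied["documents"]))
        for item_id, document in zip(sample["ids"], sample["documents"]):
            if item_id not in found:
                problems.append(f"missing id in target: {item_id}")
            elif found[item_id] != document:
                problems.append(f"document text differs for id: {item_id}")
    return problems
```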
---
## Quick Reference: Manual Merge Steps
If external collection is already in ChromaDB format:
1. **Back up your current ChromaDB directory:**
```
cp -r ./chroma_db ./chroma_db.backup
```
2. **Find external ChromaDB location:**
```
/path/to/external/chroma_db
```
3. **Copy collection files:**
```
# Only safe when ./chroma_db is empty: existing files such as chroma.sqlite3 are overwritten
cp -r /path/to/external/chroma_db/. ./chroma_db/
```
4. **Restart Streamlit:**
```
streamlit run streamlit_app.py
```
5. **Check Collections dropdown:**
- External collection should now appear
---
## Recommended Merge Approach for Your Project
### **Best Practice: Programmatic Approach**
1. **List all external collections** → identify which to merge
2. **For each external collection:**
- Export metadata (embedding model, chunking strategy, etc.)
- Get all documents and embeddings
- Create target collection in your project with matching metadata
- Batch insert documents in groups of 100-1000
3. **Validate:** Test retrieval on merged collection
4. **Archive:** Keep backup of external ChromaDB
### **Why This Approach?**
- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)
---
## Example Data Flow
```
External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
        ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
        ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓
```
---
## Summary
| Step | Action | Time | Complexity |
|------|--------|------|-----------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| **Total** | Complete merge | **35-55 min** | **Medium** |
---
## Next Steps
Please provide:
1. **External ChromaDB path:** Where is the external ChromaDB located?
2. **Collection name:** What's the external collection called?
3. **Embedding model:** Which embedding model does it use?
4. **Document count:** Approximately how many documents?
5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)?
Once you provide these details, I can create a specific merge script or detailed guidance tailored to your exact scenario.