# Merging External ChromaDB Collections
## Overview
This guide explains how to merge a ChromaDB collection created outside this project into the RAG Capstone project's own ChromaDB instance.
## Prerequisites
1. **Source ChromaDB**: The external collection must be accessible
2. **Target ChromaDB**: Your project's ChromaDB (located at `./chroma_db` by default)
3. **Matching Embedding Model**: Both collections should use the same embedding model for consistency
4. **ChromaDB Version Compatibility**: Ensure both are using compatible ChromaDB versions
---
## Step-by-Step Merge Process
### **Step 1: Identify Collection Information**
**From the External ChromaDB:**
```
- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?
```
**From Your Project:**
```
- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
```
### **Step 2: Verify Embedding Model Compatibility**
**Check if the external collection's embedding model is available in your project:**
From `config.py`, the available embedding models are:
```
- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001
```
**If NOT in the list:**
- Add the external embedding model to `config.py` embedding_models list
- Or re-embed all documents with a compatible model (more complex)
### **Step 3: Prepare the External Collection Data**
**Option A: Direct Copy of ChromaDB Directory** (Fastest)
```
1. Locate the external ChromaDB directory
2. Copy the external store's files into your ./chroma_db directory
3. ChromaDB will load them on the next start

Typical directory structure (ChromaDB >= 0.4):
./chroma_db/
├── chroma.sqlite3        <- collections, documents, and metadata
└── <segment-uuid>/       <- vector index files, one directory per segment
    ├── data_level0.bin
    ├── header.bin
    ├── length.bin
    └── link_lists.bin
```
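The copy-and-backup steps above can be scripted with the standard library. In this sketch the source and target paths are placeholders created under a temp directory so the snippet runs as-is; in practice, point `external` and `target` at the real stores. Because each store keeps its own `chroma.sqlite3`, this wholesale copy is only safe when the target directory is empty or you intend to replace it:

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# Stand-ins for the real directories.
external = root / "external_chroma_db"
target = root / "chroma_db"
external.mkdir()
(external / "chroma.sqlite3").write_bytes(b"")  # placeholder store file

# 1. Back up the current target store, if one exists.
if target.exists():
    shutil.copytree(target, root / "chroma_db.backup")

# 2. Copy the external store into place (replaces the target wholesale).
shutil.copytree(external, target, dirs_exist_ok=True)

print(sorted(p.name for p in target.iterdir()))  # -> ['chroma.sqlite3']
```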
**Option B: Export and Re-import** (Recommended)
Extract all documents, embeddings, and metadata from the external collection, then import them into a collection in your project.
---
## Implementation Approaches
### **Approach 1: Manual Directory Merge**
**Steps:**
1. Stop the project (stop Streamlit app)
2. Back up your current `./chroma_db` directory
3. Copy external collection files to `./chroma_db`
4. Restart the project
5. Verify collection appears in "Existing Collections" dropdown
**Pros:** Fast, preserves embeddings
**Cons:** Each store has its own `chroma.sqlite3`, so a raw file copy replaces the target database rather than merging into it; it is only safe when `./chroma_db` is empty or being replaced entirely, and collection-name conflicts are possible
---
### **Approach 2: Programmatic Merge (Recommended)**
**High-level process:**
```
1. Connect to external ChromaDB
├─ Load external collection
├─ Extract all documents, embeddings, and metadata
2. Prepare target ChromaDB
├─ Create/get target collection in your project
├─ Match embedding model and metadata
3. Transfer documents
├─ Batch transfer documents to target collection
├─ Verify all documents transferred
├─ Handle duplicates (if any)
4. Verify merge
├─ Count documents match
├─ Test retrieval works
├─ Validate embeddings are correct
```
---
### **Approach 3: Using ChromaDB Export/Import**
**Steps:**
1. **Export from external ChromaDB:**
```
- Get all collections
- For each collection:
* Get collection metadata
* Export all documents + embeddings + metadata
* Save to JSON/Parquet files
```
2. **Import to your ChromaDB:**
```
- Create new collection with same metadata
- Add documents + embeddings + metadata in batches
- Verify document count and samples
```
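The export file format is up to you; a plain JSON dump of the dict shape that `collection.get(include=[...])` returns (`ids`, `documents`, `embeddings`, `metadatas`) is version-agnostic and survives ChromaDB upgrades. A sketch of the round trip, with the payload hardcoded so the snippet is self-contained:

```python
import json
import tempfile
from pathlib import Path

# Shape returned by collection.get(include=["documents", "embeddings", "metadatas"]),
# plus the collection's own name and metadata.
export = {
    "collection": "medical_docs_dense_mpnet",
    "metadata": {"embedding_model": "sentence-transformers/all-mpnet-base-v2"},
    "ids": ["doc-1"],
    "documents": ["aspirin reduces fever"],
    "embeddings": [[0.1, 0.2, 0.3]],
    "metadatas": [{"chunk_size": 512}],
}

path = Path(tempfile.mkdtemp()) / "export.json"
path.write_text(json.dumps(export))

# On the import side, feed these fields straight into collection.add(...).
restored = json.loads(path.read_text())
print(restored["collection"], len(restored["ids"]))
```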
---
## Handling Potential Issues
### **Issue 1: Different Embedding Models**
**Problem:** External collection uses embedding model not in your project
**Solution:**
- Option A: Add model to `config.py` and ensure it's installed
- Option B: Re-embed with a compatible model (requires space and time)
- Option C: Use Gemini API for embeddings if configured
### **Issue 2: Duplicate Collection Names**
**Problem:** External collection has same name as existing collection
**Solution:**
- Rename the external collection before importing
- Or merge into existing collection (combines data)
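When merging into an existing collection, document IDs can collide as well as collection names. One simple convention (an assumption, not something ChromaDB or this project enforces) is to namespace imported IDs with a source prefix and skip exact duplicates; returning positions lets the caller slice the matching documents and embeddings too:

```python
def namespace_ids(ids, existing_ids, prefix="ext"):
    """Return (position, new_id) pairs for imports that don't collide."""
    existing = set(existing_ids)
    return [
        (i, f"{prefix}:{doc_id}")
        for i, doc_id in enumerate(ids)
        if f"{prefix}:{doc_id}" not in existing
    ]

keep = namespace_ids(["doc-1", "doc-2"], existing_ids=["ext:doc-2"])
print(keep)  # [(0, 'ext:doc-1')] -- doc-2 already exists in the target, so it is skipped
```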
### **Issue 3: Different ChromaDB Versions**
**Problem:** External ChromaDB version incompatible with project
**Solution:**
- Export to common format (JSON/CSV)
- Re-import with compatible ChromaDB version
- Update ChromaDB: `pip install --upgrade chromadb`
### **Issue 4: Metadata Mismatch**
**Problem:** External collection metadata schema different from project
**Solution:**
- Map external metadata to project metadata structure
- Add missing fields (chunking_strategy, chunk_size, etc.)
- Preserve original metadata for reference
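A sketch of that mapping. The target keys (`chunking_strategy`, `chunk_size`, `chunk_overlap`) are the project fields named earlier; the external key names in `FIELD_MAP` are hypothetical and would need to match the real external schema. Unmapped originals are kept under a `source_` prefix for reference:

```python
# Hypothetical external keys -> project keys.
FIELD_MAP = {"strategy": "chunking_strategy", "size": "chunk_size", "overlap": "chunk_overlap"}
DEFAULTS = {"chunking_strategy": "dense", "chunk_size": 512, "chunk_overlap": 50}

def map_metadata(external_meta):
    """Map one external metadata dict onto the project schema."""
    mapped = dict(DEFAULTS)
    for key, value in external_meta.items():
        if key in FIELD_MAP:
            mapped[FIELD_MAP[key]] = value
        else:
            mapped[f"source_{key}"] = value  # preserve unmapped originals
    return mapped

print(map_metadata({"strategy": "hybrid", "author": "ext"}))
# -> {'chunking_strategy': 'hybrid', 'chunk_size': 512, 'chunk_overlap': 50, 'source_author': 'ext'}
```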
---
## Verification Checklist
After merging, verify:
- ✅ Collection appears in "Existing Collections" dropdown in Streamlit
- ✅ Can load collection without errors
- ✅ Document count matches expected total
- ✅ Can query and retrieve documents (test with sample question)
- ✅ Retrieved documents have correct embeddings
- ✅ Metadata is preserved correctly
- ✅ Evaluation metrics run without errors on merged collection
- ✅ Both original and imported documents retrieve with correct distances
---
## Quick Reference: Manual Merge Steps
If external collection is already in ChromaDB format:
1. **Backup your current collection:**
```
cp -r ./chroma_db ./chroma_db.backup
```
2. **Find external ChromaDB location:**
```
/path/to/external/chroma_db
```
3. **Copy collection files:**
```
cp -r /path/to/external/chroma_db/* ./chroma_db/
```
(Only safe if `./chroma_db` is empty; otherwise the external `chroma.sqlite3` collides with your own.)
4. **Restart Streamlit:**
```
streamlit run streamlit_app.py
```
5. **Check Collections dropdown:**
- External collection should now appear
---
## Recommended Merge Approach for Your Project
### **Best Practice: Programmatic Approach**
1. **List all external collections** → identify which to merge
2. **For each external collection:**
- Export metadata (embedding model, chunking strategy, etc.)
- Get all documents and embeddings
- Create target collection in your project with matching metadata
- Batch insert documents in groups of 100-1000
3. **Validate:** Test retrieval on merged collection
4. **Archive:** Keep backup of external ChromaDB
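The batch insert in step 2 can be driven by a small generator; batch size is the only knob, and the 100-1000 range above is the guideline:

```python
def batches(items, batch_size=500):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f"doc-{i}" for i in range(1200)]
sizes = [len(b) for b in batches(ids, batch_size=500)]
print(sizes)  # [500, 500, 200]
```

Call `target.add(...)` once per slice, slicing `ids`, `documents`, `embeddings`, and `metadatas` in lockstep.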
### **Why This Approach?**
- ✅ Safe (no direct file manipulation)
- ✅ Controllable (can inspect data during transfer)
- ✅ Traceable (logs what was merged)
- ✅ Flexible (can transform data if needed)
- ✅ Recoverable (original external collection untouched)
---
## Example Data Flow
```
External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
        ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
        ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓
```
---
## Summary
| Step | Action | Time | Complexity |
|------|--------|------|-----------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| **Total** | Complete merge | **35-55 min** | **Medium** |
---
## Next Steps
Before writing a merge script, gather the following about the external collection:
1. **External ChromaDB path:** Where the external ChromaDB is located
2. **Collection name:** What the external collection is called
3. **Embedding model:** Which embedding model it uses
4. **Document count:** Approximately how many documents it holds
5. **Metadata:** What metadata is stored (chunking strategy, chunk size, etc.)
With these details in hand, adapt the programmatic approach above into a merge script for your exact scenario.