Merging External ChromaDB Collections

Overview

This guide explains how to merge a ChromaDB collection created outside your project into your RAG Capstone project's ChromaDB instance.

Prerequisites

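- Source ChromaDB: the external collection's persistence directory must be readable from the machine that runs the merge
- Target ChromaDB: your project's ChromaDB (located at ./chroma_db by default)
- Matching embedding model: both collections must be embedded with the same model. Vectors from different models live in different spaces and often have different dimensions, so similarity scores across them are meaningless
- ChromaDB version compatibility: the source and target should be written and read by compatible ChromaDB versions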

Step-by-Step Merge Process

Step 1: Identify Collection Information

From the External ChromaDB:

- Source directory path: /path/to/external/chroma_db
- Collection name: (e.g., "medical_docs_dense_mpnet")
- Embedding model used: (e.g., "sentence-transformers/all-mpnet-base-v2")
- Chunking strategy: (e.g., "dense", "sparse", "hybrid")
- Chunk size: (e.g., 512)
- Chunk overlap: (e.g., 50)
- Total documents/chunks: ?

From Your Project:

- Target directory: ./chroma_db (default, or configured in settings)
- Existing collections: ?
- Available embedding models: (check config.py)
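Most of the external-collection details listed above can be read from the database itself. The following is a minimal sketch, assuming the chromadb package is installed and the external directory was written by a compatible version. `open_client` and `describe_collections` are hypothetical helper names, not part of this project.

```python
from typing import Any, Dict, List


def open_client(path: str) -> Any:
    """Open a persistent ChromaDB client at `path` (requires `pip install chromadb`)."""
    import chromadb  # imported lazily so describe_collections can be tested without it

    return chromadb.PersistentClient(path=path)


def describe_collections(client: Any) -> List[Dict[str, Any]]:
    """Return the name, record count and metadata of every collection in `client`."""
    summary = []
    for entry in client.list_collections():
        # Depending on the chromadb release, entries are Collection objects or plain names.
        name = entry if isinstance(entry, str) else entry.name
        collection = client.get_collection(name)
        summary.append(
            {
                "name": name,
                "count": collection.count(),
                "metadata": collection.metadata or {},
            }
        )
    return summary
```

Calling `describe_collections(open_client("/path/to/external/chroma_db"))` would then list what the external database holds. The same call against ./chroma_db answers the "Existing collections" question for your project. The embedding model and chunking settings only show up if whoever created the collection stored them in its metadata.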

Step 2: Verify Embedding Model Compatibility

Check if the external collection's embedding model is available in your project:

From config.py, the available embedding models are:

- sentence-transformers/all-mpnet-base-v2
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/multilingual-MiniLM-L12-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- allenai/specter
- gemini-embedding-001

If NOT in the list:

- Add the external embedding model to the embedding_models list in config.py
- Or re-embed all documents with a compatible model (more complex)
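The compatibility check can be automated before any data is moved. This is a rough sketch: the model list is copied from this guide (config.py remains the source of truth), and `check_embedding_compatibility` is a hypothetical helper. The optional dimension check catches the case where the model names match but the stored vectors do not.

```python
from typing import Iterable, List, Optional

# Copied from the list in this guide; the project's config.py is the source of truth.
AVAILABLE_EMBEDDING_MODELS = [
    "sentence-transformers/all-mpnet-base-v2",
    "emilyalsentzer/Bio_ClinicalBERT",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/multilingual-MiniLM-L12-v2",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "allenai/specter",
    "gemini-embedding-001",
]


def check_embedding_compatibility(
    source_model: str,
    available_models: Iterable[str],
    source_dim: Optional[int] = None,
    target_dim: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means no incompatibility was found."""
    problems = []
    if source_model not in set(available_models):
        problems.append(f"model {source_model!r} is not in the embedding_models list")
    # Only compare dimensions when both are known.
    if source_dim is not None and target_dim is not None and source_dim != target_dim:
        problems.append(
            f"embedding dimension mismatch: source {source_dim}, target {target_dim}"
        )
    return problems
```

An empty result only means the names and dimensions line up. It does not prove that both sides used identical model weights or preprocessing.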

Step 3: Prepare the External Collection Data

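Option A: Direct Copy of ChromaDB Directory (Fastest)

1. Locate the external ChromaDB directory
2. Copy the whole directory to ./chroma_db (only safe when ./chroma_db is empty or is being replaced entirely)
3. ChromaDB loads the copied collections on the next start

Note: recent ChromaDB releases (0.4 and later) record all collections in a single chroma.sqlite3 file. Copying an external directory into a ./chroma_db that already holds collections overwrites that file, and the existing collections are lost with it. Use Option B in that case.

Typical directory structure (ChromaDB 0.4 and later; exact contents vary by version):
  ./chroma_db/
    ├── chroma.sqlite3
    └── <segment-uuid>/
        ├── data_level0.bin
        ├── header.bin
        ├── length.bin
        └── link_lists.bin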

Option B: Export and Re-import (Recommended)

Extract all documents and metadata from the external collection, then import them into your collection.

Implementation Approaches

Approach 1: Manual Directory Merge

Steps:

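1. Stop the project (stop the Streamlit app)
2. Back up your current ./chroma_db directory
3. Copy the external collection files to ./chroma_db
4. Restart the project
5. Verify the collection appears in the "Existing Collections" dropdown

Pros: Fast, preserves embeddings
Cons: Overwrites files that already exist in ./chroma_db (including chroma.sqlite3, which records every collection), so existing collections can be lost; name conflicts are also possible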
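Because a directory copy can silently replace an existing database, it may help to wrap the copy in a guard. This sketch uses only the standard library and refuses to copy into a non-empty target. `copy_chroma_dir` is a hypothetical helper name, and Python 3.8 or later is assumed for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def copy_chroma_dir(source: str, target: str) -> Path:
    """Copy a whole ChromaDB directory to `target`, refusing to overwrite existing files."""
    src, dst = Path(source), Path(target)
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    # A non-empty target probably holds a live database; do not touch it.
    if dst.exists() and any(dst.iterdir()):
        raise FileExistsError(
            f"{dst} is not empty; copying would overwrite its files. "
            "Use a programmatic merge instead."
        )
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

The same function can make the backup in step 2, for example `copy_chroma_dir("./chroma_db", "./chroma_db.backup")`, provided the backup directory does not exist yet.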

Approach 2: Programmatic Merge (Recommended)

High-level process:

1. Connect to external ChromaDB
   ├─ Load external collection
   ├─ Extract all documents, embeddings, and metadata
   
2. Prepare target ChromaDB
   ├─ Create/get target collection in your project
   ├─ Match embedding model and metadata
   
3. Transfer documents
   ├─ Batch transfer documents to target collection
   ├─ Verify all documents transferred
   ├─ Handle duplicates (if any)
   
4. Verify merge
   ├─ Count documents match
   ├─ Test retrieval works
   ├─ Validate embeddings are correct
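The transfer and duplicate-handling steps of this process might look like the sketch below. It assumes `source` and `target` are ChromaDB collection objects and relies only on their `get` and `add` methods. `iter_batches` and `copy_collection` are hypothetical helper names, and skipping ids that already exist in the target is one possible duplicate policy, not the only one.

```python
from typing import Any, Dict, Iterator


def iter_batches(collection: Any, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield pages of ids, documents, metadatas and embeddings from a collection."""
    offset = 0
    while True:
        page = collection.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "metadatas", "embeddings"],
        )
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])


def copy_collection(source: Any, target: Any, batch_size: int = 500) -> int:
    """Copy records from `source` into `target`, skipping ids the target already has."""
    copied = 0
    for page in iter_batches(source, batch_size):
        # Ask the target which of this page's ids it already holds.
        existing = set(target.get(ids=page["ids"], include=[])["ids"])
        keep = [i for i, rid in enumerate(page["ids"]) if rid not in existing]
        if not keep:
            continue
        target.add(
            ids=[page["ids"][i] for i in keep],
            documents=[page["documents"][i] for i in keep],
            # Empty metadata dicts are passed as None, which some releases require.
            metadatas=[page["metadatas"][i] or None for i in keep],
            # Plain float lists; some releases return numpy arrays here.
            embeddings=[[float(x) for x in page["embeddings"][i]] for i in keep],
        )
        copied += len(keep)
    return copied
```

With two `chromadb.PersistentClient` instances, one per directory, a call could look like `copy_collection(src_client.get_collection("medical_docs_dense_mpnet"), dst_client.get_or_create_collection("medical_docs_dense_mpnet_imported"))`. Passing the stored embeddings straight through avoids re-embedding, which is only valid when both sides use the same model.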

Approach 3: Using ChromaDB Export/Import

Steps:

Export from external ChromaDB:

- Get all collections
- For each collection:
  * Get collection metadata
  * Export all documents + embeddings + metadata
  * Save to JSON/Parquet files

Import to your ChromaDB:

- Create new collection with same metadata
- Add documents + embeddings + metadata in batches
- Verify document count and samples
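A file-based variant of these steps is sketched here with JSON as the interchange format. The assumptions are that the collection fits in memory and that `collection` exposes the usual `name`, `metadata`, `get` and `add` members. `export_collection` and `import_collection` are hypothetical helpers, since this sketch does not rely on any built-in ChromaDB export command.

```python
import json
from typing import Any


def export_collection(collection: Any, path: str) -> int:
    """Write a collection's metadata and all records to a JSON file; return the count."""
    data = collection.get(include=["documents", "metadatas", "embeddings"])
    payload = {
        "name": collection.name,
        "metadata": collection.metadata or {},
        "ids": list(data["ids"]),
        "documents": list(data["documents"]),
        "metadatas": list(data["metadatas"]),
        # Plain floats keep the payload JSON-serialisable even if numpy arrays come back.
        "embeddings": [[float(x) for x in vec] for vec in data["embeddings"]],
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return len(payload["ids"])


def import_collection(collection: Any, path: str, batch_size: int = 500) -> int:
    """Add records from a file written by export_collection, in batches."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    total = len(payload["ids"])
    for start in range(0, total, batch_size):
        end = start + batch_size
        collection.add(
            ids=payload["ids"][start:end],
            documents=payload["documents"][start:end],
            metadatas=payload["metadatas"][start:end],
            embeddings=payload["embeddings"][start:end],
        )
    return total
```

JSON is convenient for inspection but large for big collections. Parquet would be more compact, at the cost of an extra dependency.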

Handling Potential Issues

Issue 1: Different Embedding Models

Problem: External collection uses embedding model not in your project

Solution:

Option A: Add model to config.py and ensure it's installed
Option B: Re-embed with a compatible model (requires space and time)
Option C: Use Gemini API for embeddings if configured

Issue 2: Duplicate Collection Names

Problem: External collection has same name as existing collection

Solution:

Rename the external collection before importing
Or merge into existing collection (combines data)
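Renaming can be handled at import time by picking a target name that is not yet taken. A small sketch follows; `unique_collection_name` is a hypothetical helper, and the `_imported` suffix mirrors the naming used in the data-flow example in this guide.

```python
from typing import Iterable


def unique_collection_name(
    name: str, existing: Iterable[str], suffix: str = "_imported"
) -> str:
    """Return `name` if unused, else `name + suffix`, adding a counter if still taken."""
    taken = set(existing)
    if name not in taken:
        return name
    candidate = f"{name}{suffix}"
    counter = 2
    while candidate in taken:
        candidate = f"{name}{suffix}_{counter}"
        counter += 1
    return candidate
```

ChromaDB restricts collection names in length and character set, and the limits differ between releases, so a very long source name may need shortening before the suffix is added.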

Issue 3: Different ChromaDB Versions

Problem: External ChromaDB version incompatible with project

Solution:

Export to common format (JSON/CSV)
Re-import with compatible ChromaDB version
Update ChromaDB: pip install --upgrade chromadb
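Before exporting, it can help to know which ChromaDB version is installed and which on-disk layout the external directory uses. The sketch below relies on two assumptions that should be checked against the ChromaDB release notes: releases from 0.4 onward keep a chroma.sqlite3 file in the persistence directory, while older releases wrote .parquet files at the top level. Both function names are hypothetical.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def installed_chromadb_version() -> Optional[str]:
    """Return the installed chromadb version string, or None if it is not installed."""
    try:
        return metadata.version("chromadb")
    except metadata.PackageNotFoundError:
        return None


def detect_storage_format(db_path: str) -> str:
    """Guess the on-disk layout of a ChromaDB directory from the files it contains."""
    root = Path(db_path)
    if (root / "chroma.sqlite3").is_file():
        return "sqlite"  # assumed layout of ChromaDB 0.4 and later
    if any(root.glob("*.parquet")):
        return "legacy-parquet"  # assumed layout of older releases
    return "unknown"
```

A "legacy-parquet" result suggests the directory must be opened with an older ChromaDB version, or migrated, before its collections can be exported.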

Issue 4: Metadata Mismatch

Problem: External collection metadata schema different from project

Solution:

Map external metadata to project metadata structure
Add missing fields (chunking_strategy, chunk_size, etc.)
Preserve original metadata for reference
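The three metadata steps can be combined in one per-record function. In this sketch the field names `chunking_strategy` and `chunk_size` come from this guide, while `normalize_metadata`, the key map, the default values and the `source_` prefix are illustrative assumptions. Original keys are kept as prefixed flat entries because ChromaDB metadata values are scalars, not nested dicts.

```python
from typing import Any, Dict, Optional


def normalize_metadata(
    external: Optional[Dict[str, Any]],
    key_map: Dict[str, str],
    defaults: Dict[str, Any],
    keep_prefix: str = "source_",
) -> Dict[str, Any]:
    """Map external metadata onto the project schema and fill in missing fields."""
    result: Dict[str, Any] = {}
    for key, value in (external or {}).items():
        if key in key_map:
            result[key_map[key]] = value  # renamed to the project's field name
        elif key in defaults:
            result[key] = value  # already uses the project's field name
        else:
            result[f"{keep_prefix}{key}"] = value  # preserved for reference
    for key, value in defaults.items():
        result.setdefault(key, value)  # add fields the source did not provide
    return result
```

For example, with `key_map={"strategy": "chunking_strategy", "size": "chunk_size"}` and defaults for `chunking_strategy`, `chunk_size` and `chunk_overlap`, a record tagged `{"strategy": "dense", "size": 512, "page": 3}` keeps its page number as `source_page` and gains the default `chunk_overlap`.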

Verification Checklist

After merging, verify:

✅ Collection appears in "Existing Collections" dropdown in Streamlit
✅ Can load collection without errors
✅ Document count matches expected total
✅ Can query and retrieve documents (test with sample question)
✅ Retrieved documents carry embeddings of the expected dimension (e.g., 768 for all-mpnet-base-v2)
✅ Metadata is preserved correctly
✅ Evaluation metrics run without errors on merged collection
✅ Both original and imported documents retrieve with correct distances
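The count and content checks in this list can be scripted. A minimal sketch, assuming `source` and `target` are ChromaDB collection objects and that ids were preserved during the merge; `verify_merge` is a hypothetical helper. The Streamlit and evaluation checks still need to be done by hand.

```python
from typing import Any, List


def verify_merge(source: Any, target: Any, sample_size: int = 20) -> List[str]:
    """Compare counts and a sample of documents; return a list of problems found."""
    problems: List[str] = []
    expected = source.count()
    actual = target.count()
    # The target may hold extra records if it existed before the merge.
    if actual < expected:
        problems.append(f"target has {actual} records, expected at least {expected}")
    sample = source.get(limit=sample_size, include=["documents"])
    if not sample["ids"]:
        return problems
    found = target.get(ids=sample["ids"], include=["documents"])
    target_docs = dict(zip(found["ids"], found["documents"]))
    for rid, doc in zip(sample["ids"], sample["documents"]):
        if rid not in target_docs:
            problems.append(f"missing id: {rid}")
        elif target_docs[rid] != doc:
            problems.append(f"document text differs for id: {rid}")
    return problems
```

An empty list means the sampled records arrived intact. It says nothing about retrieval quality, which is why a test query with a sample question is still on the checklist.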

Quick Reference: Manual Merge Steps

If external collection is already in ChromaDB format:

Backup your current collection:
```
cp -r ./chroma_db ./chroma_db.backup
```
Find external ChromaDB location:
```
/path/to/external/chroma_db
```

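Copy collection files:

Copy everything from /path/to/external/chroma_db to ./chroma_db. This is only safe when ./chroma_db is empty; otherwise the copy overwrites existing files such as chroma.sqlite3.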

Restart Streamlit:
```
streamlit run streamlit_app.py
```
Check Collections dropdown:
- External collection should now appear

Recommended Merge Approach for Your Project

Best Practice: Programmatic Approach

1. List all external collections and identify which to merge
2. For each external collection:
   - Export metadata (embedding model, chunking strategy, etc.)
   - Get all documents and embeddings
   - Create a target collection in your project with matching metadata
   - Batch insert documents in groups of 100-1000
3. Validate: test retrieval on the merged collection
4. Archive: keep a backup of the external ChromaDB

Why This Approach?

✅ Safe (no direct file manipulation)
✅ Controllable (can inspect data during transfer)
✅ Traceable (logs what was merged)
✅ Flexible (can transform data if needed)
✅ Recoverable (original external collection untouched)

Example Data Flow

External ChromaDB
├── Collection: "medical_docs_dense_mpnet"
│   ├── 5000 documents
│   ├── Embeddings: 768-dim (all-mpnet-base-v2)
│   └── Metadata: chunking_strategy, chunk_size, etc.
│
└── [Extract documents, embeddings, metadata]
    ↓
Your Project's ChromaDB
├── New Collection: "medical_docs_dense_mpnet_imported"
│   ├── Add 5000 documents in batches
│   ├── Add corresponding embeddings
│   ├── Add matching metadata
│   └── Verify count: 5000 documents ✓
    ↓
Test & Validate
├── Query retrieval works ✓
├── Evaluation metrics compute ✓
└── Merged collection ready for use ✓

Summary

| Step | Action | Time | Complexity |
|------|--------|------|------------|
| 1 | Identify collection info | 5 min | Low |
| 2 | Verify embedding model | 5 min | Low |
| 3 | Backup current data | 5 min | Low |
| 4 | Perform merge | 10-30 min | Medium |
| 5 | Verify merge success | 10 min | Medium |
| Total | Complete merge | 35-55 min | Medium |

Next Steps

Before writing a merge script, collect the following details:

- External ChromaDB path: where is the external ChromaDB located?
- Collection name: what is the external collection called?
- Embedding model: which embedding model does it use?
- Document count: approximately how many documents does it hold?
- Metadata: what metadata is stored (chunking strategy, chunk size, etc.)?

With these details in hand, a merge script can be tailored to the exact scenario.