# VedaMD Document Pipeline Guide
**Complete guide for adding and managing medical documents in VedaMD**
---
## Table of Contents
1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Building Vector Store from Scratch](#building-vector-store-from-scratch)
4. [Adding Single Documents](#adding-single-documents)
5. [Updating Existing Documents](#updating-existing-documents)
6. [Uploading to Hugging Face](#uploading-to-hugging-face)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
---
## Overview
### What is the Pipeline?
The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.
**Before Pipeline** (Manual Process):
```
PDF → Extract Text → Chunk  → Embed  → Build FAISS → Upload to HF
        manual       script   script   external      manual
        work         needed   needed   tool          upload

Total time: hours
```
**With Pipeline** (Automated):
```
PDF → python scripts/add_document.py --file file.pdf → Done ✅

Total time: minutes
```
### Pipeline Components
1. **build_vector_store.py** - Build complete vector store from directory of PDFs
2. **add_document.py** - Add single documents to an existing vector store
3. **Automatic Features**:
- PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
- Smart medical chunking
- Duplicate detection
- Quality validation
- HF Hub integration
- Automatic backups
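Duplicate detection works from a content hash of the file (the `add_document.py` logs print a file hash before checking). A minimal sketch of the idea, assuming a SHA-256 digest over the raw bytes; the script's exact hash function and bookkeeping may differ:

```python
import hashlib


def file_hash(path: str) -> str:
    """Hash the raw bytes of a file so re-adding the identical PDF is detected."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()


def is_duplicate(path: str, known_hashes: set[str]) -> bool:
    """True if this exact file content has been added before."""
    return file_hash(path) in known_hashes
```

Identical bytes always collide, but a re-exported PDF with the same text and different bytes will not, which is why a `--no-duplicate-check` escape hatch exists.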
---
## Quick Start
### Prerequisites
All required packages are already installed in your `.venv`:
- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)
### 30-Second Test
```bash
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate
# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
# That's it! ✅
```
---
## Building Vector Store from Scratch
### Basic Usage
Build a vector store from all PDFs in a directory:
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
**Expected output:**
```
🚀 STARTING VECTOR STORE BUILD
============================================================
🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
📄 Breech.pdf
📄 RhESUS.pdf
... (13 more)
============================================================
📄 Processing: Breech.pdf
============================================================
📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
============================================================
📊 Summary:
• PDFs processed: 15
• Total chunks: 247
• Embedding dimension: 384
• Output directory: ./data/vector_store
• Build time: 45.23 seconds
============================================================
```
### Customizing Chunk Size
For longer/shorter chunks:
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--chunk-size 1500 \
--chunk-overlap 150
```
**Recommendations:**
- **chunk-size**: 800-1200 (default: 1000)
- **chunk-overlap**: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context
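The precision/context trade-off can be seen in a minimal character-window chunker with overlap. This is a sketch of the mechanism only; the script's "smart medical chunking" presumably also respects sentence and section boundaries, so its chunk counts will differ:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows; each chunk repeats the last
    `overlap` characters of the previous one so context spans boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

With the defaults, each step advances 900 characters, so roughly 10% of every chunk is shared with its neighbor; raising `overlap` increases redundancy but makes boundary-straddling facts retrievable from either side.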
### Using Different Embedding Model
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--embedding-model "sentence-transformers/all-mpnet-base-v2"
```
**Available models:**
- `all-MiniLM-L6-v2` (default) - Fast, 384d, good quality
- `all-mpnet-base-v2` - Better quality, 768d, slower
- `multi-qa-mpnet-base-dot-v1` - Optimized for Q&A
### Build and Upload to HF
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
**Note**: Requires `HF_TOKEN` environment variable or `--hf-token` argument
---
## Adding Single Documents
### Basic Usage
Add a new guideline to an existing vector store:
```bash
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "SLCOG Hypertension Guidelines 2025" \
--category "Obstetrics" \
--vector-store-dir ./data/vector_store
```
**Expected output:**
```
============================================================
📄 Adding document: new_guideline.pdf
============================================================
📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors
============================================================
💾 Saving updated vector store...
============================================================
📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
============================================================
📊 Summary:
• Chunks added: 14
• Total vectors: 261
• Time taken: 8.43 seconds
============================================================
```
### Add and Upload to HF
```bash
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "WHO Guidelines 2025" \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### Allow Duplicates
By default, duplicate detection is enabled. To force add:
```bash
python scripts/add_document.py \
--file ./updated_guideline.pdf \
--vector-store-dir ./data/vector_store \
--no-duplicate-check
```
---
## Updating Existing Documents
To update an existing guideline:
1. **Add new version** (recommended):
```bash
python scripts/add_document.py \
--file ./guidelines_v2.pdf \
--citation "SLCOG Hypertension Guidelines 2025 v2" \
--vector-store-dir ./data/vector_store
```
2. **Rebuild from scratch** (if major changes):
```bash
# Move old PDFs to archive
mkdir -p Obs/archive
mv Obs/old_guideline.pdf Obs/archive/
# Add new version
cp ~/Downloads/new_guideline.pdf Obs/
# Rebuild
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
---
## Uploading to Hugging Face
### Setup HF Token
```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"
# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
```
### Initial Upload
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### Incremental Upload
After adding a document:
```bash
python scripts/add_document.py \
--file ./new.pdf \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### What Gets Uploaded
- ✅ `faiss_index.bin` - FAISS vector index
- ✅ `documents.json` - Document chunks
- ✅ `metadata.json` - Citations, sources, sections
- ✅ `config.json` - Configuration settings
- ✅ `build_log.json` - Build information
---
## Advanced Usage
### Batch Processing Multiple Files
```bash
# Create a script to add multiple files
for pdf in new_guidelines/*.pdf; do
python scripts/add_document.py \
--file "$pdf" \
--citation "$(basename "$pdf" .pdf)" \
--vector-store-dir ./data/vector_store
done
# Then upload the updated store once. Re-running add_document.py with a
# placeholder PDF would insert that placeholder into the store, so push
# the folder directly instead:
python -c "from huggingface_hub import upload_folder; upload_folder(folder_path='data/vector_store', repo_id='sniro23/VedaMD-Vector-Store', repo_type='dataset')"
```
### Inspecting Vector Store
```bash
# View config
cat data/vector_store/config.json
# View build log
cat data/vector_store/build_log.json | python -m json.tool
# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"
# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
```
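The one-liners above can be collected into a small inspection helper. This assumes, as those commands imply, that `documents.json` is a JSON list of chunk strings and `metadata.json` a parallel list of dicts with a `source` key:

```python
import json
from collections import Counter
from pathlib import Path


def summarize_store(store_dir: str) -> dict:
    """Return chunk count and per-source chunk counts for a vector store dir."""
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text())
    metadata = json.loads((store / "metadata.json").read_text())
    return {
        "chunks": len(documents),
        "sources": Counter(m["source"] for m in metadata),
    }
```

Running `summarize_store("data/vector_store")` after a build should report the same chunk total as the build summary.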
### Backup Management
Backups are created automatically in `data/vector_store/backups/`:
```bash
# List backups
ls -lh data/vector_store/backups/
# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
```
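For an extra snapshot outside the automatic flow, the same timestamped layout can be reproduced in a few lines. A sketch, assuming backups live under `backups/YYYYMMDD_HHMMSS/` as the save logs show:

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_store(store_dir: str) -> Path:
    """Copy the store's top-level files into backups/<timestamp>/ and return that path."""
    store = Path(store_dir)
    dest = store / "backups" / datetime.now().strftime("%Y%m%d_%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for item in store.iterdir():
        if item.is_file():  # skip backups/ itself and any other subdirectories
            shutil.copy2(item, dest / item.name)
    return dest
```

Restoring is then the reverse of the `cp` command above: copy the contents of one timestamped directory back into the store root.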
### Quality Checks
Check extraction quality for a specific PDF:
```python
from scripts.build_vector_store import PDFExtractor
text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
```
---
## Troubleshooting
### Issue: "No PDF files found"
**Solution:**
```bash
# Check directory exists
ls -la ./Obs
# Use absolute path
python scripts/build_vector_store.py \
--input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
--output-dir ./data/vector_store
```
### Issue: "Extracted text too short"
**Causes:**
- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF
**Solution:**
```bash
# Check PDF manually
open Obs/problematic.pdf
# Try with OCR (requires the tesseract binary plus the Python wrapper)
brew install tesseract   # macOS; use your package manager on Linux
pip install pytesseract
# The script will automatically fall back to OCR
```
### Issue: "Embedding dimension mismatch"
**Solution:**
```bash
# Check existing config
cat data/vector_store/config.json
# Rebuild with same model
python scripts/build_vector_store.py \
--embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
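Before adding documents with a different model, it can help to compare the saved configuration against the model you intend to use. A sketch, assuming `config.json` stores `embedding_model` and `embedding_dimension` keys; check the actual key names with `cat data/vector_store/config.json` first:

```python
import json
from pathlib import Path


def check_model_match(store_dir: str, model_name: str, model_dim: int) -> None:
    """Raise if the store was built with a different model or dimension."""
    config = json.loads((Path(store_dir) / "config.json").read_text())
    if config["embedding_model"] != model_name or config["embedding_dimension"] != model_dim:
        raise ValueError(
            f"Store built with {config['embedding_model']} "
            f"({config['embedding_dimension']}d), not {model_name} ({model_dim}d); "
            "rebuild instead of mixing models."
        )
```

Mixing embeddings from two models in one FAISS index either fails outright (different dimensions) or silently degrades retrieval (same dimensions, different vector spaces), so rebuilding is the safe path.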
### Issue: "Upload failed"
**Solution:**
```bash
# Check HF token
echo $HF_TOKEN
# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"
# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
```
### Issue: "Out of memory"
**Solution:**
```bash
# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8
# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
# add_document.py takes a single --file, so loop over the second batch
for pdf in Obs/batch2/*.pdf; do
python scripts/add_document.py --file "$pdf" ...
done
```
### Issue: "Duplicate detected but I want to update"
**Solution:**
```bash
# Option 1: Force add (creates duplicate)
python scripts/add_document.py \
--file ./updated.pdf \
--no-duplicate-check \
--vector-store-dir ./data/vector_store
# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
---
## Best Practices
### 1. Organize Your PDFs
```
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
```
### 2. Use Meaningful Citations
```bash
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"
# Bad
--citation "guideline.pdf"
```
### 3. Regular Backups
```bash
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
```
### 4. Test Before Uploading
```bash
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs
# Test with RAG system
# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
```
### 5. Version Control
Add to `.gitignore`:
```
data/vector_store/
test_vector_store/
*.log
backups/
```
Keep in Git:
```
scripts/
Obs/
requirements.txt
```
---
## Integration with VedaMD
### Using Your Vector Store
After building, update your RAG system:
```python
# In enhanced_groq_medical_rag.py or wherever vector store is loaded
# Option 1: Load from local directory
vector_store = SimpleVectorStore("./data/vector_store")
# Option 2: Load from HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
```
### Automatic Reloading
For production, reload vector store periodically:
```python
import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)
```
---
## Next Steps
1. **Build your initial vector store:**
```bash
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
```
2. **Upload to HF:**
```bash
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
```
3. **Test with RAG system:**
```bash
python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
```
4. **Add new documents as they arrive:**
```bash
python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
```
---
**Questions or Issues?**
Check the logs:
- `vector_store_build.log` - Build process
- `add_document.log` - Document additions
Or review the scripts:
- [scripts/build_vector_store.py](scripts/build_vector_store.py)
- [scripts/add_document.py](scripts/add_document.py)
---
**Last Updated**: October 23, 2025
**Version**: 1.0.0