# VedaMD Document Pipeline Guide
**Complete guide for adding and managing medical documents in VedaMD**
---
## Table of Contents
1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Building Vector Store from Scratch](#building-vector-store-from-scratch)
4. [Adding Single Documents](#adding-single-documents)
5. [Updating Existing Documents](#updating-existing-documents)
6. [Uploading to Hugging Face](#uploading-to-hugging-face)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
---
## Overview
### What is the Pipeline?
The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.
**Before Pipeline** (Manual Process):
```
PDF → Extract Text → Chunk  → Embed  → Build FAISS → Upload to HF
        manual       script   script   external      manual
        work         needed   needed   tool          upload

Total time: hours
```
**With Pipeline** (Automated):
```
PDF → python scripts/add_document.py --file file.pdf → Done ✅

Total time: minutes
```
### Pipeline Components
1. **build_vector_store.py** - Build complete vector store from directory of PDFs
2. **add_document.py** - Add single documents to an existing vector store
3. **Automatic Features**:
- PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
- Smart medical chunking
- Duplicate detection
- Quality validation
- HF Hub integration
- Automatic backups
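Duplicate detection works from a content hash of the file (the `add_document.py` logs print a file hash before checking). A minimal sketch of the idea, assuming a SHA-256 digest over the raw bytes; the script's exact hash function and bookkeeping may differ:

```python
import hashlib


def file_hash(path: str) -> str:
    """Hash the raw bytes of a file so re-adding the identical PDF is detected."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()


def is_duplicate(path: str, known_hashes: set[str]) -> bool:
    """True if this exact file content has been added before."""
    return file_hash(path) in known_hashes
```

Identical bytes always collide, but a re-exported PDF with the same text and different bytes will not, which is why a `--no-duplicate-check` escape hatch exists.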
---
## Quick Start
### Prerequisites
All required packages are already installed in your `.venv`:
- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)
### 30-Second Test
```bash
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate
# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
# That's it! ✅
```
---
## Building Vector Store from Scratch
### Basic Usage
Build a vector store from all PDFs in a directory:
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
**Expected output:**
```
🚀 STARTING VECTOR STORE BUILD
============================================================
🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
📄 Breech.pdf
📄 RhESUS.pdf
... (13 more)
============================================================
📄 Processing: Breech.pdf
============================================================
📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
============================================================
📊 Summary:
• PDFs processed: 15
• Total chunks: 247
• Embedding dimension: 384
• Output directory: ./data/vector_store
• Build time: 45.23 seconds
============================================================
```
### Customizing Chunk Size
For longer/shorter chunks:
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--chunk-size 1500 \
--chunk-overlap 150
```
**Recommendations:**
- **chunk-size**: 800-1200 (default: 1000)
- **chunk-overlap**: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context
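The precision/context trade-off can be seen in a minimal character-window chunker with overlap. This is a sketch of the mechanism only; the script's "smart medical chunking" presumably also respects sentence and section boundaries, so its chunk counts will differ:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows; each chunk repeats the last
    `overlap` characters of the previous one so context spans boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

With the defaults, each step advances 900 characters, so roughly 10% of every chunk is shared with its neighbor; raising `overlap` increases redundancy but makes boundary-straddling facts retrievable from either side.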
### Using Different Embedding Model
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--embedding-model "sentence-transformers/all-mpnet-base-v2"
```
**Available models:**
- `all-MiniLM-L6-v2` (default) - Fast, 384d, good quality
- `all-mpnet-base-v2` - Better quality, 768d, slower
- `multi-qa-mpnet-base-dot-v1` - Optimized for Q&A
### Build and Upload to HF
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
**Note**: Requires `HF_TOKEN` environment variable or `--hf-token` argument
---
## Adding Single Documents
### Basic Usage
Add a new guideline to an existing vector store:
```bash
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "SLCOG Hypertension Guidelines 2025" \
--category "Obstetrics" \
--vector-store-dir ./data/vector_store
```
**Expected output:**
```
============================================================
📄 Adding document: new_guideline.pdf
============================================================
📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors
============================================================
💾 Saving updated vector store...
============================================================
📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
============================================================
📊 Summary:
• Chunks added: 14
• Total vectors: 261
• Time taken: 8.43 seconds
============================================================
```
### Add and Upload to HF
```bash
python scripts/add_document.py \
--file ./new_guideline.pdf \
--citation "WHO Guidelines 2025" \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### Allow Duplicates
By default, duplicate detection is enabled. To force add:
```bash
python scripts/add_document.py \
--file ./updated_guideline.pdf \
--vector-store-dir ./data/vector_store \
--no-duplicate-check
```
---
## Updating Existing Documents
To update an existing guideline:
1. **Add new version** (recommended):
```bash
python scripts/add_document.py \
--file ./guidelines_v2.pdf \
--citation "SLCOG Hypertension Guidelines 2025 v2" \
--vector-store-dir ./data/vector_store
```
2. **Rebuild from scratch** (if major changes):
```bash
# Move old PDFs to archive
mkdir -p Obs/archive
mv Obs/old_guideline.pdf Obs/archive/
# Add new version
cp ~/Downloads/new_guideline.pdf Obs/
# Rebuild
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
---
## Uploading to Hugging Face
### Setup HF Token
```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"
# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
```
### Initial Upload
```bash
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### Incremental Upload
After adding a document:
```bash
python scripts/add_document.py \
--file ./new.pdf \
--vector-store-dir ./data/vector_store \
--upload \
--repo-id sniro23/VedaMD-Vector-Store
```
### What Gets Uploaded
- ✅ `faiss_index.bin` - FAISS vector index
- ✅ `documents.json` - Document chunks
- ✅ `metadata.json` - Citations, sources, sections
- ✅ `config.json` - Configuration settings
- ✅ `build_log.json` - Build information
---
## Advanced Usage
### Batch Processing Multiple Files
```bash
# Create a script to add multiple files
for pdf in new_guidelines/*.pdf; do
python scripts/add_document.py \
--file "$pdf" \
--citation "$(basename "$pdf" .pdf)" \
--vector-store-dir ./data/vector_store
done
# Then upload the updated store once. Re-running add_document.py with a
# placeholder PDF would insert that placeholder into the store, so push
# the folder directly instead:
python -c "from huggingface_hub import upload_folder; upload_folder(folder_path='data/vector_store', repo_id='sniro23/VedaMD-Vector-Store', repo_type='dataset')"
```
### Inspecting Vector Store
```bash
# View config
cat data/vector_store/config.json
# View build log
cat data/vector_store/build_log.json | python -m json.tool
# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"
# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
```
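The one-liners above can be collected into a small inspection helper. This assumes, as those commands imply, that `documents.json` is a JSON list of chunk strings and `metadata.json` a parallel list of dicts with a `source` key:

```python
import json
from collections import Counter
from pathlib import Path


def summarize_store(store_dir: str) -> dict:
    """Return chunk count and per-source chunk counts for a vector store dir."""
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text())
    metadata = json.loads((store / "metadata.json").read_text())
    return {
        "chunks": len(documents),
        "sources": Counter(m["source"] for m in metadata),
    }
```

Running `summarize_store("data/vector_store")` after a build should report the same chunk total as the build summary.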
### Backup Management
Backups are created automatically in `data/vector_store/backups/`:
```bash
# List backups
ls -lh data/vector_store/backups/
# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
```
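For an extra snapshot outside the automatic flow, the same timestamped layout can be reproduced in a few lines. A sketch, assuming backups live under `backups/YYYYMMDD_HHMMSS/` as the save logs show:

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_store(store_dir: str) -> Path:
    """Copy the store's top-level files into backups/<timestamp>/ and return that path."""
    store = Path(store_dir)
    dest = store / "backups" / datetime.now().strftime("%Y%m%d_%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for item in store.iterdir():
        if item.is_file():  # skip backups/ itself and any other subdirectories
            shutil.copy2(item, dest / item.name)
    return dest
```

Restoring is then the reverse of the `cp` command above: copy the contents of one timestamped directory back into the store root.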
### Quality Checks
Check extraction quality for a specific PDF:
```python
from scripts.build_vector_store import PDFExtractor
text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
```
---
## Troubleshooting
### Issue: "No PDF files found"
**Solution:**
```bash
# Check directory exists
ls -la ./Obs
# Use absolute path
python scripts/build_vector_store.py \
--input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
--output-dir ./data/vector_store
```
### Issue: "Extracted text too short"
**Causes:**
- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF
**Solution:**
```bash
# Check PDF manually
open Obs/problematic.pdf
# Try with OCR (requires the tesseract binary plus the Python wrapper)
brew install tesseract   # macOS; use your package manager on Linux
pip install pytesseract
# The script will automatically fall back to OCR
```
### Issue: "Embedding dimension mismatch"
**Solution:**
```bash
# Check existing config
cat data/vector_store/config.json
# Rebuild with same model
python scripts/build_vector_store.py \
--embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
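Before adding documents with a different model, it can help to compare the saved configuration against the model you intend to use. A sketch, assuming `config.json` stores `embedding_model` and `embedding_dimension` keys; check the actual key names with `cat data/vector_store/config.json` first:

```python
import json
from pathlib import Path


def check_model_match(store_dir: str, model_name: str, model_dim: int) -> None:
    """Raise if the store was built with a different model or dimension."""
    config = json.loads((Path(store_dir) / "config.json").read_text())
    if config["embedding_model"] != model_name or config["embedding_dimension"] != model_dim:
        raise ValueError(
            f"Store built with {config['embedding_model']} "
            f"({config['embedding_dimension']}d), not {model_name} ({model_dim}d); "
            "rebuild instead of mixing models."
        )
```

Mixing embeddings from two models in one FAISS index either fails outright (different dimensions) or silently degrades retrieval (same dimensions, different vector spaces), so rebuilding is the safe path.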
### Issue: "Upload failed"
**Solution:**
```bash
# Check HF token
echo $HF_TOKEN
# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"
# Create repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
```
### Issue: "Out of memory"
**Solution:**
```bash
# Reduce batch size in script (edit build_vector_store.py)
# Line ~338: change batch_size=32 to batch_size=8
# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into batches
python scripts/build_vector_store.py --input-dir Obs/batch1 ...
# add_document.py takes a single --file, so loop over the second batch
for pdf in Obs/batch2/*.pdf; do
python scripts/add_document.py --file "$pdf" ...
done
```
### Issue: "Duplicate detected but I want to update"
**Solution:**
```bash
# Option 1: Force add (creates duplicate)
python scripts/add_document.py \
--file ./updated.pdf \
--no-duplicate-check \
--vector-store-dir ./data/vector_store
# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
--input-dir ./Obs \
--output-dir ./data/vector_store
```
---
## Best Practices
### 1. Organize Your PDFs
```
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
```
### 2. Use Meaningful Citations
```bash
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"
# Bad
--citation "guideline.pdf"
```
### 3. Regular Backups
```bash
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
```
### 4. Test Before Uploading
```bash
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs
# Test with RAG system
# Then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
```
### 5. Version Control
Add to `.gitignore`:
```
data/vector_store/
test_vector_store/
*.log
backups/
```
Keep in Git:
```
scripts/
Obs/
requirements.txt
```
---
## Integration with VedaMD
### Using Your Vector Store
After building, update your RAG system:
```python
# In enhanced_groq_medical_rag.py or wherever vector store is loaded
# Option 1: Load from local directory
vector_store = SimpleVectorStore("./data/vector_store")
# Option 2: Load from HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
```
### Automatic Reloading
For production, reload vector store periodically:
```python
import schedule
import time

def reload_vector_store():
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)
```
---
## Next Steps
1. **Build your initial vector store:**
```bash
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
```
2. **Upload to HF:**
```bash
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
```
3. **Test with RAG system:**
```bash
python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
```
4. **Add new documents as they arrive:**
```bash
python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
```
---
**Questions or Issues?**
Check the logs:
- `vector_store_build.log` - Build process
- `add_document.log` - Document additions
Or review the scripts:
- [scripts/build_vector_store.py](scripts/build_vector_store.py)
- [scripts/add_document.py](scripts/add_document.py)
---
**Last Updated**: October 23, 2025
**Version**: 1.0.0