# VedaMD Document Pipeline Guide

**Complete guide for adding and managing medical documents in VedaMD**

---

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Building Vector Store from Scratch](#building-vector-store-from-scratch)
4. [Adding Single Documents](#adding-single-documents)
5. [Updating Existing Documents](#updating-existing-documents)
6. [Uploading to Hugging Face](#uploading-to-hugging-face)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)

---
## Overview

### What is the Pipeline?

The VedaMD pipeline automates the process of converting medical PDF documents into a searchable vector store that powers the RAG system.

**Before Pipeline** (Manual Process):

```
PDF  →  Extract Text  →  Chunk   →  Embed   →  Build FAISS  →  Upload to HF
 ↓          ↓              ↓          ↓            ↓               ↓
Hours    Manual          Script     Script      External        Manual
         Work            Needed     Needed      Tool            Upload
```

**With Pipeline** (Automated):

```
PDF → python add_document.py file.pdf → Done ✅
                    ↓
                 Minutes
```
### Pipeline Components

1. **build_vector_store.py** - Build a complete vector store from a directory of PDFs
2. **add_document.py** - Add single documents to an existing vector store
3. **Automatic Features**:
   - PDF text extraction (PyMuPDF, pdfplumber, OCR fallback)
   - Smart medical chunking
   - Duplicate detection
   - Quality validation
   - HF Hub integration
   - Automatic backups
---

## Quick Start

### Prerequisites

All required packages are already installed in your `.venv`:

- ✅ PyMuPDF (PDF extraction)
- ✅ pdfplumber (backup PDF extraction)
- ✅ sentence-transformers (embeddings)
- ✅ faiss-cpu (vector indexing)
- ✅ huggingface-hub (uploading)

### 30-Second Test

```bash
# Activate environment
cd "/Users/niro/Documents/SL Clinical Assistant"
source .venv/bin/activate

# Build vector store from your existing PDFs
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store

# That's it! ✅
```
---

## Building Vector Store from Scratch

### Basic Usage

Build a vector store from all PDFs in a directory:

```bash
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store
```

**Expected output:**

```
🚀 STARTING VECTOR STORE BUILD
============================================================
🔍 Scanning for PDFs in Obs
✅ Found 15 PDF files
   📄 Breech.pdf
   📄 RhESUS.pdf
   ... (13 more)
============================================================
📄 Processing: Breech.pdf
============================================================
📄 Extracting with PyMuPDF: Obs/Breech.pdf
✅ Extracted 1988 characters from 1 pages
📝 Chunking text from Breech.pdf
✅ Created 2 chunks from Breech.pdf
🧮 Generating embeddings for 2 chunks...
✅ Processed Breech.pdf: 2 chunks added
... (processes all PDFs)
============================================================
✅ BUILD COMPLETE!
============================================================
📊 Summary:
   • PDFs processed: 15
   • Total chunks: 247
   • Embedding dimension: 384
   • Output directory: ./data/vector_store
   • Build time: 45.23 seconds
============================================================
```
### Customizing Chunk Size

For longer/shorter chunks:

```bash
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --chunk-size 1500 \
  --chunk-overlap 150
```

**Recommendations:**

- **chunk-size**: 800-1200 (default: 1000)
- **chunk-overlap**: 50-200 (default: 100)
- Smaller chunks = more precise retrieval
- Larger chunks = better context
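To make the trade-off concrete, here is a minimal sliding-window chunker (a sketch of the general technique only — the real scripts use medical-structure-aware chunking):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into windows of `chunk_size` characters, where each
    window repeats the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping part
    return chunks
```

With the defaults, a 2,500-character document yields three chunks, and each chunk begins with the last 100 characters of the previous one — that overlap is what keeps sentences near a chunk boundary retrievable.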
### Using a Different Embedding Model

```bash
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --embedding-model "sentence-transformers/all-mpnet-base-v2"
```

**Available models:**

- `all-MiniLM-L6-v2` (default) - Fast, 384d, good quality
- `all-mpnet-base-v2` - Better quality, 768d, slower
- `multi-qa-mpnet-base-dot-v1` - Optimized for Q&A
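Whichever model you choose, retrieval works the same way: the query is embedded with the same model and chunks are ranked by vector similarity. A dependency-free sketch of that ranking step (FAISS does this at index scale; the function names here are illustrative):

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most similar to the query vector."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]
```

This is also why the embedding model must not change between builds: vectors from `all-MiniLM-L6-v2` (384d) and `all-mpnet-base-v2` (768d) live in different spaces and are not comparable.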
### Build and Upload to HF

```bash
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store
```

**Note**: Requires the `HF_TOKEN` environment variable or the `--hf-token` argument.

---
## Adding Single Documents

### Basic Usage

Add a new guideline to an existing vector store:

```bash
python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "SLCOG Hypertension Guidelines 2025" \
  --category "Obstetrics" \
  --vector-store-dir ./data/vector_store
```

**Expected output:**

```
============================================================
📄 Adding document: new_guideline.pdf
============================================================
📄 Extracting with PyMuPDF: ./new_guideline.pdf
✅ Extracted 12,456 characters from 8 pages
🔑 File hash: a3f2c9d8e1b0...
🔍 Checking for duplicates...
✅ No duplicates found
📝 Created 14 chunks
🧮 Generating embeddings...
📊 Adding to FAISS index...
✅ Added 14 chunks to vector store
📊 New total: 261 vectors
============================================================
💾 Saving updated vector store...
============================================================
📦 Backup created: data/vector_store/backups/20251023_150000
✅ Saved FAISS index
✅ Saved documents
✅ Saved metadata
✅ Updated config
============================================================
✅ DOCUMENT ADDED SUCCESSFULLY!
============================================================
📊 Summary:
   • Chunks added: 14
   • Total vectors: 261
   • Time taken: 8.43 seconds
============================================================
```
### Add and Upload to HF

```bash
python scripts/add_document.py \
  --file ./new_guideline.pdf \
  --citation "WHO Guidelines 2025" \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store
```

### Allow Duplicates

By default, duplicate detection is enabled. To force an add:

```bash
python scripts/add_document.py \
  --file ./updated_guideline.pdf \
  --vector-store-dir ./data/vector_store \
  --no-duplicate-check
```
---

## Updating Existing Documents

To update an existing guideline:

1. **Add the new version** (recommended):

   ```bash
   python scripts/add_document.py \
     --file ./guidelines_v2.pdf \
     --citation "SLCOG Hypertension Guidelines 2025 v2" \
     --vector-store-dir ./data/vector_store
   ```

2. **Rebuild from scratch** (if major changes):

   ```bash
   # Move old PDFs to archive
   mkdir -p Obs/archive
   mv Obs/old_guideline.pdf Obs/archive/

   # Add new version
   cp ~/Downloads/new_guideline.pdf Obs/

   # Rebuild
   python scripts/build_vector_store.py \
     --input-dir ./Obs \
     --output-dir ./data/vector_store
   ```
---

## Uploading to Hugging Face

### Setup HF Token

```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="hf_your_token_here"

# Option 2: Pass as argument
python scripts/build_vector_store.py --hf-token "hf_your_token_here" ...
```

### Initial Upload

```bash
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store
```

### Incremental Upload

After adding a document:

```bash
python scripts/add_document.py \
  --file ./new.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store
```

### What Gets Uploaded

- ✅ `faiss_index.bin` - FAISS vector index
- ✅ `documents.json` - Document chunks
- ✅ `metadata.json` - Citations, sources, sections
- ✅ `config.json` - Configuration settings
- ✅ `build_log.json` - Build information
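For reference, `config.json` is what lets later runs validate compatibility (for example, catching an embedding-dimension mismatch before adding new chunks). A hypothetical example — the actual keys depend on the script version:

```json
{
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "embedding_dimension": 384,
  "chunk_size": 1000,
  "chunk_overlap": 100,
  "total_vectors": 247
}
```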
---

## Advanced Usage

### Batch Processing Multiple Files

```bash
# Add multiple files in a loop
for pdf in new_guidelines/*.pdf; do
  python scripts/add_document.py \
    --file "$pdf" \
    --citation "$(basename "$pdf" .pdf)" \
    --vector-store-dir ./data/vector_store
done

# Then trigger a single upload (this re-adds dummy.pdf as a side effect,
# so use any small throwaway PDF)
python scripts/add_document.py \
  --file dummy.pdf \
  --vector-store-dir ./data/vector_store \
  --upload \
  --repo-id sniro23/VedaMD-Vector-Store \
  --no-duplicate-check
```
### Inspecting Vector Store

```bash
# View config
cat data/vector_store/config.json

# View build log
cat data/vector_store/build_log.json | python -m json.tool

# Count documents
python -c "import json; print(len(json.load(open('data/vector_store/documents.json'))))"

# List sources
python -c "import json; meta=json.load(open('data/vector_store/metadata.json')); print(set(m['source'] for m in meta))"
```

### Backup Management

Backups are created automatically in `data/vector_store/backups/`:

```bash
# List backups
ls -lh data/vector_store/backups/

# Restore from backup (if needed)
cp data/vector_store/backups/20251023_150000/* data/vector_store/
```
### Quality Checks

Check extraction quality for a specific PDF:

```python
from scripts.build_vector_store import PDFExtractor

text, metadata = PDFExtractor.extract_text("Obs/Breech.pdf")
print(f"Extracted {len(text)} characters")
print(f"Pages: {metadata['pages']}")
print(f"Method: {metadata['method']}")
print(f"\nFirst 500 chars:\n{text[:500]}")
```
---

## Troubleshooting

### Issue: "No PDF files found"

**Solution:**

```bash
# Check the directory exists
ls -la ./Obs

# Use an absolute path
python scripts/build_vector_store.py \
  --input-dir "/Users/niro/Documents/SL Clinical Assistant/Obs" \
  --output-dir ./data/vector_store
```

### Issue: "Extracted text too short"

**Causes:**

- Scanned PDF (image-based)
- Encrypted PDF
- Corrupted PDF

**Solution:**

```bash
# Check the PDF manually
open Obs/problematic.pdf

# Enable OCR (requires tesseract)
pip install pytesseract

# The script will automatically fall back to OCR
```

### Issue: "Embedding dimension mismatch"

**Solution:**

```bash
# Check the existing config
cat data/vector_store/config.json

# Rebuild with the same model
python scripts/build_vector_store.py \
  --embedding-model "sentence-transformers/all-MiniLM-L6-v2" \
  --input-dir ./Obs \
  --output-dir ./data/vector_store
```

### Issue: "Upload failed"

**Solution:**

```bash
# Check HF token
echo $HF_TOKEN

# Test token
python -c "from huggingface_hub import HfApi; print(HfApi(token='$HF_TOKEN').whoami())"

# Create the repo first
python -c "from huggingface_hub import create_repo; create_repo('sniro23/VedaMD-Vector-Store', repo_type='dataset', exist_ok=True)"
```
### Issue: "Out of memory"

**Solution:**

```bash
# Reduce the embedding batch size (edit build_vector_store.py,
# line ~338: change batch_size=32 to batch_size=8)

# Or process PDFs in smaller batches
mkdir -p Obs/batch1 Obs/batch2
# Move PDFs into the batches, build from the first...
python scripts/build_vector_store.py --input-dir Obs/batch1 --output-dir ./data/vector_store
# ...then add the second batch one file at a time
for pdf in Obs/batch2/*.pdf; do
  python scripts/add_document.py --file "$pdf" --vector-store-dir ./data/vector_store
done
```
### Issue: "Duplicate detected but I want to update"

**Solution:**

```bash
# Option 1: Force add (creates a duplicate)
python scripts/add_document.py \
  --file ./updated.pdf \
  --no-duplicate-check \
  --vector-store-dir ./data/vector_store

# Option 2: Rebuild from scratch
python scripts/build_vector_store.py \
  --input-dir ./Obs \
  --output-dir ./data/vector_store
```
---

## Best Practices

### 1. Organize Your PDFs

```
Obs/
├── obstetrics/
│   ├── preeclampsia.pdf
│   ├── hemorrhage.pdf
│   └── ...
├── cardiology/
│   └── ...
└── general/
    └── ...
```

### 2. Use Meaningful Citations

```bash
# Good
--citation "SLCOG Preeclampsia Management Guidelines 2025"

# Bad
--citation "guideline.pdf"
```

### 3. Regular Backups

```bash
# Before major changes
cp -r data/vector_store data/vector_store_backup_$(date +%Y%m%d)
```

### 4. Test Before Uploading

```bash
# Build locally first
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./test_vs

# Test with the RAG system, then upload
python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload
```

### 5. Version Control

Add to `.gitignore`:

```
data/vector_store/
test_vector_store/
*.log
backups/
```

Keep in Git:

```
scripts/
Obs/
requirements.txt
```
---

## Integration with VedaMD

### Using Your Vector Store

After building, update your RAG system:

```python
# In enhanced_groq_medical_rag.py, or wherever the vector store is loaded

# Option 1: Load from a local directory
vector_store = SimpleVectorStore("./data/vector_store")

# Option 2: Load from the HF Hub
vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
```

### Automatic Reloading

For production, reload the vector store periodically:

```python
import logging
import time

import schedule  # pip install schedule

logger = logging.getLogger(__name__)

def reload_vector_store():
    """Re-download the latest vector store from the HF Hub."""
    global vector_store
    vector_store = SimpleVectorStore.from_pretrained("sniro23/VedaMD-Vector-Store")
    logger.info("✅ Vector store reloaded")

# Reload every 6 hours
schedule.every(6).hours.do(reload_vector_store)

while True:
    schedule.run_pending()
    time.sleep(60)
```
---

## Next Steps

1. **Build your initial vector store:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store
   ```

2. **Upload to HF:**

   ```bash
   python scripts/build_vector_store.py --input-dir ./Obs --output-dir ./data/vector_store --upload --repo-id sniro23/VedaMD-Vector-Store
   ```

3. **Test with the RAG system:**

   ```bash
   python -c "from src.enhanced_groq_medical_rag import EnhancedGroqMedicalRAG; rag = EnhancedGroqMedicalRAG(); print(rag.query('What is preeclampsia?'))"
   ```

4. **Add new documents as they arrive:**

   ```bash
   python scripts/add_document.py --file ./new.pdf --vector-store-dir ./data/vector_store --upload
   ```
---

**Questions or Issues?**

Check the logs:

- `vector_store_build.log` - Build process
- `add_document.log` - Document additions

Or review the scripts:

- [scripts/build_vector_store.py](scripts/build_vector_store.py)
- [scripts/add_document.py](scripts/add_document.py)

---

**Last Updated**: October 23, 2025
**Version**: 1.0.0