---
title: Pathology RAG System
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: streamlit
app_file: app.py
pinned: false
---
# Visual Architecture Diagram - Pathology Report Knowledge Extraction

## System Architecture - Complete Flow
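Stage 3 of the diagram below lists a semantic chunker with 512-1024 token windows and a 128-token overlap. The following is a minimal sketch of that sliding window, not the project's actual chunker: `chunk_tokens` is a hypothetical helper, and plain list items stand in for the output of a real tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 128) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical usage: 1,000 placeholder tokens instead of a tokenized report.
report_tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(report_tokens)
print(len(chunks), [len(c) for c in chunks])
```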
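Stage 6 of the diagram below runs a dense search and a sparse search, each returning its top 20 chunks, and passes the result to a reranker. The diagram does not say how the two lists are merged. One common option is reciprocal rank fusion (RRF), sketched here purely as an assumption: `rrf_fuse`, the chunk IDs, and the constant `k = 60` are illustrative and not taken from this project.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Merge ranked lists of chunk IDs with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)  # higher ranks contribute more
    # Sort by fused score (descending); break ties by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical usage: three hits per retriever instead of the diagram's top 20.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused = rrf_fuse([dense_hits, sparse_hits], top_n=3)
print(fused)
```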
```
┌──────────────────────────────────────────────────────────────────────┐
│                  PATHOLOGY REPORT PROCESSING SYSTEM                  │
│                  RAG + Spark NLP + Vector Database                   │
└──────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────┐
│ 1. DATA INGESTION                                                    │
└──────────────────────────────────────────────────────────────────────┘

    Pathology Reports (PDF)
    │
    ├─── Scanned PDFs ────────► OCR (Tesseract/EasyOCR)
    │                               │
    └─── Digital PDFs ────────► PyMuPDF Text Extraction
                                    │
                                    ▼
                             Raw Text Files
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 2. SPARK NLP PROCESSING                                              │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
         ┌──────────────────────────┴──────────────────────────┐
         │             Spark NLP Medical Pipeline              │
         │                                                     │
         │    ┌───────────────────────────────────────────┐    │
         │    │ Stage 1: Document Assembly                │    │
         │    │   • DocumentAssembler                     │    │
         │    │   • SentenceDetector                      │    │
         │    │   • Tokenizer                             │    │
         │    └─────────────────────┬─────────────────────┘    │
         │                          ▼                          │
         │    ┌───────────────────────────────────────────┐    │
         │    │ Stage 2: Entity Recognition (NER)         │    │
         │    │   • Medical NER (BioBERT/ClinicalBERT)    │    │
         │    │   • Extract: PROBLEM, TREATMENT, TEST,    │    │
         │    │              ANATOMY, LAB_VALUE           │    │
         │    └─────────────────────┬─────────────────────┘    │
         │                          ▼                          │
         │    ┌───────────────────────────────────────────┐    │
         │    │ Stage 3: Assertion & Relations            │    │
         │    │   • AssertionDL (present/absent/possible) │    │
         │    │   • RelationExtraction (entity links)     │    │
         │    └─────────────────────┬─────────────────────┘    │
         │                          │                          │
         └──────────────────────────┬──────────────────────────┘
                                    ▼
                        Structured Clinical Data
                        {
                          "entities": [...],
                          "relations": [...],
                          "assertions": [...],
                          "metadata": {...}
                        }
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 3. CHUNKING & ENRICHMENT                                             │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
                  ┌─────────────────┴─────────────────┐
                  │                                   │
                  ▼                                   ▼
        ┌───────────────────┐               ┌───────────────────┐
        │ Section-Based     │               │ Semantic-Based    │
        │ Chunking          │               │ Chunking          │
        │                   │               │                   │
        │ • Clinical        │               │ • 512-1024        │
        │   History         │               │   tokens          │
        │ • Findings        │               │ • 128 overlap     │
        │ • Diagnosis       │               │ • Entity-aware    │
        │ • Treatment       │               │                   │
        └─────────┬─────────┘               └─────────┬─────────┘
                  │                                   │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                    📦 Enriched Chunks with Metadata
                        {
                          "chunk_id": "...",
                          "text": "...",
                          "entities": [...],
                          "section": "...",
                          "report_date": "...",
                          "report_type": "..."
                        }
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 4. EMBEDDING GENERATION                                              │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
                  ┌─────────────────┴─────────────────┐
                  │                                   │
                  ▼                                   ▼
        ┌───────────────────┐               ┌───────────────────┐
        │ Dense Embeddings  │               │ Sparse Embeddings │
        │                   │               │                   │
        │ • BioBERT         │               │ • BM25            │
        │ • ClinicalBERT    │               │ • TF-IDF          │
        │ • PubMedBERT      │               │ • Keyword Index   │
        │ • SapBERT         │               │                   │
        │                   │               │                   │
        │ 768-dim vectors   │               │ Sparse vectors    │
        └─────────┬─────────┘               └─────────┬─────────┘
                  │                                   │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                          🔢 Hybrid Embeddings
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 5. VECTOR DATABASE STORAGE                                           │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
      ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
      │ ChromaDB      │     │ FAISS         │     │ Pinecone      │
      │               │     │               │     │               │
      │ • Dev/Test    │     │ • Production  │     │ • Cloud       │
      │ • Easy setup  │     │ • Fast        │     │ • Managed     │
      │ • Metadata    │     │ • Scalable    │     │ • Enterprise  │
      └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
              │                     │                     │
              └─────────────────────┼─────────────────────┘
                                    │
                                    ▼
                        💾 Indexed Knowledge Base
                        • Embeddings: 384-768 dims
                        • Metadata: entities, dates, types
                        • Relations: entity graphs
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 6. QUERY & RETRIEVAL (RAG)                                           │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
          👤 User Query: "What are ER+ breast cancer markers?"
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │ Query Processing              │
                    │   • Entity extraction         │
                    │   • Query expansion           │
                    │   • Generate embeddings       │
                    └───────────────┬───────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │ Hybrid Retrieval              │
                    │                               │
                    │  ┌─────────────────────┐      │
                    │  │ Dense Search        │      │
                    │  │ (Semantic)          ├──────┼──► Top 20 chunks
                    │  └─────────────────────┘      │
                    │                               │
                    │  ┌─────────────────────┐      │
                    │  │ Sparse Search       │      │
                    │  │ (BM25/Keywords)     ├──────┼──► Top 20 chunks
                    │  └─────────────────────┘      │
                    │                               │
                    │  ┌─────────────────────┐      │
                    │  │ Entity Filter       │      │
                    │  │ (Medical entities)  ├──────┼──► Filtered
                    │  └─────────────────────┘      │
                    └───────────────┬───────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │ Reranking                     │
                    │   • Cross-encoder scoring     │
                    │   • Medical relevance         │
                    │   • Temporal filtering        │
                    └───────────────┬───────────────┘
                                    │
                                    ▼
                        Top 5-10 Relevant Chunks
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 7. GENERATION (LLM)                                                  │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
                                    ▼
                  ┌───────────────────────────────────┐
                  │ Prompt Construction               │
                  │                                   │
                  │ System: "You are a medical        │
                  │          expert assistant..."     │
                  │                                   │
                  │ Context: [Retrieved chunks]       │
                  │                                   │
                  │ Query: [User question]            │
                  │                                   │
                  │ Instructions: "Answer with        │
                  │                citations..."      │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                  ┌───────────────────────────────────┐
                  │ LLM (Claude/GPT-4/Med-PaLM)       │
                  │                                   │
                  │   • Medical reasoning             │
                  │   • Citation generation           │
                  │   • Accuracy validation           │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                  ┌───────────────────────────────────┐
                  │ Post-processing                   │
                  │   • Format citations              │
                  │   • Fact checking                 │
                  │   • Safety validation             │
                  └─────────────────┬─────────────────┘
                                    │
                                    ▼
                            💬 Final Response
                                    │
                                    │
┌───────────────────────────────────┼──────────────────────────────────┐
│ 8. USER INTERFACE                                                    │
└───────────────────────────────────┼──────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
              ┌───────────┐   ┌───────────┐   ┌───────────┐
              │ CLI       │   │ Web UI    │   │ REST API  │
              │           │   │           │   │           │
              │ Python    │   │ Streamlit │   │ FastAPI   │
              │ Script    │   │ Gradio    │   │           │
              └─────┬─────┘   └─────┬─────┘   └─────┬─────┘
                    │               │               │
                    └───────────────┼───────────────┘
                                    │
                                    ▼
                            User Gets Answer
                        with Citations & Sources


┌──────────────────────────────────────────────────────────────────────┐
│                        SUPPORTING COMPONENTS                         │
└──────────────────────────────────────────────────────────────────────┘

┌─────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│ Monitoring &        │ │ Knowledge Graph     │ │ Caching Layer        │
│ Logging             │ │ (Optional)          │ │                      │
│                     │ │                     │ │ • Query cache        │
│ • MLflow            │ │ • Neo4j             │ │ • Embedding cache    │
│ • W&B               │ │ • NetworkX          │ │ • LLM response cache │
│ • Prometheus        │ │ • Entity graphs     │ │ • Redis/Memcached    │
└─────────────────────┘ └─────────────────────┘ └──────────────────────┘

┌─────────────────────┐ ┌─────────────────────┐ ┌──────────────────────┐
│ Security &          │ │ Evaluation          │ │ Data Pipeline        │
│ Compliance          │ │ Metrics             │ │                      │
│                     │ │                     │ │ • Apache Airflow     │
│ • De-identification │ │ • Precision@k       │ │ • Spark jobs         │
│ • HIPAA compliance  │ │ • Medical accuracy  │ │ • Batch processing   │
│ • Access control    │ │ • Latency           │ │ • ETL workflows      │
└─────────────────────┘ └─────────────────────┘ └──────────────────────┘

════════════════════════════════════════════════════════════════════════
                            END-TO-END FLOW
════════════════════════════════════════════════════════════════════════