metadata
title: Pathology RAG System
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: streamlit
app_file: app.py
pinned: false
Visual Architecture Diagram - Pathology Report Knowledge Extraction
System Architecture - Complete Flow
┌───────────────────────────────────────────────────────────────────────────┐
│                    PATHOLOGY REPORT PROCESSING SYSTEM                     │
│                     RAG + Spark NLP + Vector Database                     │
└───────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────────────┐
│ 1. DATA INGESTION                                                         │
└───────────────────────────────────────────────────────────────────────────┘
    Pathology Reports (PDF)
    │
    ├─── Scanned PDFs ────────► OCR (Tesseract/EasyOCR)
    │                                 │
    └─── Digital PDFs ────────► PyMuPDF Text Extraction
                                      │
                                      ▼
                               Raw Text Files
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 2. SPARK NLP PROCESSING             │                                     │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
            ┌─────────────────────────┴─────────────────────────┐
            │            Spark NLP Medical Pipeline             │
            │                                                   │
            │   ┌───────────────────────────────────────────┐   │
            │   │ Stage 1: Document Assembly                │   │
            │   │   • DocumentAssembler                     │   │
            │   │   • SentenceDetector                      │   │
            │   │   • Tokenizer                             │   │
            │   └─────────────────────┬─────────────────────┘   │
            │                         ▼                         │
            │   ┌───────────────────────────────────────────┐   │
            │   │ Stage 2: Entity Recognition (NER)         │   │
            │   │   • Medical NER (BioBERT/ClinicalBERT)    │   │
            │   │   • Extract: PROBLEM, TREATMENT, TEST,    │   │
            │   │              ANATOMY, LAB_VALUE           │   │
            │   └─────────────────────┬─────────────────────┘   │
            │                         ▼                         │
            │   ┌───────────────────────────────────────────┐   │
            │   │ Stage 3: Assertion & Relations            │   │
            │   │   • AssertionDL (present/absent/possible) │   │
            │   │   • RelationExtraction (entity links)     │   │
            │   └─────────────────────┬─────────────────────┘   │
            │                         │                         │
            └─────────────────────────┼─────────────────────────┘
                                      ▼
                          Structured Clinical Data
                          {
                            "entities": [...],
                            "relations": [...],
                            "assertions": [...],
                            "metadata": {...}
                          }
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 3. CHUNKING & ENRICHMENT                                                  │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
          ┌───────────────────┐               ┌───────────────────┐
          │ Section-Based     │               │ Semantic-Based    │
          │ Chunking          │               │ Chunking          │
          │                   │               │                   │
          │ • Clinical        │               │ • 512-1024        │
          │   History         │               │   tokens          │
          │ • Findings        │               │ • 128 overlap     │
          │ • Diagnosis       │               │ • Entity-aware    │
          │ • Treatment       │               │                   │
          └─────────┬─────────┘               └─────────┬─────────┘
                    │                                   │
                    └─────────────────┬─────────────────┘
                                      │
                                      ▼
                      📦 Enriched Chunks with Metadata
                        {
                          "chunk_id": "...",
                          "text": "...",
                          "entities": [...],
                          "section": "...",
                          "report_date": "...",
                          "report_type": "..."
                        }
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 4. EMBEDDING GENERATION                                                   │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
         ┌─────────────────────┐             ┌─────────────────────┐
         │ Dense Embeddings    │             │ Sparse Embeddings   │
         │                     │             │                     │
         │ • BioBERT           │             │ • BM25              │
         │ • ClinicalBERT      │             │ • TF-IDF            │
         │ • PubMedBERT        │             │ • Keyword Index     │
         │ • SapBERT           │             │                     │
         │                     │             │                     │
         │ 768-dim vectors     │             │ Sparse vectors      │
         └──────────┬──────────┘             └──────────┬──────────┘
                    │                                   │
                    └─────────────────┬─────────────────┘
                                      │
                                      ▼
                            🔢 Hybrid Embeddings
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 5. VECTOR DATABASE STORAGE                                                │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
              ▼                       ▼                       ▼
      ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
      │ ChromaDB      │       │ FAISS         │       │ Pinecone      │
      │               │       │               │       │               │
      │ • Dev/Test    │       │ • Production  │       │ • Cloud       │
      │ • Easy setup  │       │ • Fast        │       │ • Managed     │
      │ • Metadata    │       │ • Scalable    │       │ • Enterprise  │
      └───────┬───────┘       └───────┬───────┘       └───────┬───────┘
              │                       │                       │
              └───────────────────────┼───────────────────────┘
                                      │
                                      ▼
                         💾 Indexed Knowledge Base
                          • Embeddings: 384-768 dims
                          • Metadata: entities, dates, types
                          • Relations: entity graphs
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 6. QUERY & RETRIEVAL (RAG)                                                │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
            👤 User Query: "What are ER+ breast cancer markers?"
                                      │
                                      ▼
                       ┌─────────────────────────────┐
                       │ Query Processing            │
                       │ • Entity extraction         │
                       │ • Query expansion           │
                       │ • Generate embeddings       │
                       └──────────────┬──────────────┘
                                      │
                                      ▼
                       ┌─────────────────────────────┐
                       │ Hybrid Retrieval            │
                       │                             │
                       │  ┌─────────────────────┐    │
                       │  │ Dense Search        │    │
                       │  │ (Semantic)          ├────┼──► Top 20 chunks
                       │  └─────────────────────┘    │
                       │                             │
                       │  ┌─────────────────────┐    │
                       │  │ Sparse Search       │    │
                       │  │ (BM25/Keywords)     ├────┼──► Top 20 chunks
                       │  └─────────────────────┘    │
                       │                             │
                       │  ┌─────────────────────┐    │
                       │  │ Entity Filter       │    │
                       │  │ (Medical entities)  ├────┼──► Filtered
                       │  └─────────────────────┘    │
                       └──────────────┬──────────────┘
                                      │
                                      ▼
                       ┌─────────────────────────────┐
                       │ Reranking                   │
                       │ • Cross-encoder scoring     │
                       │ • Medical relevance         │
                       │ • Temporal filtering        │
                       └──────────────┬──────────────┘
                                      │
                                      ▼
                          Top 5-10 Relevant Chunks
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 7. GENERATION (LLM)                                                       │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
                                      ▼
                     ┌─────────────────────────────────┐
                     │ Prompt Construction             │
                     │                                 │
                     │ System: "You are a medical      │
                     │          expert assistant..."   │
                     │                                 │
                     │ Context: [Retrieved chunks]     │
                     │                                 │
                     │ Query: [User question]          │
                     │                                 │
                     │ Instructions: "Answer with      │
                     │                citations..."    │
                     └────────────────┬────────────────┘
                                      │
                                      ▼
                     ┌─────────────────────────────────┐
                     │ LLM (Claude/GPT-4/Med-PaLM)     │
                     │                                 │
                     │ • Medical reasoning             │
                     │ • Citation generation           │
                     │ • Accuracy validation           │
                     └────────────────┬────────────────┘
                                      │
                                      ▼
                     ┌─────────────────────────────────┐
                     │ Post-processing                 │
                     │ • Format citations              │
                     │ • Fact checking                 │
                     │ • Safety validation             │
                     └────────────────┬────────────────┘
                                      │
                                      ▼
                             💬 Final Response
                                      │
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────┐
│ 8. USER INTERFACE                                                         │
└─────────────────────────────────────┼─────────────────────────────────────┘
                                      │
                    ┌─────────────────┼─────────────────┐
                    │                 │                 │
                    ▼                 ▼                 ▼
              ┌───────────┐     ┌───────────┐     ┌───────────┐
              │ CLI       │     │ Web UI    │     │ REST API  │
              │           │     │           │     │           │
              │ Python    │     │ Streamlit │     │ FastAPI   │
              │ Script    │     │ Gradio    │     │           │
              └─────┬─────┘     └─────┬─────┘     └─────┬─────┘
                    │                 │                 │
                    └─────────────────┼─────────────────┘
                                      │
                                      ▼
                              User Gets Answer
                          with Citations & Sources
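
The diagram shows dense and sparse search each returning a top-20 list that is narrowed to 5-10 chunks, but it does not say how the two lists are combined before reranking. One common option is reciprocal rank fusion (RRF). The sketch below assumes RRF; the function name, the chunk ids, and the constant `k=60` are illustrative and not taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several best-first lists of chunk ids into one ranked list.

    rankings: list of lists of chunk ids, each ordered best-first.
    k: smoothing constant (60 is a commonly used value; an assumption here).
    top_n: how many fused results to keep.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it appears in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical result lists standing in for the dense and sparse searches.
dense_hits = ["c3", "c1", "c7", "c2"]
sparse_hits = ["c1", "c9", "c3", "c5"]
top_chunks = reciprocal_rank_fusion([dense_hits, sparse_hits], top_n=3)
print(top_chunks)
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, which is the behaviour hybrid retrieval is after. The fused list would then go to the cross-encoder reranker shown in section 6.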
┌───────────────────────────────────────────────────────────────────────────┐
│                           SUPPORTING COMPONENTS                           │
└───────────────────────────────────────────────────────────────────────────┘
┌─────────────────────┐   ┌─────────────────────┐   ┌───────────────────────┐
│ Monitoring &        │   │ Knowledge Graph     │   │ Caching Layer         │
│ Logging             │   │ (Optional)          │   │                       │
│                     │   │                     │   │ • Query cache         │
│ • MLflow            │   │ • Neo4j             │   │ • Embedding cache     │
│ • W&B               │   │ • NetworkX          │   │ • LLM response cache  │
│ • Prometheus        │   │ • Entity graphs     │   │ • Redis/Memcached     │
└─────────────────────┘   └─────────────────────┘   └───────────────────────┘
┌─────────────────────┐   ┌─────────────────────┐   ┌───────────────────────┐
│ Security &          │   │ Evaluation          │   │ Data Pipeline         │
│ Compliance          │   │ Metrics             │   │                       │
│                     │   │                     │   │ • Apache Airflow      │
│ • De-identification │   │ • Precision@k       │   │ • Spark jobs          │
│ • HIPAA compliance  │   │ • Medical accuracy  │   │ • Batch processing    │
│ • Access control    │   │ • Latency           │   │ • ETL workflows       │
└─────────────────────┘   └─────────────────────┘   └───────────────────────┘
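
The Evaluation Metrics box lists Precision@k without defining it. As a minimal sketch, Precision@k is the fraction of the first k retrieved chunks that a reviewer judged relevant. The function name and the chunk ids below are hypothetical, and dividing by k even when fewer than k chunks come back is a convention assumed here.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk ids that are relevant.

    retrieved: chunk ids ordered best-first (output of the retriever).
    relevant: set of chunk ids judged relevant for the query.
    k: cutoff; must be a positive integer.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Assumed convention: divide by k, so short result lists are penalised.
    return hits / k


# Hypothetical retrieval output and relevance judgements for one query.
retrieved = ["c1", "c3", "c9", "c4", "c8"]
relevant = {"c1", "c9", "c2"}
score = precision_at_k(retrieved, relevant, k=5)
print(f"Precision@5 = {score:.2f}")
```

Averaging this value over a set of labelled queries gives a single retrieval-quality number that can be tracked alongside the latency metric listed in the same box.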
═════════════════════════════════════════════════════════════════════════════
                               END-TO-END FLOW
═════════════════════════════════════════════════════════════════════════════
