CXR-VLM — Pipeline Overview

Vision-Language Model for Chest X-ray Interpretation  |  Based on RaDialog

1. Data Pipeline
flowchart LR
    subgraph SRC["📦 Data sources (PhysioNet)"]
        direction TB
        D1["MIMIC-CXR\n(reports/*.txt)"]
        D2["MIMIC-CXR-JPG\n(files/**/*.jpg)"]
        D3["MIMIC-Ext-CXR-QBA\n(*.json)"]
    end
    subgraph BUILD["🔧 build_instruct_json()"]
        direction TB
        P1["Parse findings\nfrom report .txt"]
        P2["Parse impression\nfrom report .txt"]
        P3["Parse Q&A pairs\nfrom JSON"]
    end
    subgraph JSON["📄 Unified Instruction JSON"]
        direction TB
        J1["{ task: findings,\n image_path, target }"]
        J2["{ task: impression,\n image_path, target }"]
        J3["{ task: vqa,\n image_path, question, target }"]
    end
    D1 --> P1 & P2
    D3 --> P3
    D2 -.->|"image_path ref"| J1 & J2 & J3
    P1 --> J1
    P2 --> J2
    P3 --> J3
    style SRC fill:#1e3a5f,stroke:#38bdf8,color:#e2e8f0
    style BUILD fill:#1e3a2a,stroke:#4ade80,color:#e2e8f0
    style JSON fill:#3b2a1e,stroke:#f97316,color:#e2e8f0
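The parsing step in the diagram can be illustrated with a minimal sketch. It assumes reports mark their sections with `FINDINGS:` and `IMPRESSION:` headers at the start of a line; real MIMIC-CXR reports vary, and the actual `build_instruct_json()` is not shown in this overview. The helper names `extract_sections` and `build_records` and the example path are hypothetical.

```python
import re

# Assumption: sections start with an upper-case header at the beginning of a line.
_SECTION_RE = re.compile(r"^[ \t]*(FINDINGS|IMPRESSION):", re.MULTILINE)


def extract_sections(report_text):
    """Return a dict such as {'findings': ..., 'impression': ...} for sections found."""
    sections = {}
    matches = list(_SECTION_RE.finditer(report_text))
    for i, match in enumerate(matches):
        # A section runs until the next recognized header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        body = " ".join(report_text[match.end():end].split())  # collapse whitespace
        if body:
            sections[match.group(1).lower()] = body
    return sections


def build_records(report_text, image_path, qa_pairs=()):
    """Build unified instruction records for one image (hypothetical helper)."""
    records = []
    for task, target in extract_sections(report_text).items():
        records.append({"task": task, "image_path": image_path, "target": target})
    for question, answer in qa_pairs:
        records.append({
            "task": "vqa",
            "image_path": image_path,
            "question": question,
            "target": answer,
        })
    return records


if __name__ == "__main__":
    report = (
        "EXAMINATION: CHEST\n\n"
        "FINDINGS: The lungs are clear.\nNo effusion.\n\n"
        "IMPRESSION: No acute process.\n"
    )
    for rec in build_records(report, "files/example.jpg",
                             [("Is there pleural effusion?", "No.")]):
        print(rec)
```

Reports that lack one of the two sections simply produce fewer records, which keeps the unified JSON free of empty targets.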
2. Dataset & Prompt Construction
flowchart TD
    JSON2["📄 Instruction JSON\n(image_path, task, target, question?)"]
    subgraph DS["CXRInstructDataset.__getitem__()"]
        direction LR
        IMG["_load_image()\n→ tensor C×H×W"]
        TMPL["build_training_sample(task)\nprompt_templates.py"]
        TOK["_tokenize(prompt, target)\nlabels: -100 on prompt tokens"]
    end
    subgraph PROMPTS["10 prompt variants per task (random.choice)"]
        direction TB
        F["Task: findings\n→ random.choice(FINDINGS_PROMPTS)\ne.g. 'Generate the findings section...'"]
        I["Task: impression\n→ random.choice(IMPRESSION_PROMPTS)\ne.g. 'Provide a concise clinical impression...'"]
        V["Task: vqa\n→ question taken directly from the dataset\ne.g. 'Is there pleural effusion?'"]
    end
    subgraph FMT["Vicuna v1.1 Format"]
        direction TB
        FMT1["SYSTEM: You are a radiologist...\nUSER: <image>\n[Predicted Findings: ...]\n{instruction}\nASSISTANT:"]
    end
    OUT["Batch output:\n{ image, input_ids, attention_mask, labels, task }"]
    JSON2 --> DS
    DS --> TMPL
    TMPL --> PROMPTS
    PROMPTS --> FMT
    FMT --> TOK
    IMG --> OUT
    TOK --> OUT
    style DS fill:#1e3a5f,stroke:#38bdf8,color:#e2e8f0
    style PROMPTS fill:#2a1e3b,stroke:#a78bfa,color:#e2e8f0
    style FMT fill:#3b2a1e,stroke:#f97316,color:#e2e8f0
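The prompt construction and label masking shown above can be sketched as follows. This is an illustration under stated assumptions, not the project's `prompt_templates.py` or `_tokenize()`: the template strings, the abbreviated system text, and the whitespace `ToyTokenizer` are hypothetical stand-ins. The one property taken from the diagram is that prompt positions get the label -100, so only target tokens contribute to the loss.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the loss, as stated in the diagram

# Hypothetical stand-ins for the 10-variant lists in prompt_templates.py.
FINDINGS_PROMPTS = ["Generate the findings section for this chest X-ray."]
IMPRESSION_PROMPTS = ["Provide a concise clinical impression for this chest X-ray."]
SYSTEM_TEXT = "You are a radiologist."  # abbreviated; the full text is not given


def build_prompt(task, question=None, predicted_findings=None, rng=random):
    """Assemble a prompt in the SYSTEM / USER / ASSISTANT layout from the diagram."""
    if task == "findings":
        instruction = rng.choice(FINDINGS_PROMPTS)
    elif task == "impression":
        instruction = rng.choice(IMPRESSION_PROMPTS)
    elif task == "vqa":
        if not question:
            raise ValueError("vqa samples need a question")
        instruction = question  # used verbatim, no template
    else:
        raise ValueError(f"unknown task: {task}")
    user = "<image>\n"
    if predicted_findings:  # optional structured labels prepended to the prompt
        user += f"Predicted Findings: {predicted_findings}\n"
    user += instruction
    return f"SYSTEM: {SYSTEM_TEXT}\nUSER: {user}\nASSISTANT:"


class ToyTokenizer:
    """Whitespace tokenizer with a growing vocabulary (stand-in for the LLM tokenizer)."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def make_labels(tokenizer, prompt, target):
    """Concatenate prompt and target ids; mask the prompt part of the labels."""
    prompt_ids = tokenizer.encode(prompt)
    target_ids = tokenizer.encode(target)
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels


if __name__ == "__main__":
    tok = ToyTokenizer()
    prompt = build_prompt("vqa", question="Is there pleural effusion?")
    ids, labels = make_labels(tok, prompt, "Yes, mild pleural effusion.")
    print(prompt)
    print(labels)
```

Tokenizing prompt and target separately, as done here, makes the mask boundary explicit; a real subword tokenizer may merge tokens across that boundary, so the actual implementation may compute the prompt length differently.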
3. Model Architecture & Forward Pass
flowchart TD
    subgraph INPUT["Input"]
        IMG3["Chest X-ray image\n(B, C, 448, 448)"]
        TEXT["Tokenized Prompt\n(B, seq_len)"]
    end
    subgraph ENC["🔵 BioViL-T Encoder: FROZEN\nimage_encoder.py"]
        E1["Pretrained on CXR + radiology reports\n(Microsoft hi-ml-multimodal)"]
        E2["patch_features\n(B, num_patches, 768)"]
        E1 --> E2
    end
    subgraph PROJ["🟢 MLP Projection: TRAINED\nprojection.py"]
        direction TB
        PR1["Learnable query tokens\n(1, 32, 768)"]
        PR2["Cross-Attention\nquery ← patches"]
        PR3["MLP: 768 → 1024 → 4096"]
        PR4["image_tokens\n(B, 32, 4096)"]
        PR1 --> PR2 --> PR3 --> PR4
    end
    subgraph CHEX["🟡 CheXpert Classifier: FROZEN (optional)\nchexpert_classifier.py"]
        C1["Structured labels\ne.g. 'Pleural Effusion: Positive'"]
    end
    subgraph LLM["🔴 Vicuna-7B + LoRA: TRAINED (LoRA only)\ncxr_vlm.py"]
        direction TB
        L1["embed_tokens(input_ids)\n→ text_embeds (B, seq_len, 4096)"]
        L2["_inject_image_tokens()\nReplace <image> token → 32 visual tokens"]
        L3["LlamaForCausalLM.forward()\ninputs_embeds + attention_mask + labels"]
        L4["Cross-Entropy Loss\n(computed only on target tokens, prompt = -100)"]
        L1 --> L2 --> L3 --> L4
    end
    IMG3 --> ENC --> PROJ
    TEXT --> LLM
    PROJ --> L2
    CHEX -.->|prepend to prompt text| TEXT
    style ENC fill:#1e2e4a,stroke:#60a5fa,color:#e2e8f0
    style PROJ fill:#1e3a2a,stroke:#4ade80,color:#e2e8f0
    style CHEX fill:#3b3a1e,stroke:#facc15,color:#e2e8f0
    style LLM fill:#3a1e1e,stroke:#f87171,color:#e2e8f0
    style INPUT fill:#1e2433,stroke:#64748b,color:#e2e8f0
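The image-token injection step can be illustrated on a single sample with plain Python lists in place of tensors. This is a sketch, not the actual `_inject_image_tokens()` in `cxr_vlm.py`: the function signature, the placeholder id, and the choice to give image positions the label -100 are assumptions. What it shows is the bookkeeping the diagram implies: one `<image>` placeholder is replaced by 32 visual tokens, so the sequence grows by 31 positions and the labels and attention mask must grow with it.

```python
IGNORE_INDEX = -100
NUM_IMAGE_TOKENS = 32  # from the diagram: image_tokens has shape (B, 32, 4096)


def inject_image_tokens(input_ids, text_embeds, image_embeds, labels, image_token_id):
    """Splice image embeddings in place of the single <image> placeholder.

    All arguments describe one sample: input_ids and labels are lists of ints,
    text_embeds and image_embeds are lists of embedding vectors.
    """
    if input_ids.count(image_token_id) != 1:
        raise ValueError("expected exactly one <image> placeholder")
    pos = input_ids.index(image_token_id)
    merged_embeds = text_embeds[:pos] + image_embeds + text_embeds[pos + 1:]
    # Assumption: image positions never contribute to the loss.
    merged_labels = (
        labels[:pos] + [IGNORE_INDEX] * len(image_embeds) + labels[pos + 1:]
    )
    attention_mask = [1] * len(merged_embeds)
    return merged_embeds, merged_labels, attention_mask


if __name__ == "__main__":
    image_token_id = 99                      # hypothetical id for <image>
    input_ids = [1, 5, image_token_id, 7, 8]
    text_embeds = [[float(i)] for i in input_ids]   # 1-dim toy embeddings
    image_embeds = [[-1.0] for _ in range(NUM_IMAGE_TOKENS)]
    labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 7, 8]
    embeds, new_labels, mask = inject_image_tokens(
        input_ids, text_embeds, image_embeds, labels, image_token_id
    )
    print(len(input_ids), "->", len(embeds))
```

In a batched tensor implementation the same splice is typically followed by padding, since samples with different prompt lengths still need a common sequence length.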
4. Training Stages
flowchart LR
    subgraph S1["Stage 1: Alignment\nset_stage1_mode()"]
        direction TB
        S1A["🔵 BioViL-T → FROZEN"]
        S1B["🟢 MLP Projection → TRAIN ✓"]
        S1C["🔴 Vicuna-7B LoRA → FROZEN"]
        S1D["Goal: learn to align\nvisual ↔ text embedding space"]
    end
    subgraph S2["Stage 2: Fine-tuning\nset_stage2_mode()"]
        direction TB
        S2A["🔵 BioViL-T → FROZEN"]
        S2B["🟢 MLP Projection → TRAIN ✓"]
        S2C["🔴 Vicuna-7B LoRA → TRAIN ✓\n(LoRA adapters only, ~0.1% params)"]
        S2D["Goal: learn 3 tasks\nfindings / impression / VQA"]
    end
    S1 -->|"Projection converged"| S2
    style S1 fill:#1e3a2a,stroke:#4ade80,color:#e2e8f0
    style S2 fill:#3a1e2a,stroke:#f472b6,color:#e2e8f0
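The two freezing schedules can be expressed as a rule over parameter names. The sketch below is not the actual `set_stage1_mode()` / `set_stage2_mode()`; the prefixes `image_encoder.`, `projection.`, and `llm.` are hypothetical module names, and matching LoRA weights by the substring `lora_` is an assumption based on how LoRA libraries commonly name adapter matrices.

```python
def is_trainable(param_name, stage):
    """Decide whether a parameter should receive gradients in the given stage."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    if param_name.startswith("image_encoder."):
        return False                      # BioViL-T stays frozen in both stages
    if param_name.startswith("projection."):
        return True                       # projection trains in both stages
    if param_name.startswith("llm."):
        # Base LLM weights are always frozen; adapters train only in stage 2.
        return stage == 2 and "lora_" in param_name
    return False                          # anything unrecognized stays frozen


if __name__ == "__main__":
    names = [
        "image_encoder.backbone.conv1.weight",
        "projection.mlp.0.weight",
        "llm.layers.0.q_proj.weight",
        "llm.layers.0.q_proj.lora_A.weight",
    ]
    for stage in (1, 2):
        print(stage, [n for n in names if is_trainable(n, stage)])
    # With a PyTorch model, the rule would typically be applied as:
    #   for name, param in model.named_parameters():
    #       param.requires_grad = is_trainable(name, stage)
```

Defaulting unrecognized parameters to frozen is a conservative choice: a renamed module then fails to train, which is easier to notice than an encoder that silently starts updating.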
5. Inference & Evaluation
flowchart LR
    subgraph INF["generate() in cxr_vlm.py"]
        direction TB
        I1["Encode image → 32 visual tokens"]
        I2["Build prompt for the task"]
        I3["llm.generate()\ngreedy / beam search"]
        I4["Split at 'ASSISTANT:'\n→ response text"]
        I1 --> I2 --> I3 --> I4
    end
    subgraph EVAL["Evaluation (evaluation/)"]
        direction TB
        EV1["Findings / Impression:\nBLEU-4, ROUGE-L, BERTScore, ClinicalF1"]
        EV2["VQA:\nAccuracy, F1"]
    end
    subgraph OUT2["Output per task"]
        direction TB
        O1["findings: 'The lungs are clear...'"]
        O2["impression: 'No acute process.'"]
        O3["vqa: 'Yes, mild pleural effusion.'"]
    end
    INF --> OUT2
    OUT2 --> EVAL
    style INF fill:#1e3a5f,stroke:#38bdf8,color:#e2e8f0
    style EVAL fill:#1e3a2a,stroke:#4ade80,color:#e2e8f0
    style OUT2 fill:#3b2a1e,stroke:#f97316,color:#e2e8f0
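Two of the steps above lend themselves to a short sketch: splitting the decoded text at `ASSISTANT:` and scoring VQA answers. The code assumes the decoded string still contains the prompt, and it uses exact match after light normalization as the accuracy criterion; the metric definitions in `evaluation/` are not given in this overview, so `normalize` and `exact_match_accuracy` are hypothetical.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def extract_response(decoded, marker="ASSISTANT:"):
    """Return the text after the first marker, or the whole string if it is absent."""
    _, sep, tail = decoded.partition(marker)
    return (tail if sep else decoded).strip()


def normalize(text):
    """Lower-case, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    decoded = (
        "SYSTEM: You are a radiologist.\n"
        "USER: <image>\nIs there pleural effusion?\n"
        "ASSISTANT: Yes, mild pleural effusion."
    )
    answer = extract_response(decoded)
    print(answer)
    print(exact_match_accuracy([answer, "No"], ["yes mild pleural effusion", "Yes."]))
```

Splitting at the first marker keeps any later occurrence of `ASSISTANT:` inside the generated answer intact; if a question or prepended label text could itself contain the marker, splitting on the last occurrence may be the safer variant.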
BioViL-T (frozen)
MLP Projection (trained)
CheXpert Classifier (frozen)
Vicuna-7B + LoRA (LoRA trained)
Data / Output