Sukrati committed on
Commit 345576d · 0 Parent(s):

Deploy MedRAG to Hugging Face Space v4

.dockerignore ADDED
```
.git
__pycache__/
*.pyc
*.pyo
*.pyd
.DS_Store
data/
index/
*.zip
render.yaml
```
.gitignore ADDED
```
.DS_Store
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.ipynb_checkpoints

# Data and indexes
data/
index/
*.zip
embeddings_heatmap.png
embeddings_pca.png
embeddings_raw.png

# Large datasets
chexpert_full/
```
.streamlit/config.toml ADDED
```toml
[server]
enableCORS = false
enableXsrfProtection = false
maxUploadSize = 200
headless = true
```
Dockerfile ADDED
```dockerfile
FROM python:3.10-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PIP_NO_CACHE_DIR=1
ENV DATA_DIR=/tmp/medrag_data
ENV HF_HOME=/tmp/hf_cache
ENV PREFETCH_MODEL=1

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    && rm -rf /var/lib/apt/lists/*

COPY requirements-space.txt ./
RUN pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision \
    && pip install -r requirements-space.txt

COPY . .

RUN chmod +x /app/start.sh

EXPOSE 7860

CMD ["/app/start.sh"]
```
MedRAG.ipynb ADDED
The diff for this file is too large to render.
README.md ADDED
---
title: MedRAG Diagnostic Assistant
emoji: 🩺
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
pinned: false
---

# MedRAG

MedRAG is a multimodal chest X-ray retrieval and diagnostic-assistance app built on:
- BiomedCLIP for image embeddings and zero-shot disease scoring
- FAISS for similar-case retrieval
- a crosscheck layer that combines classifier output with retrieved case evidence
- Streamlit for the application UI

The current app supports:
- chest X-ray upload
- sample-image testing
- similar-case retrieval from the indexed gallery
- zero-shot disease probability ranking
- retrieval-supported clinical assessment text
- Hugging Face Spaces deployment through Docker
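The zero-shot disease probability ranking above reduces to a softmax over scaled image-text cosine similarities. A minimal sketch of that scoring step, in pure Python with made-up similarity values standing in for real BiomedCLIP outputs:

```python
import math

def zero_shot_probs(similarities: list[float], scale: float = 100.0) -> list[float]:
    """Softmax over scaled cosine similarities, one score per disease prompt.

    `scale` mirrors the *100 temperature applied in app.py before softmax.
    """
    m = max(similarities)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(scale * (s - m)) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative similarities for three prompts; the outputs sum to 1.
probs = zero_shot_probs([0.31, 0.28, 0.25])
```

Because the similarities are multiplied by 100, even small gaps in cosine similarity turn into sharply peaked probabilities.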

## Current App Flow

1. The user uploads a chest X-ray or selects a sample image.
2. The app encodes the image with BiomedCLIP.
3. FAISS retrieves the most visually similar historical cases.
4. BiomedCLIP scores 14 CheXpert disease prompts.
5. A crosscheck step combines retrieval agreement with classifier confidence.
6. The app renders:
   - generated clinical assessment
   - ranked diagnoses
   - top disease probabilities
   - similar historical cases
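The crosscheck in step 5 can be sketched as an even blend of classifier probability and gallery agreement; this mirrors the 0.5/0.5 weighting used in `app.py`, though the standalone function name here is illustrative:

```python
def crosscheck_confidence(classifier_prob: float, matching_cases: int, total_cases: int) -> float:
    """Blend a 0-100 classifier probability with the fraction of retrieved
    cases whose labels agree, weighting each half equally."""
    gallery_support = matching_cases / max(total_cases, 1)  # avoid divide-by-zero
    return round((classifier_prob / 100 * 0.5 + gallery_support * 0.5) * 100, 1)

# e.g. classifier says 40%, 3 of 5 similar cases agree -> 50.0
```

A diagnosis the classifier is unsure about can therefore still rank highly if most retrieved neighbours carry the same label, and vice versa.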

## Project Files

Core app:
- `app.py` - Streamlit UI and diagnosis pipeline
- `visual_search.py` - FAISS-backed visual search engine
- `download_assets.py` - downloads demo index/images and prefetches BiomedCLIP

Index/data tooling:
- `gallery_builder.py` - build a FAISS index from chest X-ray images
- `data_downloader.py` - download source datasets
- `rewrite_metadata.py` - rewrite metadata filepaths for deployment

Research/demo:
- `MedRAG.ipynb` - notebook containing the retrieval, zero-shot classification, and crosscheck logic the app was ported from

Deployment:
- `Dockerfile` - Hugging Face Spaces container build
- `start.sh` - startup entrypoint for Spaces
- `requirements-space.txt` - CPU-friendly dependencies for Spaces
- `render.yaml` - older Render deployment config

## Hugging Face Spaces

This repo is configured for a Docker Space.

### Deploy steps

1. Create a new Hugging Face Space.
2. Choose `Docker`.
3. Push this repo to the Space remote.
4. Let the Space build and start.

The Space startup does the following:
- installs CPU-only PyTorch
- downloads the public `index.zip` and `images.zip`
- prefetches the BiomedCLIP model
- starts Streamlit on port `7860`

## Local Run

Install dependencies:

```bash
pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision
pip install -r requirements-space.txt
```

Run the app:

```bash
python download_assets.py
streamlit run app.py
```

## Data Notes

The deployed demo uses a reduced subset of CheXpert so it can run on free CPU infrastructure.

Assets are pulled from public Google Drive links by default:
- FAISS index archive
- subset image archive

If needed, override them with:
- `GDRIVE_INDEX_URL`
- `GDRIVE_IMAGES_URL`

Optional environment variables:
- `DATA_DIR`
- `HF_HOME`
- `PREFETCH_MODEL`
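For example, to point the app at a custom data directory and skip the model prefetch (the values below are illustrative, not required settings):

```shell
export DATA_DIR=/tmp/medrag_data   # where index/ and images/ get unpacked
export HF_HOME=/tmp/hf_cache       # Hugging Face model cache location
export PREFETCH_MODEL=0            # skip the BiomedCLIP prefetch in download_assets.py
```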

## Limitations

- The app is a diagnostic aid, not a clinical decision system.
- Free-tier hosting will have slow cold starts.
- The generated assessment is a rule-based synthesis of model scores and retrieval support, not a physician-grade interpretation.
- The original project plan referenced a larger multi-agent/LLM flow; the current deployed app implements the retrieval + classifier + crosscheck path from the notebook.
app.py ADDED
```python
import os
import random
import shutil
from collections import Counter
from pathlib import Path

import streamlit as st
import torch
from PIL import Image

from visual_search import VisualSearchEngine


APP_TITLE = "Multimodal Medical RAG Diagnostic Assistant"
MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
DISEASE_PROMPTS = {
    "No Finding": "Chest X-ray with no abnormality, normal findings",
    "Enlarged Cardiomediastinum": "Chest X-ray showing enlarged cardiomediastinum",
    "Cardiomegaly": "Chest X-ray showing cardiomegaly, enlarged heart",
    "Lung Opacity": "Chest X-ray showing lung opacity",
    "Lung Lesion": "Chest X-ray showing lung lesion or mass",
    "Edema": "Chest X-ray showing pulmonary edema, fluid in lungs",
    "Consolidation": "Chest X-ray showing consolidation in lung",
    "Pneumonia": "Chest X-ray showing pneumonia, lung infection",
    "Atelectasis": "Chest X-ray showing atelectasis, collapsed lung",
    "Pneumothorax": "Chest X-ray showing pneumothorax, air in pleural space",
    "Pleural Effusion": "Chest X-ray showing pleural effusion, fluid around lung",
    "Pleural Other": "Chest X-ray showing pleural abnormality",
    "Fracture": "Chest X-ray showing rib fracture or bone fracture",
    "Support Devices": "Chest X-ray showing support devices, tubes or lines",
}
INPUT_GUARDRAIL_PROMPTS = {
    "Chest X-ray": "A diagnostic chest X-ray radiograph showing the thorax and lungs",
    "Portrait Photo": "A portrait photograph of a person or celebrity",
    "Animal Photo": "A natural photograph of an animal or pet",
    "Document Screenshot": "A screenshot of a document, website, or computer interface",
    "Natural Image": "A normal everyday color photograph of a scene or object",
}
SYNONYMS = {
    "Pleural Effusion": ["pleural fluid", "fluid around lung", "effusion"],
    "Cardiomegaly": ["enlarged heart", "cardiac enlargement"],
    "Pneumonia": ["lung infection", "consolidation"],
    "Edema": ["fluid in lungs", "pulmonary edema"],
    "Atelectasis": ["collapsed lung", "lung collapse"],
    "Lung Opacity": ["opacity", "haziness", "infiltrate"],
    "No Finding": ["normal", "no abnormality", "clear"],
}


def _get_paths() -> tuple[Path, Path]:
    repo_index = Path("index").resolve()
    data_dir = Path(os.getenv("DATA_DIR", "/tmp/medrag_data")).resolve()
    index_dir = Path(os.getenv("INDEX_DIR", data_dir / "index")).resolve()
    return repo_index, index_dir


def _ensure_index_available() -> Path:
    repo_index, index_dir = _get_paths()
    if index_dir.exists():
        return index_dir
    if repo_index.exists():
        index_dir.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(repo_index, index_dir)
        return index_dir
    raise FileNotFoundError("FAISS index not found. Expected at DATA_DIR/index or ./index")


@st.cache_resource(show_spinner=True)
def _load_engine() -> VisualSearchEngine:
    index_dir = _ensure_index_available()
    return VisualSearchEngine(index_dir=index_dir, device="auto", top_k=5)


@st.cache_resource(show_spinner=False)
def _load_text_features() -> tuple[list[str], torch.Tensor]:
    engine = _load_engine()
    tokenizer = __import__("open_clip").get_tokenizer(MODEL_ID)
    with torch.no_grad():
        tokens = tokenizer(list(DISEASE_PROMPTS.values())).to(engine.device)
        text_features = engine._model.encode_text(tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return list(DISEASE_PROMPTS.keys()), text_features


@st.cache_resource(show_spinner=False)
def _load_guardrail_features() -> tuple[list[str], torch.Tensor]:
    engine = _load_engine()
    tokenizer = __import__("open_clip").get_tokenizer(MODEL_ID)
    with torch.no_grad():
        tokens = tokenizer(list(INPUT_GUARDRAIL_PROMPTS.values())).to(engine.device)
        text_features = engine._model.encode_text(tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return list(INPUT_GUARDRAIL_PROMPTS.keys()), text_features


def _pick_sample_image(data_dir: Path) -> Path | None:
    images_dir = data_dir / "images"
    if not images_dir.exists():
        return None
    candidates = list(images_dir.glob("*.jpg")) + list(images_dir.glob("*.png")) + list(images_dir.glob("*.jpeg"))
    if not candidates:
        return None
    return random.choice(candidates)


@torch.no_grad()
def _predict_diseases(image: Image.Image) -> dict[str, float]:
    engine = _load_engine()
    disease_names, text_features = _load_text_features()
    tensor = engine._transform(image.convert("RGB")).unsqueeze(0).to(engine.device)
    image_features = engine._model.encode_image(tensor)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)
    probs = torch.softmax(similarities * 100, dim=0).detach().cpu().tolist()
    results = {
        disease_names[i]: round(float(probs[i]) * 100, 2)
        for i in range(len(disease_names))
    }
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))


@torch.no_grad()
def _validate_input_image(image: Image.Image) -> tuple[bool, dict[str, float]]:
    engine = _load_engine()
    labels, text_features = _load_guardrail_features()
    tensor = engine._transform(image.convert("RGB")).unsqueeze(0).to(engine.device)
    image_features = engine._model.encode_image(tensor)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)
    probs = torch.softmax(similarities * 100, dim=0).detach().cpu().tolist()
    scores = {labels[i]: round(float(probs[i]) * 100, 2) for i in range(len(labels))}
    chest_score = scores["Chest X-ray"]
    next_best = max(score for label, score in scores.items() if label != "Chest X-ray")
    is_valid = chest_score >= 55 and chest_score > next_best
    return is_valid, dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


def _labels_match(disease: str, label_str: str) -> bool:
    label_lower = label_str.lower()
    if disease.lower() in label_lower:
        return True
    return any(syn.lower() in label_lower for syn in SYNONYMS.get(disease, []))


def _crosscheck(similar_cases, disease_probs: dict[str, float]) -> list[dict]:
    top_diseases = list(disease_probs.keys())[:5]
    diagnosis = []
    total_cases = max(len(similar_cases), 1)

    for disease in top_diseases:
        llm_prob = disease_probs[disease]
        matching_cases = sum(1 for case in similar_cases if _labels_match(disease, case.labels))
        gallery_support = matching_cases / total_cases
        confidence = (llm_prob / 100 * 0.5) + (gallery_support * 0.5)
        if gallery_support >= 0.6 and llm_prob >= 20:
            status = "HIGH"
        elif gallery_support >= 0.3 or llm_prob >= 15:
            status = "MEDIUM"
        else:
            status = "LOW"
        diagnosis.append({
            "disease": disease,
            "llm_probability": llm_prob,
            "matching_cases": matching_cases,
            "total_cases": total_cases,
            "gallery_support": f"{matching_cases}/{total_cases} cases",
            "confidence": round(confidence * 100, 1),
            "status": status,
        })
    return sorted(diagnosis, key=lambda item: item["confidence"], reverse=True)


def _positive_labels(label_str: str) -> list[str]:
    positives = []
    for part in label_str.split(" | "):
        if ": Positive" in part:
            positives.append(part.split(":")[0])
    return positives


def _generate_assessment(diagnosis: list[dict], similar_cases) -> str:
    primary = diagnosis[0]
    top_positive_labels = Counter()
    for case in similar_cases:
        top_positive_labels.update(_positive_labels(case.labels))

    supporting_findings = ", ".join(label for label, _ in top_positive_labels.most_common(3)) or "no repeated positive findings"
    differential = ", ".join(item["disease"] for item in diagnosis[1:4])

    return f"""
## Primary Clinical Impression

Based on visual similarity retrieval and zero-shot disease classification, the leading impression is **{primary["disease"]}** with a combined confidence of **{primary["confidence"]}%**.

## Evidence Summary

- The classifier estimated **{primary["llm_probability"]}%** probability for {primary["disease"]}.
- The retrieval engine found **{primary["gallery_support"]}** similar cases supporting this diagnosis.
- The most repeated positive findings among retrieved cases were: **{supporting_findings}**.

## Differential Diagnosis

Alternative conditions to consider are **{differential}**. These remain relevant because visually similar cases include overlapping thoracic findings common across chest X-ray pathology.

## Clinical Note

This is a retrieval-supported decision aid, not a definitive medical diagnosis. Final interpretation should be confirmed by a radiologist or clinician.
""".strip()


def _run_analysis(image: Image.Image, top_k: int):
    engine = _load_engine()
    similar_cases = engine.search(image, top_k=top_k, load_images=False)
    disease_probs = _predict_diseases(image)
    diagnosis = _crosscheck(similar_cases, disease_probs)
    assessment = _generate_assessment(diagnosis, similar_cases)
    return similar_cases, disease_probs, diagnosis, assessment


def _render_similar_cases(similar_cases):
    st.markdown("### Similar Historical Cases")
    for idx, case in enumerate(similar_cases, start=1):
        cols = st.columns([1, 3])
        with cols[0]:
            if case.filepath and Path(case.filepath).exists():
                try:
                    st.image(Image.open(case.filepath).convert("RGB"), use_container_width=True)
                except Exception:
                    st.caption("Preview unavailable")
        with cols[1]:
            st.markdown(f"**#{idx} {case.filename}**")
            st.write(f"Similarity: {case.similarity:.3f}")
            positives = _positive_labels(case.labels)
            st.write(f"Confirmed findings: {', '.join(positives) if positives else 'None'}")


def main():
    st.set_page_config(page_title=APP_TITLE, layout="wide")
    st.title(APP_TITLE)
    st.caption(
        "Upload a chest X-ray. The system retrieves similar historical cases and generates a retrieval-supported differential diagnosis."
    )

    with st.sidebar:
        st.markdown("**Index Status**")
        try:
            index_dir = _ensure_index_available()
            st.write(f"Index dir: `{index_dir}`")
            data_dir = index_dir.parent
        except FileNotFoundError as exc:
            st.error(str(exc))
            return

        top_k = st.slider("Retrieved Cases", min_value=3, max_value=20, value=5, step=1)
        if st.button("Use Sample Image"):
            st.session_state["sample_path"] = str(_pick_sample_image(data_dir) or "")
        if st.button("Clear"):
            st.session_state.pop("sample_path", None)
            st.session_state.pop("analysis_ready", None)
            st.rerun()
        st.caption("First analysis can still be slow on Render free tier.")

    uploaded = st.file_uploader("Upload Patient Chest X-Ray", type=["png", "jpg", "jpeg"])
    sample_path = st.session_state.get("sample_path")

    query_image = None
    if uploaded is not None:
        query_image = Image.open(uploaded).convert("RGB")
        st.session_state["analysis_ready"] = True
    elif sample_path:
        query_image = Image.open(sample_path).convert("RGB")
        st.session_state["analysis_ready"] = True

    left, right = st.columns([1.05, 1.25])

    with left:
        st.markdown("### Input X-Ray")
        if query_image is not None:
            st.image(query_image, use_container_width=True)
        else:
            st.info("Upload an image or use the sample button.")

    with right:
        st.markdown("### Generated Clinical Assessment")
        if query_image is None:
            st.info("Run an analysis to generate the assessment.")
            return

        if st.button("Submit", type="primary") or st.session_state.get("analysis_ready"):
            with st.spinner("Running retrieval, classification, and crosscheck..."):
                is_valid_xray, input_scores = _validate_input_image(query_image)
                if not is_valid_xray:
                    st.error("This tool only supports chest X-ray images. Please upload a chest radiograph.")
                    st.markdown("### Input Validation")
                    for label, score in list(input_scores.items())[:3]:
                        st.write(f"{label}: {score}%")
                    st.session_state["analysis_ready"] = False
                    return
                similar_cases, disease_probs, diagnosis, assessment = _run_analysis(query_image, top_k)

            st.markdown(assessment)
            st.markdown("### Ranked Diagnoses")
            for item in diagnosis:
                st.write(
                    f"**{item['disease']}** | classifier {item['llm_probability']}% | "
                    f"gallery {item['gallery_support']} | confidence {item['confidence']}% [{item['status']}]"
                )
            st.markdown("### Top Disease Probabilities")
            for disease, prob in list(disease_probs.items())[:5]:
                st.write(f"{disease}: {prob}%")
            _render_similar_cases(similar_cases)
            st.session_state["analysis_ready"] = False


if __name__ == "__main__":
    main()
```
data_downloader.py ADDED
```python
"""
data_downloader.py
──────────────────
Downloads the NIH ChestX-ray14 dataset sample (5,606 images, ~1.2 GB).
This is the public domain dataset used to build the visual_db.index.

The NIH dataset contains 14 disease labels per image in the CSV metadata:
    Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule,
    Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis,
    Pleural_Thickening, Hernia (plus "No Finding")

Usage:
    python data_downloader.py --output_dir ./data
"""

import os
import sys
import time
import zipfile
import argparse
import requests
import pandas as pd
from pathlib import Path
from tqdm import tqdm

# ── NIH ChestX-ray14 public download URLs ─────────────────────────────────────
# Source: https://nihcc.app.box.com/v/ChestXray-NIHCC
# The NIH provides 12 batch ZIPs + 1 metadata CSV.
# We use only the FIRST batch (images_001.tar.gz → ~1.1 GB, 4,999 images)
# for a fast bootstrap. Add more batches for a larger gallery.

NIH_METADATA_URL = (
    "https://raw.githubusercontent.com/ieee8023/covid-chestxray-dataset/"
    "master/metadata.csv"  # placeholder – real URL below
)

# Real NIH metadata (hosted on Kaggle mirror for convenience)
NIH_KAGGLE_METADATA = "https://raw.githubusercontent.com/mlmed/torchxrayvision/master/torchxrayvision/data_dicts/nih_chest_xray_dict.json"

# ── Open-I (Indiana University) – ALWAYS freely available, no login ───────────
# 7,470 frontal X-rays ~900 MB
OPENI_BASE = "https://openi.nlm.nih.gov/imgs/collections/"
OPENI_ARCHIVE = "NLMCXR_png.tgz"  # full archive
OPENI_METADATA_URL = "https://openi.nlm.nih.gov/api/search?q=&it=x&m=1&n=500"

# ── Lightweight fallback: Kaggle chest-xray-pneumonia (1.15 GB) ───────────────
# https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
# Requires kaggle CLI auth token.

SUPPORTED_SOURCES = ["openi", "nih_sample", "local"]


def download_with_progress(url: str, dest_path: Path, chunk_size: int = 8192) -> bool:
    """Stream-download a file with a tqdm progress bar."""
    try:
        resp = requests.get(url, stream=True, timeout=60)
        resp.raise_for_status()
        total = int(resp.headers.get("content-length", 0))
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        with open(dest_path, "wb") as f, tqdm(
            total=total, unit="B", unit_scale=True,
            desc=dest_path.name, ncols=80
        ) as bar:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                bar.update(len(chunk))
        return True
    except Exception as e:
        print(f"[ERROR] Download failed: {e}")
        return False


def download_openi(output_dir: Path) -> Path:
    """
    Download Open-I Indiana University chest X-ray PNG collection.
    Returns the directory containing .png images.
    """
    import tarfile

    output_dir.mkdir(parents=True, exist_ok=True)
    archive_path = output_dir / OPENI_ARCHIVE
    images_dir = output_dir / "openi_images"

    if images_dir.exists() and any(images_dir.glob("*.png")):
        print(f"[SKIP] Open-I images already present at {images_dir}")
        return images_dir

    print("=" * 60)
    print("Downloading Open-I Indiana X-ray dataset (~900 MB)...")
    print("Source: National Library of Medicine (public domain)")
    print("=" * 60)

    url = OPENI_BASE + OPENI_ARCHIVE
    if not download_with_progress(url, archive_path):
        raise RuntimeError("Failed to download Open-I archive.")

    print(f"Extracting to {images_dir}...")
    images_dir.mkdir(exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=images_dir)

    archive_path.unlink()  # free disk space
    print(f"[OK] Open-I images extracted → {images_dir}")
    return images_dir


def download_nih_sample(output_dir: Path, max_images: int = 5000) -> Path:
    """
    Download NIH ChestX-ray14 batch_01 (~4,999 images, ~1.1 GB).
    Uses direct Box.com links published by NIH.
    """
    import tarfile

    NIH_BATCH1_URL = (
        "https://nihcc.box.com/shared/static/"
        "vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz"
    )

    output_dir.mkdir(parents=True, exist_ok=True)
    archive_path = output_dir / "nih_images_001.tar.gz"
    images_dir = output_dir / "nih_images"

    if images_dir.exists() and any(images_dir.glob("*.png")):
        print(f"[SKIP] NIH images already present at {images_dir}")
        return images_dir

    print("=" * 60)
    print("Downloading NIH ChestX-ray14 Batch 1 (~1.1 GB)...")
    print("Source: NIH Clinical Center (CC0 license)")
    print("=" * 60)

    if not download_with_progress(NIH_BATCH1_URL, archive_path):
        raise RuntimeError(
            "Failed to download NIH batch. "
            "Try manual download from: https://nihcc.app.box.com/v/ChestXray-NIHCC"
        )

    print(f"Extracting to {images_dir}...")
    images_dir.mkdir(exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()[:max_images]
        tar.extractall(path=images_dir, members=members)

    archive_path.unlink()
    print(f"[OK] NIH images extracted → {images_dir}")
    return images_dir


def download_nih_metadata(output_dir: Path) -> Path:
    """Download the NIH ChestX-ray14 labels CSV."""
    META_URL = (
        "https://raw.githubusercontent.com/mlmed/torchxrayvision/"
        "master/tests/test_data/nih_data_entry_small.csv"
    )
    # Full metadata (108,948 rows):
    FULL_META_URL = (
        "https://raw.githubusercontent.com/ieee8023/chexnet-dataset/"
        "master/Data_Entry_2017.csv"
    )
    dest = output_dir / "nih_metadata.csv"
    if dest.exists():
        return dest
    print("Downloading NIH metadata CSV...")
    download_with_progress(FULL_META_URL, dest)
    return dest


def scan_local_images(image_dir: Path) -> list[Path]:
    """Return all PNG/JPG images in a directory (recursive)."""
    extensions = {".png", ".jpg", ".jpeg"}
    images = [
        p for p in image_dir.rglob("*")
        if p.suffix.lower() in extensions
    ]
    print(f"[SCAN] Found {len(images):,} images in {image_dir}")
    return images


def build_metadata_csv(
    image_dir: Path,
    nih_csv_path: Path | None,
    output_path: Path
) -> pd.DataFrame:
    """
    Build a unified metadata CSV:
        filename | filepath | labels | source
    Works whether or not the NIH labels CSV is available.
    """
    images = scan_local_images(image_dir)

    rows = []
    label_lookup = {}

    if nih_csv_path and nih_csv_path.exists():
        df_nih = pd.read_csv(nih_csv_path)
        # NIH CSV cols: Image Index, Finding Labels, Patient ID, ...
        for _, row in df_nih.iterrows():
            label_lookup[row["Image Index"]] = row["Finding Labels"]

    for img_path in images:
        fname = img_path.name
        labels = label_lookup.get(fname, "Unknown")
        rows.append({
            "filename": fname,
            "filepath": str(img_path.resolve()),
            "labels": labels,
            "source": "NIH" if label_lookup else "Unknown",
        })

    df = pd.DataFrame(rows)
    df.to_csv(output_path, index=False)
    print(f"[OK] Metadata saved → {output_path} ({len(df):,} rows)")
    return df


def main():
    parser = argparse.ArgumentParser(
        description="Download chest X-ray dataset for gallery builder"
    )
    parser.add_argument(
        "--source", choices=SUPPORTED_SOURCES, default="openi",
        help="Dataset source (default: openi – no login required)"
    )
    parser.add_argument(
        "--output_dir", type=Path, default=Path("./data"),
        help="Directory to save images and metadata"
    )
    parser.add_argument(
        "--local_dir", type=Path, default=None,
        help="Path to existing local image folder (use with --source local)"
    )
    args = parser.parse_args()

    output_dir: Path = args.output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    if args.source == "openi":
        images_dir = download_openi(output_dir)
    elif args.source == "nih_sample":
        images_dir = download_nih_sample(output_dir)
        nih_meta = download_nih_metadata(output_dir)
        build_metadata_csv(images_dir, nih_meta, output_dir / "metadata.csv")
        return
    elif args.source == "local":
        if not args.local_dir:
            print("[ERROR] --local_dir is required when --source=local")
            sys.exit(1)
        images_dir = args.local_dir.resolve()
    else:
        print(f"[ERROR] Unknown source: {args.source}")
        sys.exit(1)

    build_metadata_csv(images_dir, None, output_dir / "metadata.csv")
    print("\n✅ Dataset ready. Next step:")
    print(f"  python gallery_builder.py --image_dir {images_dir} --output_dir ./index")


if __name__ == "__main__":
    main()
```
download_assets.py ADDED
@@ -0,0 +1,121 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """
+ download_assets.py
+ ------------------
+ Downloads index/ and image assets from Google Drive into DATA_DIR
+ (default /var/data, falling back to /tmp/medrag_data if that is not writable).
+
+ Env vars:
+     GDRIVE_INDEX_URL  - share link or direct download url for a zip/tar of index/
+     GDRIVE_IMAGES_URL - share link or direct download url for a zip/tar of images/
+     DATA_DIR          - base path (default: /var/data)
+ """
+
+ import os
+ import shutil
+ import tarfile
+ import zipfile
+ from pathlib import Path
+
+ import gdown
+ from huggingface_hub import snapshot_download
+
+
+ BIOMEDCLIP_REPO = "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
+ DEFAULT_INDEX_URL = "https://drive.google.com/uc?id=1NwEac0s_qah8L27RO-aFIz2PRfvXC9j0"
+ DEFAULT_IMAGES_URL = "https://drive.google.com/uc?id=1LAMNffnw3kFHZXvY9ySR62VxlaChRXyv"
+
+
+ def _download(url: str, dest: Path) -> Path:
+     dest.parent.mkdir(parents=True, exist_ok=True)
+     if dest.exists():
+         return dest
+     gdown.download(url, str(dest), quiet=False)
+     return dest
+
+
+ def _extract(archive: Path, target_dir: Path) -> None:
+     target_dir.mkdir(parents=True, exist_ok=True)
+     if zipfile.is_zipfile(archive):
+         with zipfile.ZipFile(archive, "r") as zf:
+             zf.extractall(target_dir)
+     elif tarfile.is_tarfile(archive) or archive.name.endswith((".tgz", ".tar.gz", ".gz")):
+         with tarfile.open(archive, "r:*") as tf:
+             tf.extractall(target_dir)
+     else:
+         raise ValueError(f"Unsupported archive: {archive}")
+
+
+ def _ensure_dir(path: Path) -> None:
+     path.mkdir(parents=True, exist_ok=True)
+
+
+ def _pick_data_dir() -> Path:
+     env_dir = os.getenv("DATA_DIR")
+     if env_dir:
+         return Path(env_dir).resolve()
+     for candidate in (Path("/var/data"), Path("/tmp/medrag_data")):
+         try:
+             candidate.mkdir(parents=True, exist_ok=True)
+             return candidate
+         except Exception:
+             continue
+     return Path("/tmp/medrag_data").resolve()
+
+
+ def _prefetch_biomedclip() -> None:
+     cache_dir = Path(os.getenv("HF_HOME", "/tmp/hf_cache")).resolve()
+     cache_dir.mkdir(parents=True, exist_ok=True)
+     snapshot_download(
+         repo_id=BIOMEDCLIP_REPO,
+         cache_dir=str(cache_dir),
+         local_dir_use_symlinks=False,
+     )
+     print(f"BiomedCLIP cached in {cache_dir}")
+
+
+ def main():
+     data_dir = _pick_data_dir()
+     index_dir = data_dir / "index"
+     images_dir = data_dir / "images"
+
+     index_url = os.getenv("GDRIVE_INDEX_URL", DEFAULT_INDEX_URL)
+     images_url = os.getenv("GDRIVE_IMAGES_URL", DEFAULT_IMAGES_URL)
+
+     _ensure_dir(data_dir)
+
+     if index_dir.exists() and any(index_dir.iterdir()):
+         print(f"Index already present at {index_dir}")
+     elif index_url:
+         archive = data_dir / "index_archive.zip"
+         archive = _download(index_url, archive)
+         _extract(archive, index_dir)
+         print(f"Index extracted to {index_dir}")
+     else:
+         print("GDRIVE_INDEX_URL not set; index not downloaded.")
+
+     if images_dir.exists() and any(images_dir.iterdir()):
+         print(f"Images already present at {images_dir}")
+     elif images_url:
+         archive = data_dir / "images_archive.zip"
+         archive = _download(images_url, archive)
+         _extract(archive, images_dir)
+         print(f"Images extracted to {images_dir}")
+     else:
+         print("GDRIVE_IMAGES_URL not set; images not downloaded.")
+
+     # Clean up downloaded archives once extracted
+     for f in [data_dir / "index_archive.zip", data_dir / "images_archive.zip"]:
+         if f.exists():
+             try:
+                 if f.is_file():
+                     f.unlink()
+                 else:
+                     shutil.rmtree(f)
+             except Exception:
+                 pass
+
+     if os.getenv("PREFETCH_MODEL", "1") == "1":
+         _prefetch_biomedclip()
+
+
+ if __name__ == "__main__":
+     main()
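Editor's note: the docstring says the env vars may hold either a share link or a direct download URL, but `gdown.download` without `fuzzy=True` expects the `uc?id=` form. A minimal sketch of normalizing a share link before calling gdown (the function name `normalize_gdrive_url` is a hypothetical helper, not part of this repo; passing `fuzzy=True` to gdown achieves the same internally):

```python
import re

def normalize_gdrive_url(url: str) -> str:
    # Turn https://drive.google.com/file/d/<ID>/view?usp=sharing
    # into the uc?id=<ID> form that gdown downloads directly.
    m = re.search(r"/file/d/([\w-]+)", url)
    if m:
        return f"https://drive.google.com/uc?id={m.group(1)}"
    return url  # already a direct uc?id= link (or a non-Drive URL)
```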
gallery_builder.py ADDED
@@ -0,0 +1,406 @@
+ """
+ gallery_builder.py
+ ──────────────────
+ Builds the visual search database for Medical X-ray RAG.
+
+ Pipeline:
+     1. Load all X-ray images from --image_dir
+     2. Encode each image β†’ 512-dim vector via BiomedCLIP
+     3. Normalize + store in FAISS IndexFlatIP (cosine similarity via dot product)
+     4. Save: visual_db.index (FAISS binary)
+              metadata.json   (id β†’ {filename, filepath, labels, idx})
+              embeddings.npy  (raw numpy array, optional backup)
+
+ BiomedCLIP:
+     microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
+     Trained on 15M biomedical image-caption pairs from PubMed Central.
+     Zero-shot performance on CheXpert = 0.85+ AUC (no fine-tuning needed).
+
+ Usage:
+     python gallery_builder.py \
+         --image_dir ./data/openi_images \
+         --output_dir ./index \
+         --batch_size 64 \
+         --device cpu
+
+     # Resume interrupted build:
+     python gallery_builder.py --image_dir ./data/openi_images --resume
+
+ Output files:
+     ./index/visual_db.index   ← FAISS binary index
+     ./index/metadata.json     ← id β†’ {filename, filepath, labels}
+     ./index/embeddings.npy    ← (N, 512) float32 array
+     ./index/build_stats.json  ← timing + counts
+ """
+
+ import os
+ import json
+ import time
+ import argparse
+ import logging
+ import numpy as np
+ from pathlib import Path
+ from typing import Optional
+
+ import torch
+ from torch.utils.data import Dataset, DataLoader
+ from PIL import Image
+ import faiss
+ import open_clip
+ from tqdm import tqdm
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s %(levelname)-7s %(message)s",
+     datefmt="%H:%M:%S",
+ )
+ log = logging.getLogger(__name__)
+
+ # ── Constants ──────────────────────────────────────────────────────────────────
+ BIOMEDCLIP_MODEL = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
+ EMBED_DIM = 512
+ SUPPORTED_EXTS = {".png", ".jpg", ".jpeg", ".dcm"}  # PIL cannot decode DICOM natively; .dcm files are skipped as unreadable
+ INDEX_FILE = "visual_db.index"
+ METADATA_FILE = "metadata.json"
+ EMBEDDINGS_FILE = "embeddings.npy"
+ STATS_FILE = "build_stats.json"
+
+
+ # ── Dataset ────────────────────────────────────────────────────────────────────
+ class XRayDataset(Dataset):
+     """
+     Lazy-loading dataset for chest X-ray images.
+     Applies BiomedCLIP preprocessing (resize 224, normalize).
+     Skips corrupt/unreadable files gracefully.
+     """
+
+     def __init__(
+         self,
+         image_paths: list[Path],
+         transform,
+         metadata_csv_path: Optional[Path] = None,
+     ):
+         self.paths = image_paths
+         self.transform = transform
+         self.label_map: dict[str, str] = {}
+
+         # Optional: load NIH/CheXpert labels CSV
+         if metadata_csv_path and metadata_csv_path.exists():
+             import pandas as pd
+             df = pd.read_csv(metadata_csv_path)
+             if "filename" in df.columns and "labels" in df.columns:
+                 self.label_map = dict(zip(df["filename"], df["labels"].fillna("Unknown")))
+
+     def __len__(self):
+         return len(self.paths)
+
+     def __getitem__(self, idx: int):
+         path = self.paths[idx]
+         try:
+             img = Image.open(path).convert("RGB")
+             tensor = self.transform(img)
+             label = self.label_map.get(path.name, "Unknown")
+             return tensor, str(path), label, True  # (tensor, path, label, valid)
+         except Exception as e:  # UnidentifiedImageError, OSError, transform failures
+             log.warning(f"Skipping corrupt image: {path.name} ({e})")
+             # Return a zero tensor so DataLoader batch stays uniform
+             dummy = torch.zeros(3, 224, 224)
+             return dummy, str(path), "CORRUPT", False
+
+
+ def collate_skip_corrupt(batch):
+     """Custom collate: filter out corrupt images before batching."""
+     valid = [(t, p, l) for t, p, l, ok in batch if ok]
+     if not valid:
+         return None
+     tensors, paths, labels = zip(*valid)
+     return torch.stack(tensors), list(paths), list(labels)
+
+
+ # ── Model loader ───────────────────────────────────────────────────────────────
+ def load_biomedclip(device: str):
+     """
+     Load BiomedCLIP vision encoder from HuggingFace hub.
+     Returns (model, transform) where model outputs 512-dim image embeddings.
+     """
+     log.info("Loading BiomedCLIP from HuggingFace hub (first run downloads ~350 MB)...")
+     try:
+         model, _, transform = open_clip.create_model_and_transforms(
+             BIOMEDCLIP_MODEL
+         )
+         model = model.to(device).eval()
+         log.info(f"BiomedCLIP loaded βœ“ device={device}")
+         return model, transform
+     except Exception as e:
+         log.error(f"Failed to load BiomedCLIP: {e}")
+         log.error("Ensure open-clip-torch is installed: pip install open-clip-torch")
+         raise
+
+
+ # ── Embedding engine ───────────────────────────────────────────────────────────
+ @torch.no_grad()
+ def encode_batch(model, image_tensors: torch.Tensor, device: str) -> np.ndarray:
+     """Encode a batch of image tensors β†’ L2-normalized embeddings (N, 512)."""
+     image_tensors = image_tensors.to(device)
+     features = model.encode_image(image_tensors)
+     # L2 normalize β†’ cosine similarity = dot product
+     features = features / features.norm(dim=-1, keepdim=True)
+     return features.cpu().numpy().astype(np.float32)
+
+
+ # ── FAISS index builder ────────────────────────────────────────────────────────
+ def build_faiss_index(embeddings: np.ndarray) -> faiss.Index:
+     """
+     Build FAISS IndexFlatIP (inner product = cosine similarity after L2-norm).
+     For galleries > 10K images, switches to IndexIVFFlat for faster search.
+     """
+     n, d = embeddings.shape
+     log.info(f"Building FAISS index ({n:,} vectors Γ— {d} dims)")
+
+     if n < 10_000:
+         # Exact search β€” best for < 10K images
+         index = faiss.IndexFlatIP(d)
+     else:
+         # Approximate search β€” needed for large galleries
+         nlist = min(256, n // 39)  # FAISS heuristic: ~39 training points per cell
+         quantizer = faiss.IndexFlatIP(d)
+         index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
+         log.info(f"Training IVF index with nlist={nlist}...")
+         index.train(embeddings)
+         index.nprobe = 16  # search 16 cells at query time (accuracy vs speed)
+
+     index.add(embeddings)
+     log.info(f"FAISS index built βœ“ total vectors: {index.ntotal:,}")
+     return index
+
+
+ # ── Resume support ─────────────────────────────────────────────────────────────
+ def load_checkpoint(output_dir: Path) -> tuple[np.ndarray | None, dict | None, int]:
+     """Load partial embeddings + metadata if build was interrupted."""
+     emb_ckpt = output_dir / "embeddings_checkpoint.npy"
+     meta_ckpt = output_dir / "metadata_checkpoint.json"
+
+     if emb_ckpt.exists() and meta_ckpt.exists():
+         embeddings = np.load(emb_ckpt)
+         with open(meta_ckpt) as f:
+             metadata = json.load(f)
+         start_idx = len(metadata)
+         log.info(f"[RESUME] Found checkpoint with {start_idx:,} images. Continuing...")
+         return embeddings, metadata, start_idx
+
+     return None, None, 0
+
+
+ def save_checkpoint(output_dir: Path, embeddings: np.ndarray, metadata: dict):
+     """Save incremental checkpoint."""
+     np.save(output_dir / "embeddings_checkpoint.npy", embeddings)
+     with open(output_dir / "metadata_checkpoint.json", "w") as f:
+         json.dump(metadata, f)
+
+
+ # ── Main pipeline ──────────────────────────────────────────────────────────────
+ def build_gallery(
+     image_dir: Path,
+     output_dir: Path,
+     batch_size: int = 64,
+     device: str = "auto",
+     metadata_csv: Optional[Path] = None,
+     resume: bool = False,
+     checkpoint_every: int = 500,
+ ):
+     """
+     Full pipeline: images β†’ BiomedCLIP embeddings β†’ FAISS index.
+
+     Args:
+         image_dir: Directory containing X-ray images (scanned recursively)
+         output_dir: Where to save visual_db.index + metadata.json
+         batch_size: Images per GPU/CPU batch (lower if OOM)
+         device: "cuda", "mps", "cpu", or "auto"
+         metadata_csv: Optional CSV with columns: filename, labels
+         resume: Resume from last checkpoint if available
+         checkpoint_every: Save checkpoint every N images
+     """
+     t_start = time.time()
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     # ── Resolve device ─────────────────────────────────────────────────────────
+     if device == "auto":
+         device = "cuda" if torch.cuda.is_available() else (
+             "mps" if torch.backends.mps.is_available() else "cpu"
+         )
+     log.info(f"Device: {device}")
+
+     # ── Collect image paths ────────────────────────────────────────────────────
+     all_images = sorted([
+         p for p in image_dir.rglob("*")
+         if p.suffix.lower() in SUPPORTED_EXTS
+     ])
+     if not all_images:
+         raise FileNotFoundError(f"No images found in {image_dir}")
+     log.info(f"Found {len(all_images):,} images in {image_dir}")
+
+     # ── Resume checkpoint ──────────────────────────────────────────────────────
+     existing_emb, existing_meta, start_idx = (None, None, 0)
+     if resume:
+         existing_emb, existing_meta, start_idx = load_checkpoint(output_dir)
+
+     images_to_process = all_images[start_idx:]
+     log.info(f"Images to process: {len(images_to_process):,}")
+
+     # ── Load BiomedCLIP ────────────────────────────────────────────────────────
+     model, transform = load_biomedclip(device)
+
+     # ── Dataset + DataLoader ───────────────────────────────────────────────────
+     dataset = XRayDataset(images_to_process, transform, metadata_csv)
+     loader = DataLoader(
+         dataset,
+         batch_size=batch_size,
+         num_workers=min(4, os.cpu_count() or 1),
+         pin_memory=(device == "cuda"),
+         collate_fn=collate_skip_corrupt,
+         prefetch_factor=2 if device == "cuda" else None,
+     )
+
+     # ── Accumulate embeddings ──────────────────────────────────────────────────
+     all_embeddings: list[np.ndarray] = []
+     all_metadata: dict = existing_meta or {}  # id (str) β†’ {filename, filepath, labels, idx}
+     global_idx = start_idx
+
+     log.info("Encoding images with BiomedCLIP...")
+     for batch in tqdm(loader, desc="Encoding", unit="batch", ncols=80):
+         if batch is None:
+             continue
+         tensors, paths, labels = batch
+         batch_emb = encode_batch(model, tensors, device)
+
+         for i, (path, label) in enumerate(zip(paths, labels)):
+             all_embeddings.append(batch_emb[i])
+             all_metadata[str(global_idx)] = {
+                 "filename": Path(path).name,
+                 "filepath": path,
+                 "labels": label,
+                 "idx": global_idx,
+             }
+             global_idx += 1
+
+         # Periodic checkpoint (once per batch, when a checkpoint boundary is crossed)
+         if global_idx % checkpoint_every < batch_size:
+             combined_emb = np.vstack(
+                 [existing_emb] + all_embeddings
+                 if existing_emb is not None else all_embeddings
+             )
+             save_checkpoint(output_dir, combined_emb, all_metadata)
+             log.info(f"  Checkpoint saved at {global_idx:,} images")
+
+     if not all_embeddings:
+         raise RuntimeError("No valid images were encoded. Check image directory.")
+
+     # ── Stack all embeddings ───────────────────────────────────────────────────
+     new_embeddings = np.vstack(all_embeddings)
+     if existing_emb is not None:
+         final_embeddings = np.vstack([existing_emb, new_embeddings])
+     else:
+         final_embeddings = new_embeddings
+
+     log.info(f"Embeddings shape: {final_embeddings.shape}")
+
+     # ── Build + save FAISS index ───────────────────────────────────────────────
+     index = build_faiss_index(final_embeddings)
+     index_path = output_dir / INDEX_FILE
+     faiss.write_index(index, str(index_path))
+     log.info(f"FAISS index saved β†’ {index_path} ({index_path.stat().st_size / 1e6:.1f} MB)")
+
+     # ── Save metadata ──────────────────────────────────────────────────────────
+     meta_path = output_dir / METADATA_FILE
+     with open(meta_path, "w") as f:
+         json.dump(all_metadata, f, indent=2)
+     log.info(f"Metadata saved β†’ {meta_path}")
+
+     # ── Save raw embeddings (optional, useful for offline analysis) ────────────
+     emb_path = output_dir / EMBEDDINGS_FILE
+     np.save(emb_path, final_embeddings)
+     log.info(f"Embeddings saved β†’ {emb_path} ({emb_path.stat().st_size / 1e6:.1f} MB)")
+
+     # ── Clean up checkpoints ───────────────────────────────────────────────────
+     for ckpt in ["embeddings_checkpoint.npy", "metadata_checkpoint.json"]:
+         ckpt_path = output_dir / ckpt
+         if ckpt_path.exists():
+             ckpt_path.unlink()
+
+     # ── Build stats ────────────────────────────────────────────────────────────
+     elapsed = time.time() - t_start
+     stats = {
+         "total_images": index.ntotal,
+         "skipped": len(all_images) - global_idx,  # corrupt/unreadable files
+         "embed_dim": EMBED_DIM,
+         "model": BIOMEDCLIP_MODEL,
+         "index_type": type(index).__name__,
+         "build_time_sec": round(elapsed, 1),
+         "throughput_img_per_sec": round(index.ntotal / elapsed, 1),
+         "index_size_mb": round(index_path.stat().st_size / 1e6, 2),
+         "device": device,
+     }
+     with open(output_dir / STATS_FILE, "w") as f:
+         json.dump(stats, f, indent=2)
+
+     log.info("=" * 55)
+     log.info("βœ… Gallery build complete!")
+     log.info(f"   Images indexed : {index.ntotal:,}")
+     log.info(f"   Build time     : {elapsed:.0f}s ({stats['throughput_img_per_sec']} img/s)")
+     log.info(f"   Index size     : {stats['index_size_mb']} MB")
+     log.info(f"   Output dir     : {output_dir.resolve()}")
+     log.info("=" * 55)
+
+     return index, all_metadata
+
+
+ # ── CLI ────────────────────────────────────────────────────────────────────────
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Build FAISS visual search index from chest X-ray images"
+     )
+     parser.add_argument(
+         "--image_dir", type=Path, required=True,
+         help="Root directory containing X-ray images (searched recursively)"
+     )
+     parser.add_argument(
+         "--output_dir", type=Path, default=Path("./index"),
+         help="Where to save visual_db.index + metadata.json (default: ./index)"
+     )
+     parser.add_argument(
+         "--batch_size", type=int, default=64,
+         help="Batch size for encoding. Reduce to 16 if CPU RAM < 8 GB (default: 64)"
+     )
+     parser.add_argument(
+         "--device", choices=["auto", "cuda", "cpu", "mps"], default="auto",
+         help="Compute device (default: auto-detect)"
+     )
+     parser.add_argument(
+         "--metadata_csv", type=Path, default=None,
+         help="Optional CSV with columns: filename, labels"
+     )
+     parser.add_argument(
+         "--resume", action="store_true",
+         help="Resume from last checkpoint if build was interrupted"
+     )
+     parser.add_argument(
+         "--checkpoint_every", type=int, default=500,
+         help="Save checkpoint every N images (default: 500)"
+     )
+     args = parser.parse_args()
+
+     build_gallery(
+         image_dir=args.image_dir.resolve(),
+         output_dir=args.output_dir.resolve(),
+         batch_size=args.batch_size,
+         device=args.device,
+         metadata_csv=args.metadata_csv,
+         resume=args.resume,
+         checkpoint_every=args.checkpoint_every,
+     )
+
+
+ if __name__ == "__main__":
+     main()
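Editor's note: the builder stores L2-normalized vectors in `IndexFlatIP`, relying on the identity that the inner product of unit vectors equals their cosine similarity. A minimal numpy-only check of that identity (a sketch, no FAISS required; `cosine_via_dot` is an illustrative helper, not part of this repo):

```python
import numpy as np

def cosine_via_dot(a: np.ndarray, b: np.ndarray) -> float:
    # After L2 normalization the inner product IS the cosine similarity,
    # which is why IndexFlatIP over normalized vectors does cosine search.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(512), rng.standard_normal(512)
reference = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
assert abs(cosine_via_dot(x, y) - reference) < 1e-9
```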
render.yaml ADDED
@@ -0,0 +1,20 @@
+ services:
+   - type: web
+     name: medrag-app
+     env: python
+     plan: free
+     buildCommand: pip install --index-url https://download.pytorch.org/whl/cpu torch torchvision && pip install -r requirements.txt
+     startCommand: python download_assets.py && streamlit run app.py --server.port $PORT --server.address 0.0.0.0
+     envVars:
+       - key: PYTHONUNBUFFERED
+         value: "1"
+       - key: DATA_DIR
+         value: /tmp/medrag_data
+       - key: HF_HOME
+         value: /tmp/hf_cache
+       - key: PREFETCH_MODEL
+         value: "1"
+       - key: GDRIVE_INDEX_URL
+         value: ""
+       - key: GDRIVE_IMAGES_URL
+         value: ""
requirements-space.txt ADDED
@@ -0,0 +1,11 @@
+ open-clip-torch>=2.24.0
+ faiss-cpu>=1.7.4
+ Pillow>=10.0.0
+ numpy>=1.24.0
+ tqdm>=4.66.0
+ requests>=2.31.0
+ pandas>=2.0.0
+ streamlit>=1.31.0
+ gdown>=5.1.0
+ huggingface-hub>=0.28.0
+ transformers>=4.30.0,<5
requirements.txt ADDED
@@ -0,0 +1,31 @@
+ # Gallery Builder – Python Dependencies
+ # Install with: pip install -r requirements.txt
+ #
+ # GPU support (recommended for faster encoding):
+ #   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
+ #
+ # CPU only:
+ #   pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
+
+ # Core ML
+ torch>=2.1.0
+ torchvision>=0.16.0
+ open-clip-torch>=2.24.0    # BiomedCLIP lives here
+
+ # Vector database
+ faiss-cpu>=1.7.4           # swap for faiss-gpu if CUDA available
+
+ # Image processing
+ Pillow>=10.0.0
+ numpy>=1.24.0
+
+ # Utilities
+ tqdm>=4.66.0
+ requests>=2.31.0
+ pandas>=2.0.0
+ streamlit>=1.31.0
+ gdown>=5.1.0
+
+ # Testing
+ pytest>=7.4.0
+ pytest-cov>=4.1.0
rewrite_metadata.py ADDED
@@ -0,0 +1,43 @@
+ """
+ rewrite_metadata.py
+ -------------------
+ Utility to rewrite metadata.json filepaths for deployment.
+
+ Example:
+     python rewrite_metadata.py \
+         --index_dir ./index \
+         --from_prefix "/Users/you/MedRAG/data/train" \
+         --to_prefix "/var/data/images"
+ """
+
+ import argparse
+ import json
+ from pathlib import Path
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Rewrite metadata.json filepaths")
+     parser.add_argument("--index_dir", type=Path, default=Path("./index"))
+     parser.add_argument("--from_prefix", required=True)
+     parser.add_argument("--to_prefix", required=True)
+     args = parser.parse_args()
+
+     meta_path = args.index_dir / "metadata.json"
+     if not meta_path.exists():
+         raise FileNotFoundError(f"metadata.json not found: {meta_path}")
+
+     data = json.loads(meta_path.read_text())
+     updated = 0
+
+     for _, entry in data.items():
+         fp = entry.get("filepath", "")
+         if fp.startswith(args.from_prefix):
+             entry["filepath"] = fp.replace(args.from_prefix, args.to_prefix, 1)
+             updated += 1
+
+     meta_path.write_text(json.dumps(data, indent=2))
+     print(f"Rewrote {updated} filepaths in {meta_path}")
+
+
+ if __name__ == "__main__":
+     main()
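Editor's note: the script replaces only the first occurrence of the prefix, and only when the path actually starts with it, so a string that happens to recur deeper in the path is left alone. A minimal standalone check of that rule (`rewrite_path` is an illustrative helper mirroring the loop body above, not part of this repo):

```python
def rewrite_path(fp: str, from_prefix: str, to_prefix: str) -> str:
    # Mirrors the loop body: startswith guard + replace(..., 1)
    # so only the leading prefix is rewritten.
    if fp.startswith(from_prefix):
        return fp.replace(from_prefix, to_prefix, 1)
    return fp
```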
start.sh ADDED
@@ -0,0 +1,9 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ export DATA_DIR="${DATA_DIR:-/tmp/medrag_data}"
+ export HF_HOME="${HF_HOME:-/tmp/hf_cache}"
+ export PREFETCH_MODEL="${PREFETCH_MODEL:-1}"
+
+ python download_assets.py
+ exec streamlit run app.py --server.port 7860 --server.address 0.0.0.0
test_visual_search.py ADDED
@@ -0,0 +1,308 @@
+ """
+ test_visual_search.py
+ ─────────────────────
+ Unit + integration tests for the gallery builder pipeline.
+
+ Run:
+     # Fast unit tests (no model needed):
+     pytest test_visual_search.py -v -m "not integration"
+
+     # Full integration test (requires built index):
+     pytest test_visual_search.py -v --index_dir ./index --image_dir ./data
+ """
+
+ import json
+ import numpy as np
+ import pytest
+ from pathlib import Path
+ from unittest.mock import patch, MagicMock
+ from PIL import Image
+
+
+ # ── Fixtures ───────────────────────────────────────────────────────────────────
+ @pytest.fixture
+ def dummy_index_dir(tmp_path):
+     """Create a minimal fake FAISS index + metadata for unit tests."""
+     import faiss
+
+     d = 512
+     n = 20
+     embeddings = np.random.randn(n, d).astype(np.float32)
+     # L2 normalize
+     norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+     embeddings /= norms
+
+     index = faiss.IndexFlatIP(d)
+     index.add(embeddings)
+     faiss.write_index(index, str(tmp_path / "visual_db.index"))
+
+     metadata = {
+         str(i): {
+             "filename": f"image_{i:04d}.png",
+             "filepath": str(tmp_path / f"image_{i:04d}.png"),
+             "labels": "Pneumonia" if i % 3 == 0 else "No Finding",
+             "idx": i,
+         }
+         for i in range(n)
+     }
+     with open(tmp_path / "metadata.json", "w") as f:
+         json.dump(metadata, f)
+
+     # Create dummy PNG files
+     for i in range(n):
+         img = Image.fromarray(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
+         img.save(tmp_path / f"image_{i:04d}.png")
+
+     return tmp_path, embeddings
+
+
+ @pytest.fixture
+ def dummy_xray_image(tmp_path) -> Path:
+     """Create a fake grayscale X-ray image."""
+     img_array = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
+     img = Image.fromarray(img_array, mode="L").convert("RGB")
+     path = tmp_path / "test_xray.png"
+     img.save(path)
+     return path
+
+
+ # ── Unit tests ─────────────────────────────────────────────────────────────────
+ class TestSearchResult:
+     def test_to_dict(self):
+         from visual_search import SearchResult
+         r = SearchResult(rank=1, idx=5, filename="img.png",
+                          filepath="/data/img.png", labels="Pneumonia",
+                          similarity=0.87654)
+         d = r.to_dict()
+         assert d["rank"] == 1
+         assert d["similarity"] == 0.8765  # rounded to 4 decimal places
+         assert d["labels"] == "Pneumonia"
+         assert "image" not in d  # PIL image not serialized
+
+
+ class TestFAISSIndex:
+     """Test FAISS index properties independent of BiomedCLIP."""
+
+     def test_build_flat_index(self):
+         import faiss
+         d, n = 512, 100
+         emb = np.random.randn(n, d).astype(np.float32)
+         emb /= np.linalg.norm(emb, axis=1, keepdims=True)
+
+         index = faiss.IndexFlatIP(d)
+         index.add(emb)
+         assert index.ntotal == n
+
+     def test_search_returns_correct_k(self):
+         import faiss
+         d, n = 512, 50
+         emb = np.random.randn(n, d).astype(np.float32)
+         emb /= np.linalg.norm(emb, axis=1, keepdims=True)
+
+         index = faiss.IndexFlatIP(d)
+         index.add(emb)
+
+         query = emb[0:1]  # use first vector as query
+         sims, idxs = index.search(query, k=5)
+         assert sims.shape == (1, 5)
+         assert idxs.shape == (1, 5)
+         # Self-match should be first with similarity β‰ˆ 1.0
+         assert abs(sims[0][0] - 1.0) < 1e-5
+         assert idxs[0][0] == 0
+
+     def test_cosine_similarity_via_dot_product(self):
+         """L2-normalized dot product = cosine similarity."""
+         import faiss
+         d = 512
+         # Two identical vectors should have similarity 1.0
+         v = np.random.randn(1, d).astype(np.float32)
+         v /= np.linalg.norm(v)
+
+         index = faiss.IndexFlatIP(d)
+         index.add(v)
+
+         sims, _ = index.search(v, k=1)
+         assert abs(sims[0][0] - 1.0) < 1e-5
+
+     def test_ivf_index_for_large_gallery(self):
+         """IVF index works for large galleries (>10K vectors)."""
+         import faiss
+         d, n = 512, 10_000
+         emb = np.random.randn(n, d).astype(np.float32)
+         emb /= np.linalg.norm(emb, axis=1, keepdims=True)
+
+         nlist = 64
+         quantizer = faiss.IndexFlatIP(d)
+         index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
+         index.train(emb)
+         index.add(emb)
+         index.nprobe = 8
+
+         assert index.ntotal == n
+         # Check that search still works
+         sims, idxs = index.search(emb[0:1], k=5)
+         assert idxs[0][0] == 0  # self should be top result
+
+
+ class TestMetadataBuilding:
+     def test_metadata_keys(self, dummy_index_dir):
+         index_dir, _ = dummy_index_dir
+         meta_path = index_dir / "metadata.json"
+         with open(meta_path) as f:
+             meta = json.load(f)
+         assert "0" in meta
+         entry = meta["0"]
+         assert "filename" in entry
+         assert "filepath" in entry
+         assert "labels" in entry
+         assert "idx" in entry
+
+     def test_metadata_count_matches_index(self, dummy_index_dir):
+         import faiss
+         index_dir = dummy_index_dir[0]
+         index = faiss.read_index(str(index_dir / "visual_db.index"))
+         with open(index_dir / "metadata.json") as f:
+             meta = json.load(f)
+         assert index.ntotal == len(meta)
+
+
+ class TestVisualSearchEngine:
+     """Tests using mocked BiomedCLIP to avoid model download."""
+
+     def _get_engine_with_mock_model(self, index_dir):
+         """Create engine with BiomedCLIP mocked out."""
+         from visual_search import VisualSearchEngine
+
+         with patch("visual_search.open_clip.create_model_and_transforms") as mock_create:
+             mock_model = MagicMock()
+             mock_transform = MagicMock(return_value=MagicMock(
+                 unsqueeze=lambda _: MagicMock(to=lambda _: MagicMock())
+             ))
+             mock_create.return_value = (mock_model, None, mock_transform)
+
+             engine = VisualSearchEngine(index_dir=index_dir, device="cpu")
+
+         # Mock the embed function to return a random normalized vector
+         def fake_embed(img):
+             v = np.random.randn(1, 512).astype(np.float32)
+             v /= np.linalg.norm(v, axis=1, keepdims=True)
+             return v
+
+         engine._embed_image = fake_embed
+         return engine
+
+     def test_search_returns_k_results(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         dummy_img = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+         results = engine.search(dummy_img, top_k=5)
+         assert len(results) == 5
+
+     def test_results_sorted_by_similarity(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         dummy_img = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+         results = engine.search(dummy_img, top_k=5)
+         sims = [r.similarity for r in results]
+         assert sims == sorted(sims, reverse=True)
+
+     def test_results_have_required_fields(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         dummy_img = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+         results = engine.search(dummy_img, top_k=3)
+         for r in results:
+             assert hasattr(r, "rank")
+             assert hasattr(r, "filename")
+             assert hasattr(r, "filepath")
+             assert hasattr(r, "labels")
+             assert hasattr(r, "similarity")
+             assert 0.0 <= r.similarity <= 1.0
+
+     def test_ranks_are_sequential(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         dummy_img = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+         results = engine.search(dummy_img, top_k=5)
+         for i, r in enumerate(results, start=1):
+             assert r.rank == i
+
+     def test_file_not_found_raises(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+         with pytest.raises(FileNotFoundError):
+             engine.search("/nonexistent/image.png")
+
+     def test_batch_search(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         imgs = [
+             Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+             for _ in range(3)
+         ]
+         batch_results = engine.search_batch(imgs, top_k=5)
+         assert len(batch_results) == 3
+         assert all(len(r) == 5 for r in batch_results)
+
+     def test_get_stats(self, dummy_index_dir):
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+         stats = engine.get_stats()
+         assert "total_images" in stats
+         assert stats["total_images"] == 20
+         assert stats["embed_dim"] == 512
+
+     def test_to_dict_serializable(self, dummy_index_dir):
+         """Search results must be JSON serializable for API responses."""
+         index_dir = dummy_index_dir[0]
+         engine = self._get_engine_with_mock_model(index_dir)
+
+         dummy_img = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
+         results = engine.search(dummy_img, top_k=3)
+         payload = [r.to_dict() for r in results]
+         assert json.dumps(payload)  # raises if not serializable
+
+
+ # ── Integration tests (require real index) ─────────────────────────────────────
+ @pytest.mark.integration
+ class TestIntegration:
+     """Run with: pytest -m integration --index_dir ./index --image_dir ./data"""
+
+     @pytest.fixture(autouse=True)
+     def setup(self, request):
+         self.index_dir = Path(request.config.getoption("--index_dir", default="./index"))
+         self.image_dir = Path(request.config.getoption("--image_dir", default="./data"))
+
+     def test_real_search(self):
+         from visual_search import VisualSearchEngine
+         engine = VisualSearchEngine(self.index_dir, device="cpu")
+         stats = engine.get_stats()
+         assert stats["total_images"] > 0
+         print(f"\nIndex contains {stats['total_images']:,} images")
+
+     def test_search_with_real_image(self):
+         from visual_search import VisualSearchEngine
+         engine = VisualSearchEngine(self.index_dir, device="cpu")
+
+         # Find first image in data dir
+         images = list(self.image_dir.rglob("*.png"))[:1]
+         if not images:
+             pytest.skip("No test images found")
+
+         results = engine.search(images[0], top_k=5, exclude_perfect_match=True)
+         assert len(results) > 0
+         assert results[0].similarity <= 1.0
+         print(f"\nTop result: {results[0].filename} sim={results[0].similarity:.3f}")
+
+
+ # ── Pytest config ──────────────────────────────────────────────────────────────
+ # NOTE: pytest only collects pytest_addoption from conftest.py (or a plugin);
+ # copy these lines into conftest.py for the --index_dir/--image_dir flags to work.
+ def pytest_addoption(parser):
+     parser.addoption("--index_dir", action="store", default="./index")
+     parser.addoption("--image_dir", action="store", default="./data")
visual_search.py ADDED
@@ -0,0 +1,358 @@
+ """
2
+ visual_search.py
3
+ ────────────────
4
+ Search function for the Medical X-ray RAG system.
5
+
6
+ Input: A chest X-ray image (file path or PIL Image or numpy array)
7
+ Output: Top-K most similar cases from the gallery database
8
+
9
+ This is the module imported by your web app and RAG pipeline.
10
+
11
+ Usage:
12
+ from visual_search import VisualSearchEngine
13
+
14
+ engine = VisualSearchEngine(
15
+ index_dir="./index",
16
+ device="auto"
17
+ )
18
+
19
+ results = engine.search("./query_xray.png", top_k=5)
20
+ # returns List[SearchResult]
21
+ for r in results:
22
+ print(f"{r.rank}. {r.filename} sim={r.similarity:.3f} labels={r.labels}")
23
+ """
24
+
25
+ import json
26
+ import time
27
+ import logging
28
+ import numpy as np
29
+ from pathlib import Path
30
+ from dataclasses import dataclass, field
31
+ from typing import Union, Optional
32
+
33
+ import faiss
34
+ import torch
35
+ import open_clip
36
+ from PIL import Image, UnidentifiedImageError
37
+
38
+ log = logging.getLogger(__name__)
39
+
40
+ # ── Constants (must match gallery_builder.py) ──────────────────────────────────
41
+ BIOMEDCLIP_MODEL = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
42
+ INDEX_FILE = "visual_db.index"
43
+ METADATA_FILE = "metadata.json"
44
+
45
+
46
+ # ── Result dataclass ───────────────────────────────────────────────────────────
47
+ @dataclass
48
+ class SearchResult:
49
+ """One similar case returned by the search engine."""
50
+ rank: int # 1 = most similar
51
+ idx: int # Internal FAISS index ID
52
+ filename: str # Image filename
53
+ filepath: str # Absolute path to the image
54
+ labels: str # Diagnosis labels (from metadata)
55
+ similarity: float # Cosine similarity [0, 1]
56
+ image: Optional[object] = field(default=None, repr=False)
57
+ # ↑ Optionally loaded PIL Image (set load_images=True in search())
58
+
59
+ def to_dict(self) -> dict:
60
+ return {
61
+ "rank": self.rank,
62
+ "idx": self.idx,
63
+ "filename": self.filename,
64
+ "filepath": self.filepath,
65
+ "labels": self.labels,
66
+ "similarity": round(float(self.similarity), 4),
67
+ }
68
+
69
+
70
+ # ── Search Engine ──────────────────────────────────────────────────────────────
71
+ class VisualSearchEngine:
72
+ """
73
+ Thread-safe visual search engine for chest X-ray similarity retrieval.
74
+
75
+ Architecture:
76
+ Query image
77
+ β”‚
78
+ β–Ό
79
+ BiomedCLIP vision encoder β†’ 512-dim embedding (L2 normalized)
80
+ β”‚
81
+ β–Ό
82
+ FAISS IndexFlatIP β†’ cosine similarity search
83
+ β”‚
84
+ β–Ό
85
+ Top-K results + metadata
86
+
87
+ Attributes:
88
+ index_dir (Path): Directory containing visual_db.index + metadata.json
89
+ device (str): Compute device for BiomedCLIP
90
+ top_k (int): Default number of results to return
91
+ """
92
+
93
+ def __init__(
94
+ self,
95
+ index_dir: Union[str, Path],
96
+ device: str = "auto",
97
+ top_k: int = 5,
98
+ ):
99
+ self.index_dir = Path(index_dir).resolve()
100
+ self.top_k = top_k
101
+ self._model = None
102
+ self._transform = None
103
+ self._index = None
104
+ self._metadata: dict = {}
105
+
106
+ # Resolve device
107
+ if device == "auto":
108
+ if torch.cuda.is_available():
109
+ self.device = "cuda"
110
+ elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
111
+ self.device = "mps"
112
+ else:
113
+ self.device = "cpu"
114
+ else:
115
+ self.device = device
116
+
117
+ # Eager load
118
+ self._load_index()
119
+ self._load_model()
120
+ log.info(f"VisualSearchEngine ready (index={self._index.ntotal:,} images, device={self.device})")
121
+
122
+ # ── Private loaders ────────────────────────────────────────────────────────
123
+ def _load_index(self):
124
+ """Load FAISS index + metadata from disk."""
125
+ index_path = self.index_dir / INDEX_FILE
126
+ meta_path = self.index_dir / METADATA_FILE
127
+
128
+ if not index_path.exists():
129
+ raise FileNotFoundError(
130
+ f"FAISS index not found: {index_path}\n"
131
+ "Run: python gallery_builder.py --image_dir ./data --output_dir ./index"
132
+ )
133
+ if not meta_path.exists():
134
+ raise FileNotFoundError(f"Metadata file not found: {meta_path}")
135
+
136
+ log.info(f"Loading FAISS index from {index_path}...")
137
+ self._index = faiss.read_index(str(index_path))
138
+
139
+ # For IVF indexes, set nprobe for recall/speed tradeoff
140
+ if hasattr(self._index, "nprobe"):
141
+ self._index.nprobe = 16
142
+
143
+ log.info(f"Index loaded ({self._index.ntotal:,} vectors, dim={self._index.d})")
144
+
145
+ with open(meta_path) as f:
146
+ self._metadata = json.load(f)
147
+
148
+ def _load_model(self):
149
+ """Load BiomedCLIP vision encoder."""
150
+ log.info("Loading BiomedCLIP encoder...")
151
+ model, _, transform = open_clip.create_model_and_transforms(BIOMEDCLIP_MODEL)
152
+ self._model = model.to(self.device).eval()
153
+ self._transform = transform
154
+ log.info("BiomedCLIP loaded βœ“")
155
+
156
+ # ── Embedding ──────────────────────────────────────────────────────────────
157
+ @torch.no_grad()
158
+ def _embed_image(self, image: Image.Image) -> np.ndarray:
159
+ """
160
+ Encode a single PIL image β†’ L2-normalized 512-dim embedding.
161
+ Returns shape (1, 512) float32 numpy array.
162
+ """
163
+ tensor = self._transform(image).unsqueeze(0).to(self.device)
164
+ features = self._model.encode_image(tensor)
165
+ features = features / features.norm(dim=-1, keepdim=True)
166
+ return features.cpu().numpy().astype(np.float32)
167
+
168
+ # ── Public API ─────────────────────────────────────────────────────────────
169
+ def search(
170
+ self,
171
+ query: Union[str, Path, Image.Image, np.ndarray],
172
+ top_k: Optional[int] = None,
173
+ load_images: bool = False,
174
+ exclude_perfect_match: bool = False,
175
+ ) -> list[SearchResult]:
176
+ """
177
+ Find the top-K most similar X-ray images to a query.
178
+
179
+ Args:
180
+ query: File path, PIL Image, or RGB numpy array
181
+ top_k: Number of results (overrides default)
182
+ load_images: Load PIL Images into SearchResult.image
183
+ exclude_perfect_match: Skip results with similarity β‰₯ 0.9999
184
+ (use when query is in the gallery itself)
185
+
186
+ Returns:
187
+ List[SearchResult] ordered by descending similarity
188
+ """
189
+ t0 = time.perf_counter()
190
+ k = top_k or self.top_k
191
+
192
+ # ── Load query image ───────────────────────────────────────────────────
193
+ if isinstance(query, (str, Path)):
194
+ query_path = Path(query)
195
+ if not query_path.exists():
196
+ raise FileNotFoundError(f"Query image not found: {query_path}")
197
+ try:
198
+ img = Image.open(query_path).convert("RGB")
199
+ except (UnidentifiedImageError, OSError) as e:
200
+ raise ValueError(f"Cannot open image: {query_path} ({e})")
201
+
202
+ elif isinstance(query, np.ndarray):
203
+ img = Image.fromarray(query.astype(np.uint8))
204
+
205
+ elif isinstance(query, Image.Image):
206
+ img = query.convert("RGB")
207
+
208
+ else:
209
+ raise TypeError(f"Unsupported query type: {type(query)}")
210
+
211
+ # ── Encode ─────────────────────────────────────────────────────────────
212
+ query_emb = self._embed_image(img) # (1, 512)
213
+
214
+ # ── FAISS search ───────────────────────────────────────────────────────
215
+ search_k = k + 1 if exclude_perfect_match else k
216
+ similarities, indices = self._index.search(query_emb, search_k)
217
+ similarities = similarities[0] # (k,)
218
+ indices = indices[0] # (k,)
219
+
220
+ # ── Build results ──────────────────────────────────────────────────────
221
+ results: list[SearchResult] = []
222
+ rank = 1
223
+ for sim, idx in zip(similarities, indices):
224
+ if idx < 0: # FAISS returns -1 for empty slots
225
+ continue
226
+ if exclude_perfect_match and float(sim) >= 0.9999:
227
+ continue # skip exact self-match
228
+
229
+ meta = self._metadata.get(str(idx), {})
230
+ filepath = meta.get("filepath", "")
231
+
232
+ result = SearchResult(
233
+ rank=rank,
234
+ idx=int(idx),
235
+ filename=meta.get("filename", f"image_{idx}"),
236
+ filepath=filepath,
237
+ labels=meta.get("labels", "Unknown"),
238
+ similarity=float(sim),
239
+ )
240
+
241
+ if load_images and filepath and Path(filepath).exists():
242
+ try:
243
+ result.image = Image.open(filepath).convert("RGB")
244
+ except Exception:
245
+ pass # image loading is best-effort
246
+
247
+ results.append(result)
248
+ rank += 1
249
+ if len(results) >= k:
250
+ break
251
+
252
+ elapsed_ms = (time.perf_counter() - t0) * 1000
253
+ log.debug(f"Search completed in {elapsed_ms:.1f} ms β†’ {len(results)} results")
254
+ return results
255
+
256
+ def search_batch(
257
+ self,
258
+ queries: list[Union[str, Path, Image.Image]],
259
+ top_k: Optional[int] = None,
260
+ ) -> list[list[SearchResult]]:
261
+ """
262
+ Batch search for multiple query images.
263
+ More efficient than calling search() in a loop.
264
+ """
265
+ k = top_k or self.top_k
266
+ embeddings = []
267
+
268
+ for q in queries:
269
+ if isinstance(q, (str, Path)):
270
+ img = Image.open(q).convert("RGB")
271
+ elif isinstance(q, np.ndarray):
272
+ img = Image.fromarray(q.astype(np.uint8))
273
+ else:
274
+ img = q.convert("RGB")
275
+ embeddings.append(self._embed_image(img)[0])
276
+
277
+ batch_emb = np.stack(embeddings) # (N, 512)
278
+ sims_batch, idxs_batch = self._index.search(batch_emb, k)
279
+
280
+ all_results = []
281
+ for sims, idxs in zip(sims_batch, idxs_batch):
282
+ results = []
283
+ for rank, (sim, idx) in enumerate(zip(sims, idxs), start=1):
284
+ if idx < 0:
285
+ continue
286
+ meta = self._metadata.get(str(idx), {})
287
+ results.append(SearchResult(
288
+ rank=rank,
289
+ idx=int(idx),
290
+ filename=meta.get("filename", f"image_{idx}"),
291
+ filepath=meta.get("filepath", ""),
292
+ labels=meta.get("labels", "Unknown"),
293
+ similarity=float(sim),
294
+ ))
295
+ all_results.append(results)
296
+
297
+ return all_results
298
+
299
+ def get_stats(self) -> dict:
300
+ """Return index statistics."""
301
+ return {
302
+ "total_images": self._index.ntotal,
303
+ "embed_dim": self._index.d,
304
+ "index_type": type(self._index).__name__,
305
+ "device": self.device,
306
+ "index_dir": str(self.index_dir),
307
+ }
308
+
309
+ def __repr__(self) -> str:
310
+ return (
311
+ f"VisualSearchEngine("
312
+ f"images={self._index.ntotal:,}, "
313
+ f"device={self.device}, "
314
+ f"index_dir={self.index_dir})"
315
+ )
316
+
317
+
318
+ # ── Standalone CLI ─────────────────────────────────────────────────────────────
319
+ def main():
320
+ import argparse
321
+ from pprint import pprint
322
+
323
+ parser = argparse.ArgumentParser(
324
+ description="Search for similar X-ray images"
325
+ )
326
+ parser.add_argument("query_image", type=Path, help="Path to query X-ray image")
327
+ parser.add_argument(
328
+ "--index_dir", type=Path, default=Path("./index"),
329
+ help="Directory with visual_db.index (default: ./index)"
330
+ )
331
+ parser.add_argument("--top_k", type=int, default=5)
332
+ parser.add_argument("--device", default="auto")
333
+ args = parser.parse_args()
334
+
335
+ logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
336
+
337
+ engine = VisualSearchEngine(
338
+ index_dir=args.index_dir,
339
+ device=args.device,
340
+ top_k=args.top_k,
341
+ )
342
+
343
+ print(f"\nπŸ” Query: {args.query_image}")
344
+ print("=" * 60)
345
+ results = engine.search(args.query_image, exclude_perfect_match=True)
346
+
347
+ for r in results:
348
+ bar = "β–ˆ" * int(r.similarity * 30)
349
+ print(f" #{r.rank} {r.similarity:.3f} {bar}")
350
+ print(f" {r.filename}")
351
+ print(f" Labels: {r.labels}")
352
+ print()
353
+
354
+ print(f"Index stats: {engine.get_stats()}")
355
+
356
+
357
+ if __name__ == "__main__":
358
+ main()
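Why `IndexFlatIP` returns cosine similarity here: both the gallery embeddings (built by `gallery_builder.py`) and the query embedding from `_embed_image` are L2-normalized, so the inner product equals the cosine. A minimal numpy sketch of that invariant (no FAISS required; the sizes and random data are illustrative, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.normal(size=(20, 512)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)  # L2-normalize rows

# Query: a slightly perturbed copy of gallery image 3
query = gallery[3] + 0.001 * rng.normal(size=512).astype(np.float32)
query /= np.linalg.norm(query)

sims = gallery @ query        # inner product == cosine similarity
top5 = np.argsort(-sims)[:5]  # descending order, like IndexFlatIP

assert top5[0] == 3           # the near-duplicate ranks first
```

The same property is what makes `exclude_perfect_match` workable: a self-match scores essentially 1.0, so a fixed threshold like 0.9999 cleanly separates it from genuinely similar neighbors.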