MedDiscover-HF Rollout Plan (Hugging Face Spaces, ZeroGPU)

1) Goals & Constraints

  • Host MedDiscover as a Gradio Space using ZeroGPU; no external API models (OpenAI, etc.).
  • Provide a model dropdown with OSS models tested in other Spaces: openai/gpt-oss-20b, google/gemma-3-12b-it, deepseek-vl2-small, ibm-granite/granite-vision family, ibm-granite/granite-docling-258M, plus room for additions.
  • Keep existing pipeline: PDF ingest → chunking (MedCPT tokenizer/encoder) → FAISS retrieval → answer generation via selected OSS model.
  • Must run in the HF build/runtime: single app.py entry (or src/app.py with app_file set), sdk: gradio, dependencies via requirements.txt/pyproject. Persistent storage is limited; assume a /data volume for cached indices.

2) Spaces-specific mechanics to carry over

  • Use @spaces.GPU() (as in gpt-oss demo) to request ZeroGPU for heavy calls; pair with device_map="auto" and torch_dtype="auto"/bfloat16 to fit managed GPUs.
  • For IBM Granite Docling/Vision: some models require use_auth_token=True and trust_remote_code; guard model loading with gr.NO_RELOAD so weights are not reloaded on hot-reload.
  • Streaming: use TextIteratorStreamer with threaded generation (as in the gpt-oss and gemma demos) to keep the UI responsive; see the sketch after this list.
  • System/developer prompts: gpt-oss demo uses Harmony encoding/preprompt parsing; we can simplify to plain chat unless Harmony is desired. If kept, include openai_harmony dependency and message rendering.
  • Media handling: the gemma demo enforces image/video limits and uses <image> tag counting; the docling/granite demos load sample assets and draw bounding boxes. Both are good references for multimodal support.
  • README front matter (title, sdk: gradio, sdk_version, app_file) is required for Spaces config; ZeroGPU Spaces accept the same (see the front-matter example in section 6).
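
A minimal sketch of the ZeroGPU streaming pattern described above, assuming a plain causal LM loaded through transformers; MODEL_ID, the generation defaults, and the function name are illustrative, not the final design:

```python
# Sketch only: ZeroGPU streaming generation following the gpt-oss/gemma demo pattern.
import threading

import gradio as gr
import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "openai/gpt-oss-20b"  # placeholder default

if gr.NO_RELOAD:  # load weights once; skip on Gradio hot-reload
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )

@spaces.GPU()  # request a ZeroGPU slice only for the heavy call
def generate(prompt: str, max_new_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    # Generate in a background thread so tokens can be yielded as they arrive.
    threading.Thread(target=model.generate, kwargs=gen_kwargs).start()
    text = ""
    for piece in streamer:
        text += piece
        yield text
```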

3) Architecture on HF

  • Single app.py (or src/app.py) hosting:
    • Model registry: map human-facing names → loader functions/config (pipeline/AutoModel, tokenizer/processor, dtype, device_map, chat template).
    • Embedding and indexing: MedCPT article encoder (requires GPU). Build FAISS index into /data/faiss_index.bin with metadata /data/doc_metadata.json; reuse across sessions if present.
    • Retrieval: embed query with the same encoder, FAISS search (IP for MedCPT), optional rerank (if cross-encoder feasible on available GPU; otherwise skip).
    • Generation: streaming handler wrapping selected model; minimal prompt template: “Use only retrieved context; answer concisely.”
    • Gradio UI: upload PDFs, process PDFs (chunk+index), dropdown for embedding model (MedCPT only here) and generator model (OSS list), sliders for k/max_tokens/temp/etc., chat box showing answer + context.
  • Persistence: point caches/indices to /data (Space persistent storage). Handle cold start by checking disk before loading; see the registry/cold-start sketch below.
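
A sketch of the model registry and the /data cold-start check, using the index/metadata paths named above; the registry keys and fields are illustrative:

```python
# Sketch only: model registry plus /data cold-start check.
import json
import os

import faiss

DATA_DIR = "/data"
INDEX_PATH = os.path.join(DATA_DIR, "faiss_index.bin")
META_PATH = os.path.join(DATA_DIR, "doc_metadata.json")

# Human-facing names -> loader config; loader functions live behind the
# common generation wrapper described in section 4.
MODEL_REGISTRY = {
    "GPT-OSS 20B": {"repo": "openai/gpt-oss-20b", "loader": "pipeline", "dtype": "auto"},
    "Gemma 3 12B IT": {"repo": "google/gemma-3-12b-it", "loader": "gemma3", "dtype": "bfloat16"},
    "Granite Docling 258M": {"repo": "ibm-granite/granite-docling-258M", "loader": "idefics3", "dtype": "bfloat16"},
}

def load_index_if_cached():
    """Reuse the persisted FAISS index/metadata across sessions when present."""
    if os.path.exists(INDEX_PATH) and os.path.exists(META_PATH):
        index = faiss.read_index(INDEX_PATH)
        with open(META_PATH) as f:
            metadata = json.load(f)
        return index, metadata
    return None, None
```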

4) Model integration plan (technical nuances)

  • openai/gpt-oss-20b:
    • Load via pipeline("text-generation", model="openai/gpt-oss-20b", trust_remote_code=True, device_map="auto", torch_dtype="auto").
    • Optionally keep Harmony encoding & @spaces.GPU() wrapper for generation; streaming with TextIteratorStreamer.
  • google/gemma-3-12b-it:
    • AutoProcessor.from_pretrained(..., padding_side="left"), Gemma3ForConditionalGeneration.from_pretrained(..., device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager").
    • Supports images/videos with <image> tags; enforce file count limits if multimodal chat is desired.
  • deepseek-vl2-small (assumed similar VLM):
    • Inspect loader; likely AutoProcessor + vision-language model. Use device_map="auto", torch_dtype=torch.bfloat16, and streaming generation.
  • ibm-granite/granite-vision*:
    • Vision-language; may need trust_remote_code; keep sample image handling (resize/pad) patterns from demo.
  • ibm-granite/granite-docling-258M:
    • Uses AutoProcessor + Idefics3ForConditionalGeneration; requires use_auth_token=True, torch_dtype=torch.bfloat16, device_map=device.
    • Handles DocTags markup and bounding boxes; we can limit to text answers (no drawing) to reduce complexity initially.
  • For all: wrap generation in a common interface that accepts (query, retrieved_chunks, params) → streamed text. Unify token stopping; cap max_new_tokens modestly to stay within ZeroGPU limits. A sketch of this wrapper follows.
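
A sketch of that common interface; the prompt template and parameter names are assumptions, and generate is the @spaces.GPU() streaming wrapper from section 2:

```python
# Sketch only: unified (query, retrieved_chunks, params) -> streamed text.
from typing import Iterator

PROMPT_TEMPLATE = (
    "Use only the retrieved context below; answer concisely.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)

def answer(query: str, retrieved_chunks: list[str], params: dict) -> Iterator[str]:
    context = "\n\n".join(retrieved_chunks)
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    # Cap max_new_tokens modestly to stay within ZeroGPU time limits.
    max_new = min(int(params.get("max_tokens", 512)), 1024)
    yield from generate(prompt, max_new_tokens=max_new)
```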

5) Retrieval/Chunking specifics

  • Chunking: reuse the existing 500-word chunks with 50-word overlap; keep the MedCPT encoders (article + query) loaded once behind an @spaces.GPU() guard. Ensure tokenizer truncation to the MAX lengths in the current config.
  • FAISS: build an IndexFlatIP for MedCPT (768-d). Convert embeddings to float32 before adding. Store index/metadata under /data.
  • Metadata structure: [{"doc_id": int, "filename": str, "chunk_id": int, "text": str}].
  • Query flow: embed query → FAISS search k → concatenate top-k texts into the prompt. Skip rerank if a cross-encoder is too heavy for ZeroGPU; expose an optional flag if the GPU allows. A chunking/search sketch follows this list.
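
A sketch of the chunking and FAISS flow, assuming embeddings come from the MedCPT encoders; the helper names are illustrative:

```python
# Sketch only: 500/50 word chunking and IndexFlatIP build/search for MedCPT.
import faiss
import numpy as np

def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into 500-word chunks with 50-word overlap."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i : i + size]) for i in range(0, len(words), step)]

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    index = faiss.IndexFlatIP(768)  # MedCPT article embeddings are 768-d
    index.add(embeddings.astype(np.float32))  # FAISS requires float32
    return index

def search(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 5):
    scores, ids = index.search(query_vec.astype(np.float32).reshape(1, -1), k)
    return scores[0], ids[0]
```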

6) Hugging Face build/runtime

  • Files:
    • app.py (entry), requirements.txt (gradio, transformers, faiss-cpu, torch; on ZeroGPU a CUDA build of torch is preinstalled, so avoid pinning a CPU-only wheel), optional runtime.txt (e.g., python-3.11), README.md with HF front matter (example after this list), small sample PDFs.
    • If MedCPT or Granite needs auth, set HF_TOKEN secret; use use_auth_token=True where required.
  • Caching: set HF_HOME=/data/.cache to persist model weights between restarts.
  • No Docker needed unless custom system deps; prefer pure pip for Spaces.
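
An illustrative README front matter; sdk_version is a placeholder and should be pinned to whichever Gradio release is actually tested:

```yaml
---
title: MedDiscover
sdk: gradio
sdk_version: 5.49.0  # placeholder; pin the tested release
app_file: app.py
---
```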

7) UI/UX sketch

  • Left rail: API/token textbox (for private models if needed), PDF upload & process button, model dropdown (generator), k slider, max_tokens slider, temperature/top_p, rerank checkbox (if available).
  • Right: chat box with answer; collapsible context display; status messages (index loaded/building).
  • Streaming answers with a stop button; optionally show retrieval scores. A Blocks layout sketch follows this list.
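
A Blocks layout sketch matching this description; component names are illustrative, MODEL_REGISTRY is the registry from section 3, and event wiring (process button, chat submit, stop) is omitted:

```python
# Sketch only: left rail of controls, right panel for chat/context/status.
import gradio as gr

with gr.Blocks(title="MedDiscover") as demo:
    with gr.Row():
        with gr.Column(scale=1):  # left rail
            token_box = gr.Textbox(label="HF token (optional)", type="password")
            pdfs = gr.File(label="Upload PDFs", file_count="multiple", file_types=[".pdf"])
            process_btn = gr.Button("Process PDFs")
            model_dd = gr.Dropdown(choices=list(MODEL_REGISTRY), label="Generator model")
            k_slider = gr.Slider(1, 20, value=5, step=1, label="Top-k chunks")
            max_tok = gr.Slider(64, 1024, value=512, step=64, label="Max new tokens")
            temperature = gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature")
        with gr.Column(scale=2):  # right panel
            chat = gr.Chatbot(label="Answer")
            with gr.Accordion("Retrieved context", open=False):
                context_md = gr.Markdown()
            status = gr.Markdown("Index: not built")
```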

8) Testing & rollout

  • Dry-run locally on CPU with smallest model (e.g., granite-docling-258M) if possible; otherwise rely on HF logs.
  • Deploy to a test Space (ZeroGPU), confirm model loads, index builds on uploaded PDFs, and chat returns grounded answers.
  • Measure cold-start load times per model; consider pre-pinning a default lightweight model to speed startup.

9) Next steps to implement

  1. Scaffold MedDiscover-HF/app.py with common loaders, retrieval pipeline, Gradio UI, and /data persistence.
  2. Add requirements.txt mirroring demos (gradio>=5.x, transformers, faiss-cpu, torch, spaces, model-specific libs like openai-harmony, opencv-python if doing video).
  3. Wire model registry for gpt-oss-20b, gemma-3-12b-it, deepseek-vl2-small, granite-vision, granite-docling; test generation stubs with mock context.
  4. Integrate MedCPT embedding + FAISS; add index build/load actions and guard GPU usage with @spaces.GPU().
  5. Push to HF Space; validate ZeroGPU compatibility and memory footprints; tune max token defaults per model.