MedDiscover-HF Rollout Plan (Hugging Face Spaces, ZeroGPU)
1) Goals & Constraints
- Host MedDiscover as a Gradio Space using ZeroGPU; no external API models (OpenAI, etc.).
- Provide a model dropdown with OSS models tested in other Spaces: `openai/gpt-oss-20b`, `google/gemma-3-12b-it`, `deepseek-vl2-small`, the `ibm-granite/granite-vision` family, `ibm-granite/granite-docling-258M`, plus room for additions.
- Keep existing pipeline: PDF ingest → chunking (MedCPT tokenizer/encoder) → FAISS retrieval → answer generation via selected OSS model.
- Must run in HF build/runtime: single `app.py` entry (or `src/app.py` with `app_file` set), `sdk: gradio`, dependencies via `requirements.txt`/pyproject. Persistent storage is limited; assume a `/data` volume for cached indices.
2) Spaces-specific mechanics to carry over
- Use `@spaces.GPU()` (as in the gpt-oss demo) to request ZeroGPU for heavy calls; pair with `device_map="auto"` and `torch_dtype="auto"`/`bfloat16` to fit managed GPUs (sketched after this list).
- For IBM Granite Docling/Vision: models require `use_auth_token=True` and `trust_remote_code` in some cases; respect the `gr.NO_RELOAD` guard to avoid reloading on hot-reload.
- Streaming: use `TextIteratorStreamer` threaded generation (seen in gpt-oss, gemma) to keep the UI responsive.
- System/developer prompts: the gpt-oss demo uses Harmony encoding/preprompt parsing; we can simplify to plain chat unless Harmony is desired. If kept, include the `openai_harmony` dependency and message rendering.
- Media handling: the gemma demo enforces image/video limits and uses `<image>` tag counting; the docling/granite demos load sample assets and draw boxes. Both are good references for multimodal support.
- README front matter (`title`, `sdk: gradio`, `sdk_version`, `app_file`) is required for Spaces config; ZeroGPU Spaces accept the same.
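A minimal sketch of the `@spaces.GPU()` + `TextIteratorStreamer` pattern, assuming an `AutoModelForCausalLM`-style loader (the gpt-oss demo itself goes through `pipeline` + Harmony; the model ID and parameters here are placeholders):

```python
import threading

import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "openai/gpt-oss-20b"  # placeholder; selected via the model dropdown

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")

@spaces.GPU()  # requests a ZeroGPU slice only for the duration of this call
def generate_stream(prompt: str, max_new_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generate() on a worker thread so tokens can be yielded as they arrive.
    thread = threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()
    partial = ""
    for token_text in streamer:
        partial += token_text
        yield partial  # Gradio re-renders the output on every yield
```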
3) Architecture on HF
- Single `app.py` (or `src/app.py`) hosting:
  - Model registry: map human-facing names → loader functions/config (pipeline/AutoModel, tokenizer/processor, dtype, device_map, chat template); a registry sketch follows this list.
  - Embedding and indexing: MedCPT article encoder (requires GPU). Build the FAISS index into `/data/faiss_index.bin` with metadata in `/data/doc_metadata.json`; reuse across sessions if present.
  - Retrieval: embed the query with the same encoder, FAISS search (IP for MedCPT), optional rerank (if a cross-encoder is feasible on the available GPU; otherwise skip).
  - Generation: streaming handler wrapping the selected model; minimal prompt template: "Use only retrieved context; answer concisely."
  - Gradio UI: upload PDFs, process PDFs (chunk+index), dropdown for embedding model (MedCPT only here) and generator model (OSS list), sliders for k/max_tokens/temp/etc., chat box showing answer + context.
- Persistence: point caches/indices to `/data` (Space persistent storage). Handle cold start by checking disk before loading.
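A sketch of the registry shape, assuming each entry pairs a lazy loader with per-model generation defaults (the dataclass fields and loader names are illustrative, not a settled API):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ModelEntry:
    repo_id: str
    loader: Callable[[], Any]       # returns (model, tokenizer/processor), loaded lazily
    dtype: str = "bfloat16"
    trust_remote_code: bool = False
    max_new_tokens: int = 512       # conservative cap for ZeroGPU time budgets
    extra: dict = field(default_factory=dict)

def _load_gpt_oss():
    # Placeholder; the real loader mirrors the gpt-oss demo (pipeline + optional Harmony).
    raise NotImplementedError

MODEL_REGISTRY: dict[str, ModelEntry] = {
    "GPT-OSS 20B": ModelEntry("openai/gpt-oss-20b", _load_gpt_oss,
                              dtype="auto", trust_remote_code=True),
    # ... gemma-3-12b-it, deepseek-vl2-small, granite-vision, granite-docling-258M
}
```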
4) Model integration plan (technical nuances)
- `openai/gpt-oss-20b`:
  - Load via `pipeline("text-generation", trust_remote_code=True, device_map="auto", torch_dtype="auto")`.
  - Optionally keep Harmony encoding and the `@spaces.GPU()` wrapper for generation; stream with `TextIteratorStreamer`.
- `google/gemma-3-12b-it`:
  - Load via `AutoProcessor.from_pretrained(..., padding_side="left")` and `Gemma3ForConditionalGeneration.from_pretrained(..., device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager")`.
  - Supports images/videos with `<image>` tags; enforce file count limits if multimodal chat is desired.
- `deepseek-vl2-small` (assumed similar VLM):
  - Inspect the loader; likely `AutoProcessor` + a vision-language model. Use `device_map="auto"`, `torch_dtype=torch.bfloat16`, and streaming generation.
- `ibm-granite/granite-vision*`:
  - Vision-language; may need `trust_remote_code`; keep the sample image handling (resize/pad) patterns from the demo.
- `ibm-granite/granite-docling-258M`:
  - Uses `AutoProcessor` + `Idefics3ForConditionalGeneration`; requires `use_auth_token=True`, `torch_dtype=torch.bfloat16`, `device_map=device`.
  - Handles DocTags markup and bounding boxes; we can limit to text answers (no drawing) to reduce complexity initially.
- For all: wrap generation in a common interface that accepts (query, retrieved_chunks, params) → streamed text (sketched after this list). Unify token stopping; cap max_new_tokens modestly to stay within ZeroGPU limits.
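A hedged sketch of that common interface; `GenerationParams`, the prompt wording, and the per-model `entry_stream` adapter are assumptions, not settled names:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class GenerationParams:
    max_new_tokens: int = 512   # kept modest for ZeroGPU limits
    temperature: float = 0.2
    top_p: float = 0.9

PROMPT_TEMPLATE = (
    "Use only the retrieved context to answer; answer concisely.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)

def answer_stream(model_name: str, query: str, retrieved_chunks: list[str],
                  params: GenerationParams) -> Iterator[str]:
    """Single entry point: every registry model is driven through this signature."""
    prompt = PROMPT_TEMPLATE.format(context="\n---\n".join(retrieved_chunks), query=query)
    entry = MODEL_REGISTRY[model_name]              # registry sketch from section 3
    yield from entry_stream(entry, prompt, params)  # hypothetical per-model adapter
```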
5) Retrieval/Chunking specifics
- Chunking: reuse the existing 500-word chunks with 50-word overlap; keep the MedCPT encoders (article + query) loaded once behind the `@spaces.GPU()` guard. Ensure tokenizer truncation to the MAX lengths (as in the current config).
- FAISS: build an `IndexFlatIP` for MedCPT (768-d). Convert embeddings to float32 before adding. Store index/metadata under `/data`.
- Metadata structure: `[{"doc_id": int, "filename": str, "chunk_id": int, "text": str}]`.
- Query flow: embed query → FAISS search top-k → concatenate top-k texts into the prompt. Skip rerank if a cross-encoder is too heavy for ZeroGPU; expose an optional flag if the GPU allows. A build/search sketch follows this list.
6) Hugging Face build/runtime
- Files: `app.py` (entry), `requirements.txt` (gradio, transformers, faiss-cpu, torch==2.3+cpu?; on ZeroGPU, torch with CUDA is preinstalled), optional `runtime.txt` (e.g., `python-3.11`), `README.md` with HF front matter (sketched below), and small sample PDFs.
- If MedCPT or Granite needs auth, set an `HF_TOKEN` secret; use `use_auth_token=True` where required.
- Caching: set `HF_HOME=/data/.cache` to persist model weights between restarts.
- No Docker needed unless custom system deps are required; prefer pure pip for Spaces.
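For reference, a minimal README front matter along those lines (`sdk_version` is a placeholder to pin to whichever release is actually tested):

```yaml
---
title: MedDiscover
sdk: gradio
sdk_version: 5.0.0   # placeholder; pin to the tested release
app_file: app.py
---
```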
7) UI/UX sketch
- Left rail: API/token textbox (for private models if needed), PDF upload & process button, model dropdown (generator), k slider, max_tokens slider, temperature/top_p, rerank checkbox (if available).
- Right: chat box with answer; collapsible context display; status messages (index loaded/building).
- Streaming answers with a stop button; optionally show retrieval scores. A Blocks layout sketch follows this list.
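A compact `gr.Blocks` sketch of that layout (component names are illustrative; `MODEL_REGISTRY` refers to the registry sketch above, and event wiring is elided):

```python
import gradio as gr

with gr.Blocks(title="MedDiscover") as demo:
    with gr.Row():
        with gr.Column(scale=1):  # left rail
            token_box = gr.Textbox(label="HF token (optional)", type="password")
            pdfs = gr.File(label="Upload PDFs", file_count="multiple", file_types=[".pdf"])
            process_btn = gr.Button("Process PDFs")
            model_dd = gr.Dropdown(choices=list(MODEL_REGISTRY), label="Generator model")
            k_slider = gr.Slider(1, 20, value=5, step=1, label="Top-k chunks")
            max_tokens = gr.Slider(64, 2048, value=512, step=64, label="Max new tokens")
            temperature = gr.Slider(0.0, 1.5, value=0.2, step=0.05, label="Temperature")
        with gr.Column(scale=2):  # right side
            status = gr.Markdown("Index: not built")
            chat = gr.Chatbot(label="MedDiscover")
            question = gr.Textbox(label="Ask a question")
            with gr.Accordion("Retrieved context", open=False):
                context_view = gr.Markdown()
    # process_btn.click(...) builds the index; question.submit(...) streams answers.

demo.launch()
```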
8) Testing & rollout
- Dry-run locally on CPU with the smallest model (e.g., granite-docling-258M) if possible; otherwise rely on HF logs.
- Deploy to a test Space (ZeroGPU), confirm model loads, index builds on uploaded PDFs, and chat returns grounded answers.
- Measure cold-start load times per model; consider pre-pinning a default lightweight model to speed startup.
9) Next steps to implement
- Scaffold `MedDiscover-HF/app.py` with common loaders, the retrieval pipeline, the Gradio UI, and `/data` persistence.
- Add `requirements.txt` mirroring the demos (`gradio>=5.x`, `transformers`, `faiss-cpu`, `torch`, `spaces`, plus model-specific libs like `openai-harmony`, and `opencv-python` if doing video).
- Wire the model registry for gpt-oss-20b, gemma-3-12b-it, deepseek-vl2-small, granite-vision, and granite-docling; test generation stubs with mock context.
- Integrate MedCPT embedding + FAISS; add index build/load actions and guard GPU usage with `@spaces.GPU()`.
- Push to an HF Space; validate ZeroGPU compatibility and memory footprints; tune max-token defaults per model.