MedDiscover-HF Rollout Plan (Hugging Face Spaces, ZeroGPU)

1) Goals & Constraints

  • Host MedDiscover as a Gradio Space using ZeroGPU; no external API models (OpenAI, etc.).
  • Provide a model dropdown with OSS models tested in other Spaces: openai/gpt-oss-20b, google/gemma-3-12b-it, deepseek-vl2-small, ibm-granite/granite-vision family, ibm-granite/granite-docling-258M, plus room for additions.
  • Keep existing pipeline: PDF ingest → chunking (MedCPT tokenizer/encoder) → FAISS retrieval → answer generation via selected OSS model.
  • Must run in the HF build/runtime: single app.py entry (or src/app.py with app_file set), sdk: gradio, dependencies via requirements.txt/pyproject. Persistent storage is limited; assume a /data volume for cached indices.

2) Spaces-specific mechanics to carry over

  • Use @spaces.GPU() (as in gpt-oss demo) to request ZeroGPU for heavy calls; pair with device_map="auto" and torch_dtype="auto"/bfloat16 to fit managed GPUs.
  • For IBM Granite Docling/Vision: some models require use_auth_token=True and trust_remote_code; guard model loading with gr.NO_RELOAD so weights are not reloaded on hot-reload.
  • Streaming: use TextIteratorStreamer with threaded generation (as in the gpt-oss and gemma demos) to keep the UI responsive; see the sketch after this list.
  • System/developer prompts: gpt-oss demo uses Harmony encoding/preprompt parsing; we can simplify to plain chat unless Harmony is desired. If kept, include openai_harmony dependency and message rendering.
  • Media handling: the gemma demo enforces image/video limits and uses <image> tag counting; the docling/granite demos load sample assets and draw bounding boxes. Both are good references for multimodal support.
  • README front matter (title, sdk: gradio, sdk_version, app_file) is required for Spaces config; ZeroGPU Spaces accept the same (see the front-matter example in section 6).
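
A minimal sketch of the ZeroGPU streaming pattern described above, assuming a plain causal LM loaded through transformers; MODEL_ID, the generation defaults, and the function name are illustrative, not the final design:

```python
# Sketch only: ZeroGPU streaming generation following the gpt-oss/gemma demo pattern.
import threading

import gradio as gr
import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "openai/gpt-oss-20b"  # placeholder default

if gr.NO_RELOAD:  # load weights once; skip on Gradio hot-reload
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )

@spaces.GPU()  # request a ZeroGPU slice only for the heavy call
def generate(prompt: str, max_new_tokens: int = 512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    # Generate in a background thread so tokens can be yielded as they arrive.
    threading.Thread(target=model.generate, kwargs=gen_kwargs).start()
    text = ""
    for piece in streamer:
        text += piece
        yield text
```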

3) Architecture on HF

  • Single app.py (or src/app.py) hosting:
    • Model registry: map human-facing names → loader functions/config (pipeline/AutoModel, tokenizer/processor, dtype, device_map, chat template).
    • Embedding and indexing: MedCPT article encoder (requires GPU). Build FAISS index into /data/faiss_index.bin with metadata /data/doc_metadata.json; reuse across sessions if present.
    • Retrieval: embed query with the same encoder, FAISS search (IP for MedCPT), optional rerank (if cross-encoder feasible on available GPU; otherwise skip).
    • Generation: streaming handler wrapping selected model; minimal prompt template: “Use only retrieved context; answer concisely.”
    • Gradio UI: upload PDFs, process PDFs (chunk+index), dropdown for embedding model (MedCPT only here) and generator model (OSS list), sliders for k/max_tokens/temp/etc., chat box showing answer + context.
  • Persistence: point caches/indices to /data (Space persistent storage). Handle cold start by checking disk before loading; see the registry/cold-start sketch below.
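
A sketch of the model registry and the /data cold-start check, using the index/metadata paths named above; the registry keys and fields are illustrative:

```python
# Sketch only: model registry plus /data cold-start check.
import json
import os

import faiss

DATA_DIR = "/data"
INDEX_PATH = os.path.join(DATA_DIR, "faiss_index.bin")
META_PATH = os.path.join(DATA_DIR, "doc_metadata.json")

# Human-facing names -> loader config; loader functions live behind the
# common generation wrapper described in section 4.
MODEL_REGISTRY = {
    "GPT-OSS 20B": {"repo": "openai/gpt-oss-20b", "loader": "pipeline", "dtype": "auto"},
    "Gemma 3 12B IT": {"repo": "google/gemma-3-12b-it", "loader": "gemma3", "dtype": "bfloat16"},
    "Granite Docling 258M": {"repo": "ibm-granite/granite-docling-258M", "loader": "idefics3", "dtype": "bfloat16"},
}

def load_index_if_cached():
    """Reuse the persisted FAISS index/metadata across sessions when present."""
    if os.path.exists(INDEX_PATH) and os.path.exists(META_PATH):
        index = faiss.read_index(INDEX_PATH)
        with open(META_PATH) as f:
            metadata = json.load(f)
        return index, metadata
    return None, None
```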

4) Model integration plan (technical nuances)

  • openai/gpt-oss-20b:
    • Load via pipeline("text-generation", model="openai/gpt-oss-20b", trust_remote_code=True, device_map="auto", torch_dtype="auto").
    • Optionally keep Harmony encoding & @spaces.GPU() wrapper for generation; streaming with TextIteratorStreamer.
  • google/gemma-3-12b-it:
    • AutoProcessor.from_pretrained(..., padding_side="left"), Gemma3ForConditionalGeneration.from_pretrained(..., device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager").
    • Supports images/videos with <image> tags; enforce file count limits if multimodal chat is desired.
  • deepseek-vl2-small (assumed similar VLM):
    • Inspect loader; likely AutoProcessor + vision-language model. Use device_map="auto", torch_dtype=torch.bfloat16, and streaming generation.
  • ibm-granite/granite-vision*:
    • Vision-language; may need trust_remote_code; keep sample image handling (resize/pad) patterns from demo.
  • ibm-granite/granite-docling-258M:
    • Uses AutoProcessor + Idefics3ForConditionalGeneration; requires use_auth_token=True, torch_dtype=torch.bfloat16, device_map=device.
    • Handles DocTags markup and bounding boxes; we can limit to text answers (no drawing) to reduce complexity initially.
  • For all: wrap generation in a common interface that accepts (query, retrieved_chunks, params) → streamed text. Unify token stopping; cap max_new_tokens modestly to stay within ZeroGPU limits. A sketch of this wrapper follows.
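
A sketch of that common interface; the prompt template and parameter names are assumptions, and generate is the @spaces.GPU() streaming wrapper from section 2:

```python
# Sketch only: unified (query, retrieved_chunks, params) -> streamed text.
from typing import Iterator

PROMPT_TEMPLATE = (
    "Use only the retrieved context below; answer concisely.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)

def answer(query: str, retrieved_chunks: list[str], params: dict) -> Iterator[str]:
    context = "\n\n".join(retrieved_chunks)
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    # Cap max_new_tokens modestly to stay within ZeroGPU time limits.
    max_new = min(int(params.get("max_tokens", 512)), 1024)
    yield from generate(prompt, max_new_tokens=max_new)
```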

5) Retrieval/Chunking specifics

  • Chunking: reuse the existing 500-word chunks with 50-word overlap; keep the MedCPT encoders (article + query) loaded once behind an @spaces.GPU() guard. Ensure tokenizer truncation to the MAX lengths in the current config.
  • FAISS: build an IndexFlatIP for MedCPT (768-d). Convert embeddings to float32 before adding. Store index/metadata under /data.
  • Metadata structure: [{"doc_id": int, "filename": str, "chunk_id": int, "text": str}].
  • Query flow: embed query → FAISS search k → concatenate top-k texts into the prompt. Skip rerank if a cross-encoder is too heavy for ZeroGPU; expose an optional flag if the GPU allows. A chunking/search sketch follows this list.
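
A sketch of the chunking and FAISS flow, assuming embeddings come from the MedCPT encoders; the helper names are illustrative:

```python
# Sketch only: 500/50 word chunking and IndexFlatIP build/search for MedCPT.
import faiss
import numpy as np

def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into 500-word chunks with 50-word overlap."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i : i + size]) for i in range(0, len(words), step)]

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    index = faiss.IndexFlatIP(768)  # MedCPT article embeddings are 768-d
    index.add(embeddings.astype(np.float32))  # FAISS requires float32
    return index

def search(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 5):
    scores, ids = index.search(query_vec.astype(np.float32).reshape(1, -1), k)
    return scores[0], ids[0]
```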

6) Hugging Face build/runtime

  • Files:
    • app.py (entry), requirements.txt (gradio, transformers, faiss-cpu, torch; on ZeroGPU a CUDA build of torch is preinstalled, so avoid pinning a CPU-only wheel), optional runtime.txt (e.g., python-3.11), README.md with HF front matter (example after this list), small sample PDFs.
    • If MedCPT or Granite needs auth, set HF_TOKEN secret; use use_auth_token=True where required.
  • Caching: set HF_HOME=/data/.cache to persist model weights between restarts.
  • No Docker needed unless custom system deps; prefer pure pip for Spaces.
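
An illustrative README front matter; sdk_version is a placeholder and should be pinned to whichever Gradio release is actually tested:

```yaml
---
title: MedDiscover
sdk: gradio
sdk_version: 5.49.0  # placeholder; pin the tested release
app_file: app.py
---
```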

7) UI/UX sketch

  • Left rail: API/token textbox (for private models if needed), PDF upload & process button, model dropdown (generator), k slider, max_tokens slider, temperature/top_p, rerank checkbox (if available).
  • Right: chat box with answer; collapsible context display; status messages (index loaded/building).
  • Streaming answers with a stop button; optionally show retrieval scores. A Blocks layout sketch follows this list.
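
A Blocks layout sketch matching this description; component names are illustrative, MODEL_REGISTRY is the registry from section 3, and event wiring (process button, chat submit, stop) is omitted:

```python
# Sketch only: left rail of controls, right panel for chat/context/status.
import gradio as gr

with gr.Blocks(title="MedDiscover") as demo:
    with gr.Row():
        with gr.Column(scale=1):  # left rail
            token_box = gr.Textbox(label="HF token (optional)", type="password")
            pdfs = gr.File(label="Upload PDFs", file_count="multiple", file_types=[".pdf"])
            process_btn = gr.Button("Process PDFs")
            model_dd = gr.Dropdown(choices=list(MODEL_REGISTRY), label="Generator model")
            k_slider = gr.Slider(1, 20, value=5, step=1, label="Top-k chunks")
            max_tok = gr.Slider(64, 1024, value=512, step=64, label="Max new tokens")
            temperature = gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature")
        with gr.Column(scale=2):  # right panel
            chat = gr.Chatbot(label="Answer")
            with gr.Accordion("Retrieved context", open=False):
                context_md = gr.Markdown()
            status = gr.Markdown("Index: not built")
```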

8) Testing & rollout

  • Dry-run locally on CPU with smallest model (e.g., granite-docling-258M) if possible; otherwise rely on HF logs.
  • Deploy to a test Space (ZeroGPU), confirm model loads, index builds on uploaded PDFs, and chat returns grounded answers.
  • Measure cold-start load times per model; consider pre-pinning a default lightweight model to speed startup.

9) Next steps to implement

  1. Scaffold MedDiscover-HF/app.py with common loaders, retrieval pipeline, Gradio UI, and /data persistence.
  2. Add requirements.txt mirroring demos (gradio>=5.x, transformers, faiss-cpu, torch, spaces, model-specific libs like openai-harmony, opencv-python if doing video).
  3. Wire model registry for gpt-oss-20b, gemma-3-12b-it, deepseek-vl2-small, granite-vision, granite-docling; test generation stubs with mock context.
  4. Integrate MedCPT embedding + FAISS; add index build/load actions and guard GPU usage with @spaces.GPU().
  5. Push to HF Space; validate ZeroGPU compatibility and memory footprints; tune max token defaults per model.