Create README.md

README.md +66 -0

	@@ -0,0 +1,66 @@

+---
+license: other
+tags:
+- medical
+- biomedical
+- multimodal
+- text-generation
+- fp8
+- quantization
+- vllm
+- medgemma
+library_name: transformers
+base_model:
+- google/medgemma-27b-it
+---
+# MedGemma-27B-IT-FP8-Dynamic
+## Overview
+**MedGemma-27B-IT-FP8-Dynamic** is an **FP8 Dynamic–quantized** derivative of **Google’s MedGemma-27B-IT** model, optimized for high-throughput inference while preserving strong performance on medical and biomedical instruction-tuned tasks.
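+This version is intended for **vLLM deployment** on NVIDIA GPUs with native FP8 support (Ada Lovelace, Hopper, or newer). Only the language-model linear layers are quantized; the vision components stay in their original precision, which avoids the instability that quantizing them is known to cause.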
+---
+## Base Model
+- **Base model**: `google/medgemma-27b-it`
+- **Architecture**: Decoder-only Transformer (instruction-tuned)
+- **Domain**: Medical / Biomedical NLP
+- **Modality**: Multimodal (text + vision); **FP8 quantization is applied to the text (language-model) path only**
+---
+## Quantization Details
+- **Method**: FP8 Dynamic
+- **Tooling**: `llmcompressor`
+- **Quantized layers**: Linear layers
+- **Excluded components**:
+  - `lm_head`
+  - Vision tower and multimodal projection layers
+    (`vision_tower`, `visual`, `vision_model`, `multi_modal_projector`, etc.)
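A minimal sketch of how a checkpoint like this could be produced with `llmcompressor`, assuming the `FP8_DYNAMIC` scheme and regex-style ignore entries. The exact regexes, the `is_excluded` helper, the `quantize` function, and the output directory name are illustrative assumptions and not the recipe actually used for this repository.

```python
import re

# Exclusion patterns mirroring the list above. The "re:" prefix marks a regex
# entry in llmcompressor ignore lists; the exact regexes here are assumptions.
IGNORE = [
    "lm_head",
    "re:.*vision_tower.*",
    "re:.*visual.*",
    "re:.*vision_model.*",
    "re:.*multi_modal_projector.*",
]


def is_excluded(module_name: str, patterns=IGNORE) -> bool:
    """Hypothetical helper: approximates how the ignore list filters module names."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern or module_name.endswith("." + pattern):
            return True
    return False


def quantize(model_id: str = "google/medgemma-27b-it",
             out_dir: str = "medgemma-27b-it-FP8-Dynamic") -> None:
    """Sketch of a one-shot FP8 Dynamic pass; not the author's confirmed script."""
    # Local imports keep the helper above usable without the heavy dependencies.
    from transformers import AutoModelForImageTextToText, AutoProcessor
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # Quantize every Linear layer except the ignored modules.
    recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=IGNORE)
    oneshot(model=model, recipe=recipe)  # dynamic activation scales: no calibration set

    model.save_pretrained(out_dir, save_compressed=True)
    processor.save_pretrained(out_dir)
```

Running `is_excluded` over `model.named_modules()` before quantizing is a cheap way to confirm that the vision tower and projector are really skipped, since module names can differ between `transformers` versions.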
+### Rationale
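+- FP8 Dynamic stores each quantized weight in 8 bits instead of 16, which roughly halves the memory of the quantized layers and improves throughput.
+- Vision-related modules are intentionally excluded: quantizing them risks instability and brings little benefit for text-centric inference.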
+- The resulting model is stable and compatible with **vLLM**.
+
+**Weights are already quantized; do not apply runtime quantization on top of them.**
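As a rough illustration of the VRAM claim above, the following back-of-the-envelope estimate compares weight memory at 16-bit and 8-bit precision. It assumes a nominal 27 billion parameters (taken from the model name) and treats every weight as quantized, so it overstates the saving slightly: `lm_head` and the vision modules stay in 16-bit, and KV cache and activations are not counted.

```python
def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB for a given parameter count and width."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 27e9  # nominal parameter count, an assumption based on the model name

bf16_gib = weight_gib(N_PARAMS, 2)  # 16-bit weights: 2 bytes per parameter
fp8_gib = weight_gib(N_PARAMS, 1)   # FP8 weights: 1 byte per parameter

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f"FP8 weights:    ~{fp8_gib:.1f} GiB")
```

The real footprint of this checkpoint will be somewhat higher than the FP8 figure because of the excluded modules and the per-channel scale tensors.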
+---
+## Intended Use
+- Medical and biomedical instruction-following
+- Clinical text summarization and analysis
+- Medical RAG pipelines
+- Decision-support and research assistance
+---
+## Deployment (vLLM)
+### Recommended
+```bash
+vllm serve ig1/medgemma-27b-it-FP8-Dynamic \
+  --served-model-name medgemma-27b-it-fp8 \
+  --dtype auto