---
license: other
tags:
- medical
- biomedical
- multimodal
- text-generation
- fp8
- quantization
- vllm
- medgemma
library_name: transformers
base_model:
- google/medgemma-27b-it
---

# MedGemma-27B-IT-FP8-Dynamic

## Overview
**MedGemma-27B-IT-FP8-Dynamic** is an **FP8 Dynamic–quantized** derivative of **Google’s MedGemma-27B-IT**, optimized for high-throughput inference while preserving strong performance on medical and biomedical instruction-following tasks.

This build is intended for **vLLM deployment** on modern NVIDIA GPUs and follows a **conservative FP8 Dynamic quantization strategy** that avoids known instability issues in the vision components.

---

## Base Model
- **Base model**: `google/medgemma-27b-it`
- **Architecture**: Decoder-only Transformer (instruction-tuned)
- **Domain**: Medical / biomedical NLP
- **Modality**: Multimodal (text + vision); FP8 quantization is applied to the text stack only

---

## Quantization Details
- **Method**: FP8 Dynamic (static FP8 weights; activation scales computed per token at runtime)
- **Tooling**: `llmcompressor`
- **Quantized layers**: `Linear` layers
- **Excluded components**:
  - `lm_head`
  - Vision tower and multimodal projection layers
    (`vision_tower`, `visual`, `vision_model`, `multi_modal_projector`, etc.)

### Rationale
- FP8 roughly halves weight VRAM relative to BF16 and improves throughput.
- Vision-related modules are intentionally excluded: quantizing them is a known source of instability and is unnecessary for text-centric inference.
- The resulting model is stable and compatible with **vLLM**.

**Weights are already quantized — do not apply runtime quantization.**

---
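For reference, the quantization step described above can be sketched with `llmcompressor`'s one-shot flow. This is an illustrative sketch, not the exact script used to produce this checkpoint: import paths, argument names, and the regex exclusion patterns below are assumptions that may vary across `llmcompressor` versions.

```python
# Illustrative FP8 Dynamic recipe sketch (NOT the exact script used to
# produce this checkpoint; llmcompressor APIs vary by version).

# Modules kept in full precision, mirroring the exclusion list above.
EXCLUDED_MODULES = [
    "lm_head",
    "re:.*vision_tower.*",
    "re:.*vision_model.*",
    "re:.*visual.*",
    "re:.*multi_modal_projector.*",
]


def quantize_fp8_dynamic(
    model_id: str = "google/medgemma-27b-it",
    output_dir: str = "medgemma-27b-it-FP8-Dynamic",
) -> None:
    """One-shot FP8 Dynamic quantization of all Linear layers."""
    # Imports deferred so the recipe constants are inspectable without
    # llmcompressor installed.
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.transformers import oneshot

    recipe = QuantizationModifier(
        targets="Linear",         # quantize every Linear layer...
        scheme="FP8_DYNAMIC",     # static FP8 weights, dynamic activation scales
        ignore=EXCLUDED_MODULES,  # ...except lm_head and vision modules
    )
    # FP8_DYNAMIC is data-free: no calibration dataset is required.
    oneshot(model=model_id, recipe=recipe, output_dir=output_dir)


# quantize_fp8_dynamic()  # requires llmcompressor and the full BF16 weights
```

Since the scheme is data-free, the run needs no calibration set; only enough GPU or CPU memory to hold the BF16 weights once.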

## Intended Use
- Medical and biomedical instruction-following
- Clinical text summarization and analysis
- Medical RAG pipelines
- Decision-support and research assistance

---

## Deployment (vLLM)

### Recommended
```bash
vllm serve ig1/medgemma-27b-it-FP8-Dynamic \
  --served-model-name medgemma-27b-it-fp8 \
  --dtype auto
```
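Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal, stdlib-only client sketch follows; the host and port assume vLLM's default (`localhost:8000`), and the model name matches `--served-model-name` above:

```python
import json
from urllib import request

# Build an OpenAI-style chat request for the vLLM server started above.
# Assumes the default vLLM address http://localhost:8000.
payload = {
    "model": "medgemma-27b-it-fp8",
    "messages": [
        {"role": "system", "content": "You are a careful medical assistant."},
        {"role": "user", "content": "Summarize the key findings in this note: ..."},
    ],
    "temperature": 0.2,
    "max_tokens": 512,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment with a running server:
# resp = request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `base_url="http://localhost:8000/v1"`) works the same way.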