File size: 4,955 Bytes
1f36481 eaea340 1f36481 5d53fbf 1f36481 5d53fbf 1f36481 eaea340 1f36481 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 | # Deploying MedGemma 27B on HuggingFace Dedicated Endpoints
This guide walks through deploying `google/medgemma-27b-text-it` as a
HuggingFace Dedicated Inference Endpoint, which our CDS Agent calls via an
OpenAI-compatible API.
## Why HuggingFace Endpoints?
| Feature | Details |
|---|---|
| **Model** | `google/medgemma-27b-text-it` (HAI-DEF) |
| **Cost** | ~$2.50/hr (1× A100 80 GB on AWS) |
| **Scale-to-zero** | Yes — no charges while idle |
| **API format** | OpenAI-compatible (TGI) — zero code changes |
| **Setup time** | ~10 minutes |
## Prerequisites
1. **HuggingFace account** with a valid payment method.
2. **MedGemma access** — accept the gated-model terms at
<https://huggingface.co/google/medgemma-27b-text-it>. You must agree to
Google's Health AI Developer Foundations (HAI-DEF) license.
3. A **HuggingFace token** with `read` scope (already in `.env` as `HF_TOKEN`).
## Step-by-step Deployment
### 1. Create the endpoint
1. Go to <https://ui.endpoints.huggingface.co/new>.
2. **Model Repository**: `google/medgemma-27b-text-it`
3. **Cloud Provider**: AWS (cheapest) or GCP
4. **Region**: `us-east-1` (AWS) or `us-central1` (GCP)
5. **Instance type**: GPU — **1× NVIDIA A100 80 GB**
- AWS: ~$2.50/hr
- GCP: ~$3.60/hr
6. **Container type**: Text Generation Inference (TGI) — this is the default.
7. **Advanced Settings**:
- **Max Input Length**: `12288` (default 4096 is too small for synthesis prompts)
- **Max Total Tokens**: `16384`
- **Quantization**: `none` (bfloat16 fits in 80 GB)
- **Scale-to-zero**: **Enable** (idle timeout: 15 min recommended)
> **Note:** The default TGI `MAX_INPUT_TOKENS=4096` will cause 422 errors
> on longer pipeline prompts (especially synthesis). We found `12288` /
> `16384` to be sufficient for all 6 pipeline steps.
8. Click **Create Endpoint**.
### 2. Wait for the endpoint to become ready
The first deployment downloads the model weights (~54 GB) and starts the TGI
server. This typically takes **5–15 minutes**. The status will change from
`Initializing` → `Running`.
### 3. Configure the CDS Agent
Edit `src/backend/.env`:
```dotenv
MEDGEMMA_API_KEY=hf_YOUR_TOKEN_HERE
MEDGEMMA_BASE_URL=https://YOUR_ENDPOINT_ID.us-east-1.aws.endpoints.huggingface.cloud/v1
MEDGEMMA_MODEL_ID=tgi
```
- **`MEDGEMMA_API_KEY`**: Your HuggingFace token (same as `HF_TOKEN`).
- **`MEDGEMMA_BASE_URL`**: The endpoint URL from the HF dashboard, with `/v1`
appended. Example:
`https://x1y2z3.us-east-1.aws.endpoints.huggingface.cloud/v1`
- **`MEDGEMMA_MODEL_ID`**: Use `tgi` — TGI exposes the model under this name
by default. Alternatively, you can use the full model name
`google/medgemma-27b-text-it`.
### 4. Verify the connection
```bash
cd src/backend
python -c "
import asyncio
from app.services.medgemma import MedGemmaService
async def test():
svc = MedGemmaService()
r = await svc.generate('What is the differential diagnosis for chest pain?')
print(r[:200])
asyncio.run(test())
"
```
You should see a clinical response from MedGemma.
### 5. Run validation
```bash
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
```
## Cost Estimation
| Scenario | Hours | Cost |
|---|---|---|
| Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
| Development / debugging (4 hrs) | ~4 hrs | ~$10 |
| Demo recording | ~1 hr | ~$2.50 |
| **Total estimated** | **~7 hrs** | **~$17.50** |
With scale-to-zero enabled, the endpoint automatically shuts down after 15 min
of inactivity — no overnight charges.
## Troubleshooting
### Cold start latency
After scaling to zero, the first request takes 5–15 min while the model
reloads. Send a warm-up request before benchmarking.
### 403 Forbidden
Your HF token may not have access to the gated model. Verify at
<https://huggingface.co/google/medgemma-27b-text-it> that your account has been
granted access.
### Out of memory
If the endpoint fails to start, ensure you selected the **80 GB** A100, not the
40 GB variant. MedGemma 27B in bfloat16 requires ~54 GB VRAM.
### "model not found" error
TGI exposes the model as `tgi` by default. If you get a model-not-found error,
try setting `MEDGEMMA_MODEL_ID=google/medgemma-27b-text-it` or check the
endpoint's `/v1/models` route.
## Deleting the Endpoint
When you're done, delete the endpoint from the HF dashboard to stop all
charges:
1. Go to <https://ui.endpoints.huggingface.co/>
2. Select your endpoint → **Settings** → **Delete**
## Comparison with Alternatives
| Platform | GPU | $/hr | Scale-to-Zero | Code Changes | Setup |
|---|---|---|---|---|---|
| **HF Endpoints** | 1× A100 80 GB | **$2.50** | **Yes** | **None** | **Easy** |
| Vertex AI | a2-ultragpu-1g | $5.78 | No | Medium | Medium |
| AWS EC2 (g5.12xlarge) | 4× A10G 96 GB | $5.67 | No (manual) | High | Hard |
| AWS EC2 (p4de.24xlarge) | 8× A100 80 GB | $27.45 | No (manual) | High | Hard |
|