cds-agent / docs /deploy_medgemma_hf.md
bshepp
docs: full documentation vs reality audit
5d53fbf
# Deploying MedGemma 27B on HuggingFace Dedicated Endpoints
This guide walks through deploying `google/medgemma-27b-text-it` as a
HuggingFace Dedicated Inference Endpoint, which our CDS Agent calls via an
OpenAI-compatible API.
## Why HuggingFace Endpoints?
| Feature | Details |
|---|---|
| **Model** | `google/medgemma-27b-text-it` (HAI-DEF, competition-required) |
| **Cost** | ~$2.50/hr (1Γ— A100 80 GB on AWS) |
| **Scale-to-zero** | Yes β€” no charges while idle |
| **API format** | OpenAI-compatible (TGI) β€” zero code changes |
| **Setup time** | ~10 minutes |
## Prerequisites
1. **HuggingFace account** with a valid payment method.
2. **MedGemma access** β€” accept the gated-model terms at
<https://huggingface.co/google/medgemma-27b-text-it>. You must agree to
Google's Health AI Developer Foundations (HAI-DEF) license.
3. A **HuggingFace token** with `read` scope (already in `.env` as `HF_TOKEN`).
## Step-by-step Deployment
### 1. Create the endpoint
1. Go to <https://ui.endpoints.huggingface.co/new>.
2. **Model Repository**: `google/medgemma-27b-text-it`
3. **Cloud Provider**: AWS (cheapest) or GCP
4. **Region**: `us-east-1` (AWS) or `us-central1` (GCP)
5. **Instance type**: GPU β€” **1Γ— NVIDIA A100 80 GB**
- AWS: ~$2.50/hr
- GCP: ~$3.60/hr
6. **Container type**: Text Generation Inference (TGI) β€” this is the default.
7. **Advanced Settings**:
- **Max Input Length**: `12288` (default 4096 is too small for synthesis prompts)
- **Max Total Tokens**: `16384`
- **Quantization**: `none` (bfloat16 fits in 80 GB)
- **Scale-to-zero**: **Enable** (idle timeout: 15 min recommended)
> **Note:** The default TGI `MAX_INPUT_TOKENS=4096` will cause 422 errors
> on longer pipeline prompts (especially synthesis). We found `12288` /
> `16384` to be sufficient for all 6 pipeline steps.
8. Click **Create Endpoint**.
### 2. Wait for the endpoint to become ready
The first deployment downloads the model weights (~54 GB) and starts the TGI
server. This typically takes **5–15 minutes**. The status will change from
`Initializing` β†’ `Running`.
### 3. Configure the CDS Agent
Edit `src/backend/.env`:
```dotenv
MEDGEMMA_API_KEY=hf_YOUR_TOKEN_HERE
MEDGEMMA_BASE_URL=https://YOUR_ENDPOINT_ID.us-east-1.aws.endpoints.huggingface.cloud/v1
MEDGEMMA_MODEL_ID=tgi
```
- **`MEDGEMMA_API_KEY`**: Your HuggingFace token (same as `HF_TOKEN`).
- **`MEDGEMMA_BASE_URL`**: The endpoint URL from the HF dashboard, with `/v1`
appended. Example:
`https://x1y2z3.us-east-1.aws.endpoints.huggingface.cloud/v1`
- **`MEDGEMMA_MODEL_ID`**: Use `tgi` β€” TGI exposes the model under this name
by default. Alternatively, you can use the full model name
`google/medgemma-27b-text-it`.
### 4. Verify the connection
```bash
cd src/backend
python -c "
import asyncio
from app.services.medgemma import MedGemmaService
async def test():
svc = MedGemmaService()
r = await svc.generate('What is the differential diagnosis for chest pain?')
print(r[:200])
asyncio.run(test())
"
```
You should see a clinical response from MedGemma.
### 5. Run validation
```bash
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
```
## Cost Estimation
| Scenario | Hours | Cost |
|---|---|---|
| Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
| Development / debugging (4 hrs) | ~4 hrs | ~$10 |
| Competition demo recording | ~1 hr | ~$2.50 |
| **Total estimated** | **~7 hrs** | **~$17.50** |
With scale-to-zero enabled, the endpoint automatically shuts down after 15 min
of inactivity β€” no overnight charges.
## Troubleshooting
### Cold start latency
After scaling to zero, the first request takes 5–15 min while the model
reloads. Send a warm-up request before benchmarking.
### 403 Forbidden
Your HF token may not have access to the gated model. Verify at
<https://huggingface.co/google/medgemma-27b-text-it> that your account has been
granted access.
### Out of memory
If the endpoint fails to start, ensure you selected the **80 GB** A100, not the
40 GB variant. MedGemma 27B in bfloat16 requires ~54 GB VRAM.
### "model not found" error
TGI exposes the model as `tgi` by default. If you get a model-not-found error,
try setting `MEDGEMMA_MODEL_ID=google/medgemma-27b-text-it` or check the
endpoint's `/v1/models` route.
## Deleting the Endpoint
When you're done, delete the endpoint from the HF dashboard to stop all
charges:
1. Go to <https://ui.endpoints.huggingface.co/>
2. Select your endpoint β†’ **Settings** β†’ **Delete**
## Comparison with Alternatives
| Platform | GPU | $/hr | Scale-to-Zero | Code Changes | Setup |
|---|---|---|---|---|---|
| **HF Endpoints** | 1Γ— A100 80 GB | **$2.50** | **Yes** | **None** | **Easy** |
| Vertex AI | a2-ultragpu-1g | $5.78 | No | Medium | Medium |
| AWS EC2 (g5.12xlarge) | 4Γ— A10G 96 GB | $5.67 | No (manual) | High | Hard |
| AWS EC2 (p4de.24xlarge) | 8Γ— A100 80 GB | $27.45 | No (manual) | High | Hard |