Deploying MedGemma 27B on HuggingFace Dedicated Endpoints
This guide walks through deploying google/medgemma-27b-text-it as a
HuggingFace Dedicated Inference Endpoint, which our CDS Agent calls via an
OpenAI-compatible API.
Why HuggingFace Endpoints?
| Feature | Details |
|---|---|
| Model | google/medgemma-27b-text-it (HAI-DEF, competition-required) |
| Cost | ~$2.50/hr (1× A100 80 GB on AWS) |
| Scale-to-zero | Yes (no charges while idle) |
| API format | OpenAI-compatible (TGI); zero code changes |
| Setup time | ~10 minutes |
Prerequisites
- HuggingFace account with a valid payment method.
- MedGemma access β accept the gated-model terms at https://huggingface.co/google/medgemma-27b-text-it. You must agree to Google's Health AI Developer Foundations (HAI-DEF) license.
- A HuggingFace token with `read` scope (already in `.env` as `HF_TOKEN`).
Step-by-step Deployment
1. Create the endpoint
- Model Repository: `google/medgemma-27b-text-it`
- Cloud Provider: AWS (cheapest) or GCP
- Region: `us-east-1` (AWS) or `us-central1` (GCP)
- Instance type: GPU, 1× NVIDIA A100 80 GB
  - AWS: ~$2.50/hr
  - GCP: ~$3.60/hr
- Container type: Text Generation Inference (TGI); this is the default.

Advanced Settings:

- Max Input Length: `12288` (the default 4096 is too small for synthesis prompts)
- Max Total Tokens: `16384`
- Quantization: `none` (bfloat16 fits in 80 GB)
- Scale-to-zero: Enable (idle timeout: 15 min recommended)

Note: The default TGI `MAX_INPUT_TOKENS=4096` will cause 422 errors on longer pipeline prompts (especially synthesis). We found `12288`/`16384` to be sufficient for all 6 pipeline steps.
Click Create Endpoint.
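Before sending a prompt, the `12288`/`16384` limits above can be sanity-checked client-side. A minimal sketch, assuming a rough 4-characters-per-token ratio (a crude heuristic, not MedGemma's real tokenizer) and a hypothetical `fits_limits` helper:

```python
# TGI limits configured on the endpoint (see Advanced Settings above).
MAX_INPUT_TOKENS = 12288
MAX_TOTAL_TOKENS = 16384


def fits_limits(prompt: str, max_new_tokens: int = 1024) -> bool:
    """Estimate whether a prompt fits the endpoint's TGI limits.

    Uses ~4 chars/token as a rough heuristic for English text; prompts
    near the boundary should be checked with the real tokenizer.
    """
    est_prompt_tokens = len(prompt) // 4
    return (est_prompt_tokens <= MAX_INPUT_TOKENS
            and est_prompt_tokens + max_new_tokens <= MAX_TOTAL_TOKENS)


print(fits_limits("chest pain history " * 100))  # short prompt: True
```

A `False` here means the request would likely come back as a 422 from TGI.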
2. Wait for the endpoint to become ready
The first deployment downloads the model weights (~54 GB) and starts the TGI
server. This typically takes 5–15 minutes. The status will change from
Initializing → Running.
3. Configure the CDS Agent
Edit `src/backend/.env`:

```bash
MEDGEMMA_API_KEY=hf_YOUR_TOKEN_HERE
MEDGEMMA_BASE_URL=https://YOUR_ENDPOINT_ID.us-east-1.aws.endpoints.huggingface.cloud/v1
MEDGEMMA_MODEL_ID=tgi
```
- `MEDGEMMA_API_KEY`: Your HuggingFace token (same as `HF_TOKEN`).
- `MEDGEMMA_BASE_URL`: The endpoint URL from the HF dashboard, with `/v1` appended. Example: `https://x1y2z3.us-east-1.aws.endpoints.huggingface.cloud/v1`
- `MEDGEMMA_MODEL_ID`: Use `tgi`; TGI exposes the model under this name by default. Alternatively, you can use the full model name `google/medgemma-27b-text-it`.
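With these three variables, the endpoint is reachable through the standard OpenAI-compatible chat-completions route. A standalone sketch using only the Python standard library (the `build_request` helper and its parameter values are illustrative, not part of the actual `MedGemmaService`):

```python
import json
import os
import urllib.request


def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions request from the env vars configured above."""
    base_url = os.environ["MEDGEMMA_BASE_URL"]  # already ends in /v1
    payload = {
        "model": os.environ.get("MEDGEMMA_MODEL_ID", "tgi"),
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative cap on the response length
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['MEDGEMMA_API_KEY']}",
            "Content-Type": "application/json",
        },
    )


# To actually send it (requires a running endpoint):
# resp = urllib.request.urlopen(build_request("What causes chest pain?"))
```

The CDS Agent itself goes through `MedGemmaService`, but this shows the exact wire format the endpoint expects.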
4. Verify the connection
```bash
cd src/backend
python -c "
import asyncio
from app.services.medgemma import MedGemmaService

async def test():
    svc = MedGemmaService()
    r = await svc.generate('What is the differential diagnosis for chest pain?')
    print(r[:200])

asyncio.run(test())
"
```
You should see a clinical response from MedGemma.
5. Run validation
```bash
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
```
Cost Estimation
| Scenario | Hours | Cost |
|---|---|---|
| Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
| Development / debugging (4 hrs) | ~4 hrs | ~$10 |
| Competition demo recording | ~1 hr | ~$2.50 |
| Total estimated | ~7 hrs | ~$17.50 |
With scale-to-zero enabled, the endpoint automatically shuts down after 15 min of inactivity, so there are no overnight charges.
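The totals in the table are simply hours × the $2.50/hr AWS rate; a trivial arithmetic check (scenario hours taken from the table above):

```python
HOURLY_RATE = 2.50  # 1x A100 80 GB on AWS

hours = {"validation run": 2, "development / debugging": 4, "demo recording": 1}
total_hours = sum(hours.values())
total_cost = total_hours * HOURLY_RATE
print(f"~{total_hours} hrs -> ~${total_cost:.2f}")  # ~7 hrs -> ~$17.50
```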
Troubleshooting
Cold start latency
After scaling to zero, the first request takes 5–15 min while the model reloads. Send a warm-up request before benchmarking.
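The warm-up can be wrapped in a simple retry loop. In this sketch, `ping` stands in for any cheap request (e.g. a short `svc.generate(...)` call), and the retry parameters are illustrative, sized to cover a ~15-minute cold start:

```python
import time


def warm_up(ping, retries: int = 30, delay_s: float = 30.0) -> bool:
    """Poll until the endpoint answers; 30 tries x 30 s covers ~15 min."""
    for _ in range(retries):
        try:
            ping()  # raises (e.g. 503) while the model is still loading
            return True
        except Exception:
            time.sleep(delay_s)
    return False
```

Only start timing benchmark requests after `warm_up` returns `True`.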
403 Forbidden
Your HF token may not have access to the gated model. Verify at https://huggingface.co/google/medgemma-27b-text-it that your account has been granted access.
Out of memory
If the endpoint fails to start, ensure you selected the 80 GB A100, not the 40 GB variant. MedGemma 27B in bfloat16 requires ~54 GB VRAM.
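The ~54 GB figure follows directly from parameter count × bytes per parameter:

```python
# Back-of-envelope weight memory: parameters x bytes per parameter.
params = 27e9        # MedGemma 27B
bytes_per_param = 2  # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for weights alone")  # ~54 GB; KV cache needs headroom on top
```

That leaves ~26 GB of headroom on an 80 GB A100 for the KV cache and activations, while a 40 GB card cannot even hold the weights.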
"model not found" error
TGI exposes the model as `tgi` by default. If you get a model-not-found error,
try setting `MEDGEMMA_MODEL_ID=google/medgemma-27b-text-it` or check the
endpoint's `/v1/models` route.
Deleting the Endpoint
When you're done, delete the endpoint from the HF dashboard to stop all charges:
- Go to https://ui.endpoints.huggingface.co/
- Select your endpoint β Settings β Delete
Comparison with Alternatives
| Platform | GPU | $/hr | Scale-to-Zero | Code Changes | Setup |
|---|---|---|---|---|---|
| HF Endpoints | 1× A100 80 GB | $2.50 | Yes | None | Easy |
| Vertex AI | a2-ultragpu-1g | $5.78 | No | Medium | Medium |
| AWS EC2 (g5.12xlarge) | 4× A10G 96 GB | $5.67 | No (manual) | High | Hard |
| AWS EC2 (p4de.24xlarge) | 8× A100 80 GB | $27.45 | No (manual) | High | Hard |