# Deploy Inference To Your Own HF Dedicated Endpoint

This guide walks through deploying the custom `handler.py` inference runtime to a Hugging Face Dedicated Inference Endpoint.
## Prerequisites
- Hugging Face account
- `HF_TOKEN` with repo write access
- Dedicated Endpoint access on your HF plan
## 1) Create/Update Your Endpoint Repo
```bash
python scripts/hf_clone.py endpoint --repo-id YOUR_USERNAME/YOUR_ENDPOINT_REPO
```
This uploads:
- `handler.py`
- `acestep/requirements.txt`
- `packages.txt`
- endpoint-specific README template
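Before running the upload, it can help to confirm those files exist in your checkout. A minimal sketch, assuming the paths above are relative to the repo root (the `check_endpoint_files` helper is illustrative, not part of the repo):

```python
from pathlib import Path

# Files the endpoint repo is expected to contain (per the list above).
EXPECTED_FILES = [
    "handler.py",
    "acestep/requirements.txt",
    "packages.txt",
]

def check_endpoint_files(root: str) -> list[str]:
    """Return the expected files that are missing under `root`."""
    base = Path(root)
    return [f for f in EXPECTED_FILES if not (base / f).is_file()]

missing = check_endpoint_files(".")
if missing:
    print("Missing before upload:", ", ".join(missing))
```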
## 2) Create Endpoint In HF UI
- Go to Inference Endpoints -> New endpoint.
- Select your custom model repo: `YOUR_USERNAME/YOUR_ENDPOINT_REPO`.
- Choose GPU hardware.
- Deploy.
## 3) Recommended Endpoint Environment Variables
- `ACE_CONFIG_PATH` (default: `acestep-v15-sft`)
- `ACE_LM_MODEL_PATH` (default: `acestep-5Hz-lm-4B`)
- `ACE_LM_BACKEND` (default: `pt`)
- `ACE_DOWNLOAD_SOURCE` (`huggingface` or `modelscope`)
- `ACE_ENABLE_FALLBACK` (`false` recommended for strict failure visibility)
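Inside a handler, these variables would typically be read once at startup with the documented defaults as fallbacks. A minimal sketch of that pattern — the function name `load_ace_config` and the assumption that `ACE_ENABLE_FALLBACK` defaults to `false` are illustrative, not taken from the actual handler code:

```python
import os

def load_ace_config() -> dict:
    """Read the endpoint environment variables, falling back to the documented defaults."""
    return {
        "config_path": os.environ.get("ACE_CONFIG_PATH", "acestep-v15-sft"),
        "lm_model_path": os.environ.get("ACE_LM_MODEL_PATH", "acestep-5Hz-lm-4B"),
        "lm_backend": os.environ.get("ACE_LM_BACKEND", "pt"),
        "download_source": os.environ.get("ACE_DOWNLOAD_SOURCE", "huggingface"),
        # Assumed default "false": keeps failures visible instead of silently falling back.
        "enable_fallback": os.environ.get("ACE_ENABLE_FALLBACK", "false").lower() == "true",
    }
```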
## 4) Test The Endpoint
Set credentials:
```bash
# Linux/macOS
export HF_TOKEN=hf_xxx
export HF_ENDPOINT_URL=https://your-endpoint-url.endpoints.huggingface.cloud
```

```powershell
# Windows PowerShell
$env:HF_TOKEN="hf_xxx"
$env:HF_ENDPOINT_URL="https://your-endpoint-url.endpoints.huggingface.cloud"
```
Test with:
- `python scripts/endpoint/generate_interactive.py`
- `scripts/endpoint/test.ps1`
## Request Contract
```json
{
  "inputs": {
    "prompt": "upbeat pop rap with emotional guitar",
    "lyrics": "[Verse] city lights and midnight rain",
    "duration_sec": 12,
    "sample_rate": 44100,
    "seed": 42,
    "guidance_scale": 7.0,
    "steps": 50,
    "use_lm": true
  }
}
```
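The contract above can also be exercised from a short stdlib-only Python client. A sketch, assuming `HF_TOKEN` and `HF_ENDPOINT_URL` are set as in step 4; the `build_payload` helper and the treatment of the response as raw bytes are illustrative assumptions, not part of the repo's scripts:

```python
import json
import os
import urllib.request

def build_payload(prompt: str, lyrics: str, **overrides) -> dict:
    """Assemble a request body matching the contract above."""
    inputs = {
        "prompt": prompt,
        "lyrics": lyrics,
        "duration_sec": 12,
        "sample_rate": 44100,
        "seed": 42,
        "guidance_scale": 7.0,
        "steps": 50,
        "use_lm": True,
    }
    inputs.update(overrides)  # e.g. seed=7 to override a default
    return {"inputs": inputs}

def post_generation(payload: dict) -> bytes:
    """POST the payload to the endpoint and return the raw response body."""
    req = urllib.request.Request(
        os.environ["HF_ENDPOINT_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

payload = build_payload(
    "upbeat pop rap with emotional guitar",
    "[Verse] city lights and midnight rain",
    seed=7,
)
```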
## Cost Control
- Use scale-to-zero for idle periods.
- Pause endpoint for immediate spend stop.
- Expect cold starts when scaled to zero.