Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)
This path uses NVIDIA's LLaVA-based stack plus the stage35 "think" adapter, which matches the quality profile of:
https://huggingface.co/spaces/nvidia/audio-flamingo-3
1) Create endpoint runtime repo
python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO
This pushes:
- handler.py
- requirements.txt
- README.md
from templates/hf-af3-nvidia-endpoint/.
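For orientation, a custom Inference Endpoints handler follows the `EndpointHandler` contract (a class with `__init__(self, path="")` and `__call__(self, data)`). The sketch below is hypothetical and only stubs generation to show how the request shape from step 4 is parsed; the real `templates/hf-af3-nvidia-endpoint/handler.py` is the source of truth for model loading and inference.

```python
import base64
import tempfile

class EndpointHandler:
    """Minimal sketch of the HF Inference Endpoints custom-handler contract.

    The real handler loads the AF3 NVIDIA stack in __init__; here the
    model is a placeholder so the request parsing is the focus.
    """

    def __init__(self, path=""):
        # Real handler: load model weights + stage35 think adapter once,
        # driven by the AF3_NV_* env vars described in step 3.
        self.model = None  # placeholder

    def __call__(self, data):
        inputs = data.get("inputs", {})
        prompt = inputs.get("prompt", "")
        max_new_tokens = int(inputs.get("max_new_tokens", 512))
        think_mode = bool(inputs.get("think_mode", False))

        # Decode the base64 audio payload to a temp file for the model.
        audio_path = None
        if inputs.get("audio_base64"):
            raw = base64.b64decode(inputs["audio_base64"])
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                f.write(raw)
                audio_path = f.name

        # Real handler: run AF3 inference with (prompt, audio_path, ...).
        text = f"[stub] prompt={prompt!r} think={think_mode} max={max_new_tokens}"
        return {"generated_text": text, "audio_received": audio_path is not None}
```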
2) Create Dedicated Endpoint
- Create endpoint from YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO.
- Set task to custom.
- Use a GPU instance.
- Add secret:
HF_TOKEN=hf_xxx
3) Recommended endpoint env vars
AF3_NV_DEFAULT_MODE=think
AF3_NV_LOAD_THINK=1
AF3_NV_LOAD_SINGLE=0
AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3
AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3
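Inside the runtime, these variables would typically be read once at startup. A hypothetical helper (the function name and defaults are illustrative, not the template's actual code) might look like:

```python
import os

def af3_nv_config(env=os.environ):
    """Illustrative reader for the AF3_NV_* endpoint env vars.

    Defaults mirror the recommended values above: think mode on,
    single mode off, code and model both from nvidia/audio-flamingo-3.
    """
    return {
        "mode": env.get("AF3_NV_DEFAULT_MODE", "think"),
        "load_think": env.get("AF3_NV_LOAD_THINK", "1") == "1",
        "load_single": env.get("AF3_NV_LOAD_SINGLE", "0") == "1",
        "code_repo": env.get("AF3_NV_CODE_REPO_ID", "nvidia/audio-flamingo-3"),
        "model_repo": env.get("AF3_NV_MODEL_REPO_ID", "nvidia/audio-flamingo-3"),
    }
```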
4) Request shape from local scripts
Current scripts send:
{
"inputs": {
"prompt": "...",
"audio_base64": "...",
"max_new_tokens": 3200,
"temperature": 0.2
}
}
Optional extra flag for this endpoint:
{
"inputs": {
"think_mode": true
}
}
5) Notes
- First boot is slow because runtime dependencies and model artifacts must be downloaded and loaded.
- Keep at least one warm replica if you want consistent latency.
- This runtime is heavier than the HF-converted audio-flamingo-3-hf endpoint path.