
Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)

This path uses NVIDIA's LLaVA stack plus the stage35 think adapter, which matches the quality profile of:

  • https://huggingface.co/spaces/nvidia/audio-flamingo-3

1) Create endpoint runtime repo

python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO

This pushes:

  • handler.py
  • requirements.txt
  • README.md

from templates/hf-af3-nvidia-endpoint/.
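If you want to see roughly what that step amounts to, it can be approximated with `huggingface_hub` directly. This is a sketch, not the actual implementation of `scripts/hf_clone.py`, and the repo id is the same placeholder used above:

```python
def push_templates(repo_id: str,
                   folder_path: str = "templates/hf-af3-nvidia-endpoint") -> None:
    """Create the Hub repo (if needed) and upload the endpoint template files.

    Sketch only: the real scripts/hf_clone.py may do more (e.g. README edits).
    """
    # Imported inside the function so the sketch loads without the package installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token from HF_TOKEN or a cached login
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    # Uploads handler.py, requirements.txt, and README.md from the template dir.
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
```

Usage would be `push_templates("YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO")` with a write-scoped token configured.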

2) Create Dedicated Endpoint

  1. Create endpoint from YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO.
  2. Set task to custom.
  3. Use a GPU instance.
  4. Add secret:
    • HF_TOKEN=hf_xxx

3) Recommended endpoint env vars

  • AF3_NV_DEFAULT_MODE=think
  • AF3_NV_LOAD_THINK=1
  • AF3_NV_LOAD_SINGLE=0
  • AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3
  • AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3

4) Request shape from local scripts

Current scripts send:

{
  "inputs": {
    "prompt": "...",
    "audio_base64": "...",
    "max_new_tokens": 3200,
    "temperature": 0.2
  }
}

This endpoint also accepts an optional flag alongside the fields above:

{
  "inputs": {
    "think_mode": true
  }
}
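Putting both together, a client can assemble the request body like this (a sketch; the field names match the shapes above, while the helper name and sample bytes are illustrative):

```python
import base64
import json

def build_payload(audio_bytes: bytes, prompt: str, think_mode: bool = True) -> dict:
    """Assemble the request body in the shape this endpoint expects."""
    return {
        "inputs": {
            "prompt": prompt,
            "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
            "max_new_tokens": 3200,
            "temperature": 0.2,
            "think_mode": think_mode,  # optional extra flag for this endpoint
        }
    }

payload = build_payload(b"\x00\x01", "Describe this audio.")
body = json.dumps(payload)  # ready to POST as the JSON body
```

The body is then POSTed to the endpoint URL with an `Authorization: Bearer <HF_TOKEN>` header.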

5) Notes

  • First boot is slow because runtime dependencies and model artifacts must be downloaded and loaded.
  • Keep at least one warm replica if you want consistent latency.
  • This runtime is heavier than the HF-converted audio-flamingo-3-hf endpoint path.
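Because of the slow first boot, clients without a warm replica may want to poll until the endpoint answers before sending real traffic. A generic retry sketch (the `probe` callable is a stand-in for your actual health check, e.g. a tiny POST that returns True on HTTP 200):

```python
import time

def wait_until_ready(probe, attempts: int = 30, delay_s: float = 10.0) -> bool:
    """Call probe() until it returns True or attempts run out.

    probe: zero-arg callable; exceptions are swallowed because connection
    errors are expected while the replica is still cold-booting.
    """
    for i in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # still booting; fall through to the sleep
        if i < attempts - 1:
            time.sleep(delay_s)
    return False
```

For example, `wait_until_ready(lambda: requests.post(url, json=ping, headers=auth).ok, delay_s=15)` would block until the first real request is likely to succeed.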