# Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)
This path uses NVIDIA's `llava` stack plus the `stage35` think adapter, which matches the quality profile of:
- `https://huggingface.co/spaces/nvidia/audio-flamingo-3`
## 1) Create endpoint runtime repo
```bash
python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO
```
This pushes:
- `handler.py`
- `requirements.txt`
- `README.md`
from `templates/hf-af3-nvidia-endpoint/`.
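The pushed `handler.py` follows Hugging Face's custom-handler convention (an `EndpointHandler` class with `__init__` and `__call__`). A minimal sketch of the request-parsing side is below; model loading and inference are omitted, and the stub return value is illustrative, not the template's actual output:

```python
import base64


class EndpointHandler:
    """Skeleton in the HF custom-handler shape.

    Only request parsing is sketched here; a real handler would load
    the AF3 model in __init__ and run inference in __call__. Field
    names match the payload shape the local scripts send (section 4).
    """

    def __init__(self, path: str = ""):
        # `path` points at the endpoint repo checkout; weights would load here.
        self.path = path

    def __call__(self, data: dict) -> dict:
        inputs = data.get("inputs", {})
        prompt = inputs.get("prompt", "")
        audio = b""
        if inputs.get("audio_base64"):
            audio = base64.b64decode(inputs["audio_base64"])
        max_new_tokens = int(inputs.get("max_new_tokens", 3200))
        temperature = float(inputs.get("temperature", 0.2))
        # `think_mode` defaulting to True mirrors AF3_NV_DEFAULT_MODE=think.
        think_mode = bool(inputs.get("think_mode", True))
        # ... real inference would happen here ...
        return {"generated_text": f"(stub) prompt={prompt!r} audio_bytes={len(audio)}"}
```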
## 2) Create Dedicated Endpoint
1. Create endpoint from `YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO`.
2. Set task to `custom`.
3. Use a GPU instance.
4. Add secret:
- `HF_TOKEN=hf_xxx`
## 3) Recommended endpoint env vars
- `AF3_NV_DEFAULT_MODE=think`
- `AF3_NV_LOAD_THINK=1`
- `AF3_NV_LOAD_SINGLE=0`
- `AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3`
- `AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3`
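A handler would typically read these variables with the recommended values as fallbacks. This is a sketch of one plausible way to do it (the function name and return shape are illustrative, not part of the template):

```python
import os


def load_af3_nv_config() -> dict:
    """Read the AF3_NV_* endpoint variables, falling back to the
    recommended defaults from this guide when a variable is unset."""

    def flag(name: str, default: str) -> bool:
        # Treat "1"/"true" (any case) as enabled.
        return os.environ.get(name, default).strip().lower() in ("1", "true")

    return {
        "default_mode": os.environ.get("AF3_NV_DEFAULT_MODE", "think"),
        "load_think": flag("AF3_NV_LOAD_THINK", "1"),
        "load_single": flag("AF3_NV_LOAD_SINGLE", "0"),
        "code_repo_id": os.environ.get("AF3_NV_CODE_REPO_ID", "nvidia/audio-flamingo-3"),
        "model_repo_id": os.environ.get("AF3_NV_MODEL_REPO_ID", "nvidia/audio-flamingo-3"),
    }
```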
## 4) Request shape from local scripts
Current scripts send:
```json
{
  "inputs": {
    "prompt": "...",
    "audio_base64": "...",
    "max_new_tokens": 3200,
    "temperature": 0.2
  }
}
```
Optional extra flag for this endpoint:
```json
{
  "inputs": {
    "think_mode": true
  }
}
```
## 5) Notes
- First boot is slow because runtime dependencies must install and model artifacts must download and load.
- Keep at least one warm replica if you want consistent latency.
- This runtime is heavier than the HF-converted `audio-flamingo-3-hf` endpoint path.