# Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)

This path uses NVIDIA's `llava` stack plus the `stage35` think adapter, which matches the quality profile of:

- `https://huggingface.co/spaces/nvidia/audio-flamingo-3`

## 1) Create endpoint runtime repo

```bash
python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO
```

This pushes the following files from `templates/hf-af3-nvidia-endpoint/`:

- `handler.py`
- `requirements.txt`
- `README.md`

## 2) Create a Dedicated Endpoint

1. Create the endpoint from `YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO`.
2. Set the task to `custom`.
3. Use a GPU instance.
4. Add a secret:
   - `HF_TOKEN=hf_xxx`

## 3) Recommended endpoint env vars

- `AF3_NV_DEFAULT_MODE=think`
- `AF3_NV_LOAD_THINK=1`
- `AF3_NV_LOAD_SINGLE=0`
- `AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3`
- `AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3`

## 4) Request shape from local scripts

The current scripts send:

```json
{
  "inputs": {
    "prompt": "...",
    "audio_base64": "...",
    "max_new_tokens": 3200,
    "temperature": 0.2
  }
}
```

This endpoint also accepts an optional extra flag:

```json
{
  "inputs": {
    "think_mode": true
  }
}
```

## 5) Notes

- First boot is slow because runtime dependencies and model artifacts must load.
- Keep at least one warm replica if you want consistent latency.
- This runtime is heavier than the HF-converted `audio-flamingo-3-hf` endpoint path.
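As a reference for section 4, the request shape above can be assembled and sent with a small stdlib-only client. This is a sketch, not part of the shipped scripts: the `ENDPOINT_URL` environment variable, the `build_payload`/`query` helper names, and the `sample.wav` path are assumptions for illustration; the JSON body and `think_mode` flag follow the shapes documented above.

```python
# Hypothetical client sketch for the request shape in section 4.
# Assumes ENDPOINT_URL and HF_TOKEN are set in the environment.
import base64
import json
import os
import urllib.request


def build_payload(prompt: str, audio_path: str, think_mode: bool = True) -> dict:
    """Base64-encode a local audio file and build the documented request body."""
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "inputs": {
            "prompt": prompt,
            "audio_base64": audio_b64,
            "max_new_tokens": 3200,
            "temperature": 0.2,
            "think_mode": think_mode,  # optional flag accepted by this endpoint
        }
    }


def query(payload: dict) -> dict:
    """POST the payload to the endpoint and return the decoded JSON response."""
    req = urllib.request.Request(
        os.environ["ENDPOINT_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    # Generous timeout: a cold replica may still be loading model artifacts.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Example usage with a hypothetical local file.
    print(query(build_payload("Describe this audio.", "sample.wav")))
```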