

	# Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)

	This path uses NVIDIA's `llava` stack plus the `stage35` think adapter, which matches the quality profile of the NVIDIA Space:
	- `https://huggingface.co/spaces/nvidia/audio-flamingo-3`

	## 1) Create endpoint runtime repo

	```bash
	python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO
	```

	This pushes:
	- `handler.py`
	- `requirements.txt`
	- `README.md`

	from `templates/hf-af3-nvidia-endpoint/`.
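If the helper script is not at hand, the same push can be approximated with `huggingface_hub` directly. This is a minimal sketch, not the actual implementation of `scripts/hf_clone.py`: the function names `missing_template_files` and `push_template` are hypothetical, and it assumes the three files listed above are the whole template.

```python
from pathlib import Path

# Files this guide says are pushed from the template folder.
EXPECTED_FILES = ("handler.py", "requirements.txt", "README.md")


def missing_template_files(template_dir):
    """Return the expected template files that are absent from template_dir."""
    root = Path(template_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def push_template(template_dir, repo_id):
    """Create the model repo if needed and upload the template files.

    Hypothetical stand-in for scripts/hf_clone.py. Requires `huggingface_hub`
    and a token with write access (for example via `huggingface-cli login`).
    """
    # Imported lazily so the pre-flight check above has no third-party deps.
    from huggingface_hub import HfApi

    missing = missing_template_files(template_dir)
    if missing:
        raise FileNotFoundError(f"template is missing: {missing}")

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=str(template_dir),
        repo_id=repo_id,
        repo_type="model",
        allow_patterns=list(EXPECTED_FILES),
    )
```

Calling `push_template("templates/hf-af3-nvidia-endpoint", "YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO")` would then mirror the command above, under those assumptions.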

	## 2) Create Dedicated Endpoint

	1. Create endpoint from `YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO`.
	2. Set task to `custom`.
	3. Use a GPU instance.
	4. Add secret:
	   - `HF_TOKEN=hf_xxx`

	## 3) Recommended endpoint env vars

	- `AF3_NV_DEFAULT_MODE=think`
	- `AF3_NV_LOAD_THINK=1`
	- `AF3_NV_LOAD_SINGLE=0`
	- `AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3`
	- `AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3`

	## 4) Request shape from local scripts

	Current scripts send:

	```json
	{
	  "inputs": {
	    "prompt": "...",
	    "audio_base64": "...",
	    "max_new_tokens": 3200,
	    "temperature": 0.2
	  }
	}
	```
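To make the request shape concrete, here is a minimal sketch of building that body from raw audio bytes. It assumes `audio_base64` is plain standard base64 of the file contents with no data-URI prefix, which this guide does not state; `build_payload` is a hypothetical helper, not a function from the local scripts.

```python
import base64


def build_payload(audio_bytes, prompt, max_new_tokens=3200, temperature=0.2):
    """Return the request body in the shape shown above."""
    return {
        "inputs": {
            "prompt": prompt,
            # Assumption: plain base64 of the audio file bytes.
            "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        }
    }
```

In practice `audio_bytes` would come from something like `Path("clip.wav").read_bytes()`.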

	Optional extra flag for this endpoint:

	```json
	{
	  "inputs": {
	    "think_mode": true
	  }
	}
	```
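A sketch of sending a request with the flag set, using only the standard library. It assumes `think_mode` sits inside the same `inputs` object as the other fields, that the endpoint accepts a bearer token, and that it replies with JSON; `with_think_mode` and `post_json` are hypothetical names.

```python
import json
import urllib.request


def with_think_mode(payload, think_mode=True):
    """Return a copy of payload with think_mode set inside "inputs"."""
    inputs = dict(payload.get("inputs", {}))
    inputs["think_mode"] = bool(think_mode)
    return {**payload, "inputs": inputs}


def post_json(endpoint_url, hf_token, payload, timeout=600):
    """POST the payload to the endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {hf_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The long default timeout is a guess to leave room for `max_new_tokens` in the thousands; tune it to what you observe.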

	## 5) Notes

	- First boot is slow because the runtime dependencies and model artifacts must load first.
	- Keep at least one warm replica if you want consistent latency.
	- This runtime is heavier than the HF-converted `audio-flamingo-3-hf` endpoint path.
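Because of the slow first boot, clients may want to retry while the endpoint warms up. The following is a minimal sketch of exponential backoff, assuming the endpoint answers with HTTP 503 while it is still starting; that status code, the helper name `call_with_retry`, and the delay values are all assumptions rather than documented behavior of this runtime.

```python
import time
import urllib.error


def call_with_retry(send, attempts=6, base_delay=5.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff on HTTP 503.

    `send` is any zero-argument callable that performs the request and
    raises urllib.error.HTTPError on a non-2xx response.
    """
    for attempt in range(attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Assumption: 503 means "still booting"; anything else is a real error.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults this waits 5, 10, 20, 40, and 80 seconds between tries before giving up.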