# Deploy AF3 NVIDIA-Stack Endpoint (Space-Parity Runtime)

This path uses NVIDIA's `llava` stack plus the `stage35` think adapter, which matches the quality profile of:
- `https://huggingface.co/spaces/nvidia/audio-flamingo-3`

## 1) Create endpoint runtime repo

```bash
python scripts/hf_clone.py af3-nvidia-endpoint --repo-id YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO
```

This pushes:
- `handler.py`
- `requirements.txt`
- `README.md`

from `templates/hf-af3-nvidia-endpoint/`.

## 2) Create Dedicated Endpoint

1. Create endpoint from `YOUR_USERNAME/YOUR_AF3_NVIDIA_ENDPOINT_REPO`.
2. Set task to `custom`.
3. Use a GPU instance.
4. Add secret:
   - `HF_TOKEN=hf_xxx`

## 3) Recommended endpoint env vars

- `AF3_NV_DEFAULT_MODE=think`
- `AF3_NV_LOAD_THINK=1`
- `AF3_NV_LOAD_SINGLE=0`
- `AF3_NV_CODE_REPO_ID=nvidia/audio-flamingo-3`
- `AF3_NV_MODEL_REPO_ID=nvidia/audio-flamingo-3`
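
As a rough sketch of how a custom `handler.py` could consume these variables (the function name and defaults here are illustrative assumptions, not the handler's actual code):

```python
import os


def load_af3_config() -> dict:
    """Read the recommended AF3 endpoint settings from the environment.

    Variable names mirror the list above; the fallback defaults are
    guesses for illustration, not the handler's real defaults.
    """
    return {
        # "think" vs. "single" inference mode selected at request time
        "default_mode": os.environ.get("AF3_NV_DEFAULT_MODE", "think"),
        # flags arrive as "0"/"1" strings, so compare explicitly
        "load_think": os.environ.get("AF3_NV_LOAD_THINK", "1") == "1",
        "load_single": os.environ.get("AF3_NV_LOAD_SINGLE", "0") == "1",
        # repos holding the runtime code and the model weights
        "code_repo_id": os.environ.get(
            "AF3_NV_CODE_REPO_ID", "nvidia/audio-flamingo-3"
        ),
        "model_repo_id": os.environ.get(
            "AF3_NV_MODEL_REPO_ID", "nvidia/audio-flamingo-3"
        ),
    }
```

Keeping the flags as `"0"`/`"1"` strings in the endpoint UI and converting once at load time avoids truthiness surprises (e.g. `bool("0")` is `True` in Python).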

## 4) Request shape from local scripts

Current scripts send:

```json
{
  "inputs": {
    "prompt": "...",
    "audio_base64": "...",
    "max_new_tokens": 3200,
    "temperature": 0.2
  }
}
```

Optional extra flag for this endpoint:

```json
{
  "inputs": {
    "think_mode": true
  }
}
```
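
The two shapes above can be assembled client-side before posting. A minimal sketch, where `build_payload` is a hypothetical helper (not part of the repo's scripts) and the optional `think_mode` flag is only attached when explicitly set:

```python
import base64
from typing import Optional


def build_payload(
    prompt: str,
    audio_bytes: bytes,
    max_new_tokens: int = 3200,
    temperature: float = 0.2,
    think_mode: Optional[bool] = None,
) -> dict:
    """Build the request body the endpoint expects.

    Defaults mirror the example request above; think_mode is omitted
    from the payload unless the caller sets it.
    """
    inputs = {
        "prompt": prompt,
        # raw audio bytes -> base64 text, as in the "audio_base64" field
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    if think_mode is not None:
        inputs["think_mode"] = think_mode
    return {"inputs": inputs}
```

You would then POST this dict as JSON to your endpoint URL (with an `Authorization: Bearer hf_xxx` header), e.g. via `requests` or `curl`; the response schema depends on what `handler.py` returns.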

## 5) Notes

- First boot is slow because runtime dependencies and model artifacts must be downloaded and loaded.
- Keep at least one warm replica if you want consistent latency.
- This runtime is heavier than the HF-converted `audio-flamingo-3-hf` endpoint path.