---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---

# NanoCodec Decoder - ONNX

ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.

This model provides roughly **2.6x faster inference** than the PyTorch version for KaniTTS and similar TTS systems.

## Model Details

- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (Opset 14)
- **Input:** Token indices [batch, 4, num_frames]
- **Output:** Audio waveform [batch, samples] @ 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of full model)

## Performance

| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |

**Real-Time Factor (RTF):** ~0.44 on GPU — an RTF below 1 means audio is generated faster than it plays back.
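The RTF figure follows directly from the table: at 12.5 fps, each codec frame corresponds to 80 ms of audio, so on GPU:

```python
# Real-Time Factor: time spent generating audio divided by the audio's duration.
frame_decode_ms = 35   # ONNX + GPU decode time per frame (from the table above)
frame_audio_ms = 80    # each codec frame yields 80 ms of audio (12.5 fps)

rtf = frame_decode_ms / frame_audio_ms
print(f"RTF = {rtf:.2f}")  # RTF = 0.44 -> faster than real time (RTF < 1)
```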

## Quick Start

### Installation

```bash
pip install onnxruntime-gpu numpy
```

For CPU-only:
```bash
pip install onnxruntime numpy
```

### Usage

```python
import numpy as np
import onnxruntime as ort

# Load model
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Prepare input
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)

audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples
```

### Integration with KaniTTS

```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)

# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns int16 numpy array
```

## Model Architecture

The decoder consists of two stages:

1. **Dequantization (FSQ):** Converts token indices to latent representation
   - Input: [batch, 4, frames] → Output: [batch, 16, frames]

2. **Audio Decoder (HiFiGAN):** Generates audio from latents
   - Input: [batch, 16, frames] → Output: [batch, samples]
   - Upsampling factor: ~1764x (80ms per frame at 22050 Hz)
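The per-frame sample count behind the ~1764x upsampling figure is simple arithmetic, and it also explains the 17,640-sample output in the Quick Start example:

```python
sample_rate = 22050        # output sample rate (Hz)
frame_duration_s = 0.080   # 80 ms of audio per codec frame (12.5 fps)

samples_per_frame = int(sample_rate * frame_duration_s)
print(samples_per_frame)       # 1764 -> the upsampling factor per frame
print(10 * samples_per_frame)  # 17640 -> output length for the 10-frame Quick Start input
```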

## Export Details

- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding

## Use Cases

- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Sub-realtime performance on GPU
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for PyTorch decoder
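For the streaming use case, a frame-by-frame decode loop might look like the following sketch. The helper names (`frame_to_inputs`, `stream_decode`, `play_chunk`) are illustrative, not part of any shipped API; `session` is an `onnxruntime.InferenceSession` loaded as in the Quick Start, and `play_chunk` stands in for whatever audio sink you use:

```python
import numpy as np

def frame_to_inputs(codes):
    """Pack one frame of 4 FSQ codebook indices into the decoder's input feed."""
    tokens = np.asarray(codes, dtype=np.int64).reshape(1, 4, 1)  # [batch, codebooks, frames]
    return {"tokens": tokens, "tokens_len": np.array([1], dtype=np.int64)}

def stream_decode(session, frames, play_chunk):
    """Decode one codec frame at a time and hand each audio chunk to a sink.

    session: an onnxruntime.InferenceSession for nano_codec_decoder.onnx.
    frames: iterable of length-4 token lists (one index per codebook).
    play_chunk: callback receiving a float32 numpy array of samples.
    """
    for codes in frames:
        audio, audio_len = session.run(None, frame_to_inputs(codes))
        play_chunk(audio[0, : int(audio_len[0])])  # trim padding, emit ~80 ms chunk
```

Decoding one frame per call maximizes responsiveness; batching several frames per `run` trades latency for throughput.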

## Requirements

### GPU (Recommended)
- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`

### CPU
- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`
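If you want one code path that works with either install, you can select providers with a fallback. `pick_providers` is a hypothetical helper; feed it the real list from `onnxruntime.get_available_providers()`:

```python
def pick_providers(available):
    """Prefer CUDA when present, otherwise fall back to CPU.

    available: list of provider names, e.g. from onnxruntime.get_available_providers().
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]  # CPU provider is always a safe default
```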

## Inputs

- **tokens** (int64): Codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)

- **tokens_len** (int64): Number of frames
  - Shape: `[batch_size]`
  - Value: Number of frames in the sequence

## Outputs

- **audio** (float32): Generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz

- **audio_len** (int64): Audio length
  - Shape: `[batch_size]`
  - Value: Number of audio samples
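Since the output is mono float32 in `[-1.0, 1.0]` at 22050 Hz, saving it to disk needs only the standard-library `wave` module. A minimal sketch (`save_wav` is an illustrative helper, not part of the model's API):

```python
import wave

import numpy as np

def save_wav(path, audio, sample_rate=22050):
    """Write mono float32 audio in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = np.clip(audio, -1.0, 1.0)            # guard against out-of-range samples
    pcm = (pcm * 32767.0).astype(np.int16)     # float [-1, 1] -> int16 PCM
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```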

## Accuracy

Compared to PyTorch reference implementation:
- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000 (perfect)
- **Relative Error:** 0.0006%

Audio quality is virtually identical to the PyTorch version.
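Comparable metrics can be computed along these lines, given the ONNX and PyTorch waveforms as equal-length float arrays (the exact metric definitions used for the numbers above were not published, so this is an assumed formulation; `compare` is an illustrative helper):

```python
import numpy as np

def compare(reference, candidate):
    """Error metrics between two equal-length waveforms in [-1.0, 1.0]."""
    mae = float(np.mean(np.abs(reference - candidate)))       # mean absolute error
    corr = float(np.corrcoef(reference, candidate)[0, 1])     # Pearson correlation
    rel = mae / (float(np.mean(np.abs(reference))) + 1e-12)   # MAE relative to signal level
    return mae, corr, rel
```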

## Limitations

- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support

## License

Apache 2.0 (same as source model)

## Links

- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)

## Acknowledgments

- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system