---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---
# NanoCodec Decoder - ONNX
ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.
This model provides roughly **2.6x faster inference** compared to the PyTorch version for KaniTTS and similar TTS systems (see the benchmarks below).
## Model Details
- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (Opset 14)
- **Input:** Token indices [batch, 4, num_frames]
- **Output:** Audio waveform [batch, samples] @ 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of full model)
## Performance
| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |
**Real-Time Factor (RTF):** 0.44x on GPU (generates audio faster than playback!)
## Quick Start
### Installation
```bash
pip install onnxruntime-gpu numpy
```
For CPU-only:
```bash
pip install onnxruntime numpy
```
### Usage
```python
import numpy as np
import onnxruntime as ort
# Load model
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Prepare input: token indices in the FSQ codebook range [0, 499]
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len},
)
audio, audio_len = outputs
print(f"Generated audio: {audio.shape}") # [1, 17640] samples
```
### Integration with KaniTTS
```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized
# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda",
)
# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes) # Returns int16 numpy array
```
## Model Architecture
The decoder consists of two stages:
1. **Dequantization (FSQ):** Converts token indices to latent representation
- Input: [batch, 4, frames] → Output: [batch, 16, frames]
2. **Audio Decoder (HiFiGAN):** Generates audio from latents
- Input: [batch, 16, frames] → Output: [batch, samples]
- Upsampling factor: ~1764x (80ms per frame at 22050 Hz)
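The frame-to-sample arithmetic above can be checked directly: at 12.5 frames per second and a 22050 Hz output rate, each frame expands to 22050 / 12.5 = 1764 samples, i.e. 80 ms of audio. A quick sketch:

```python
# Frame/sample arithmetic for a 22050 Hz, 12.5 fps codec.
SAMPLE_RATE = 22050
FRAME_RATE = 12.5  # frames per second

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)  # upsampling factor
frame_duration_ms = 1000 * samples_per_frame / SAMPLE_RATE

print(samples_per_frame)       # 1764
print(frame_duration_ms)       # 80.0
print(10 * samples_per_frame)  # 17640 -- matches the Quick Start example
```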
## Export Details
- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding
## Use Cases
- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Sub-realtime performance on GPU
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for PyTorch decoder
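For streaming use, one simple approach is to slice an incoming `[1, 4, T]` token tensor into per-frame `[1, 4, 1]` chunks and feed them to the session one at a time. A minimal sketch of the slicing step (the helper name is mine, and the ONNX session call is omitted):

```python
import numpy as np

def frame_chunks(tokens: np.ndarray):
    """Yield [batch, 4, 1] slices from a [batch, 4, num_frames] token tensor."""
    _, _, num_frames = tokens.shape
    for t in range(num_frames):
        yield tokens[:, :, t:t + 1]  # slice keeps the frame axis

tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
chunks = list(frame_chunks(tokens))
print(len(chunks), chunks[0].shape)  # 10 (1, 4, 1)
```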
## Requirements
### GPU (Recommended)
- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`
### CPU
- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`
## Inputs
- **tokens** (int64): Codec token indices
- Shape: `[batch_size, 4, num_frames]`
- Range: `[0, 499]` (FSQ codebook indices)
- **tokens_len** (int64): Number of frames
- Shape: `[batch_size]`
- Value: Number of frames in the sequence
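It can help to validate inputs against this contract before calling the session. A small sketch of such a check (the helper name is mine, not part of the model API):

```python
import numpy as np

def validate_inputs(tokens: np.ndarray, tokens_len: np.ndarray) -> None:
    # Checks mirror the input spec above: int64 dtypes, [batch, 4, frames]
    # shape, FSQ indices in [0, 499], one length entry per batch item.
    assert tokens.dtype == np.int64 and tokens_len.dtype == np.int64
    assert tokens.ndim == 3 and tokens.shape[1] == 4, "expected [batch, 4, num_frames]"
    assert tokens.min() >= 0 and tokens.max() <= 499, "FSQ indices must be in [0, 499]"
    assert tokens_len.shape == (tokens.shape[0],), "one length per batch item"

tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
validate_inputs(tokens, np.array([10], dtype=np.int64))  # raises on bad input
```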
## Outputs
- **audio** (float32): Generated audio waveform
- Shape: `[batch_size, num_samples]`
- Range: `[-1.0, 1.0]`
- Sample rate: 22050 Hz
- **audio_len** (int64): Audio length
- Shape: `[batch_size]`
- Value: Number of audio samples
## Accuracy
Compared to PyTorch reference implementation:
- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000 (perfect)
- **Relative Error:** 0.0006%
Audio quality is virtually identical to the PyTorch version.
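These metrics are straightforward to reproduce for any pair of waveforms. A sketch comparing a reference array with a slightly perturbed copy (synthetic data, not the actual model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.standard_normal(22050).astype(np.float32)  # stand-in for PyTorch output
candidate = reference + 1e-4 * rng.standard_normal(22050).astype(np.float32)

# Mean absolute error and Pearson correlation, as reported above.
mae = float(np.mean(np.abs(candidate - reference)))
corr = float(np.corrcoef(candidate, reference)[0, 1])

print(f"MAE: {mae:.6f}, correlation: {corr:.6f}")
```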
## Limitations
- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support
## License
Apache 2.0 (same as source model)
## Links
- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)
## Acknowledgments
- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system