---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---

# NanoCodec Decoder - ONNX

ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.

This model provides roughly **2.6x faster inference** on GPU compared to the PyTorch version for KaniTTS and similar TTS systems.

## Model Details

- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (opset 14)
- **Input:** Token indices `[batch, 4, num_frames]`
- **Output:** Audio waveform `[batch, samples]` at 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of the full model)

## Performance

| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |

**Real-Time Factor (RTF):** 0.44x on GPU, i.e. audio is generated faster than it plays back.
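
The RTF figure follows directly from the table above; a quick sanity check using the table's numbers (no measurement is performed here):

```python
# Derive the RTF above from the codec's 12.5 fps frame rate
# and the measured ~35 ms GPU decode time from the table.
decode_ms_per_frame = 35.0        # ONNX + GPU, per the table above
audio_ms_per_frame = 1000 / 12.5  # 12.5 fps codec -> 80 ms of audio per frame

rtf = decode_ms_per_frame / audio_ms_per_frame
print(f"RTF = {rtf:.2f}")  # RTF = 0.44 (below 1.0 means faster than playback)
```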

## Quick Start

### Installation

```bash
pip install onnxruntime-gpu numpy
```

For CPU-only:

```bash
pip install onnxruntime numpy
```

### Usage

```python
import numpy as np
import onnxruntime as ort

# Load the model, preferring CUDA and falling back to CPU
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Prepare input: one index per codebook, each in [0, 499]
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)

audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples (10 frames x 1764)
```
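
The card does not prescribe how to persist the decoder output; one common approach, sketched here with only NumPy and the standard-library `wave` module, is to convert the float32 waveform in `[-1.0, 1.0]` to 16-bit PCM:

```python
import wave

import numpy as np

def save_wav(path, audio, sample_rate=22050):
    """Write a mono float32 waveform in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper, not shipped with the model.
    """
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono, matching the decoder output
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)  # 22050 Hz, fixed for this codec
        f.writeframes(pcm.tobytes())

# Example: one second of silence at the codec's sample rate
save_wav("output.wav", np.zeros(22050, dtype=np.float32))
```

For the session output above, `audio` has shape `[batch, samples]`, so pass `audio[0]` to write the first batch item.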

### Integration with KaniTTS

```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize the decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)

# Decode a single frame (4 codec tokens, one per codebook)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns an int16 numpy array
```

## Model Architecture

The decoder consists of two stages:

1. **Dequantization (FSQ):** Converts token indices to a latent representation
   - Input: `[batch, 4, frames]` → Output: `[batch, 16, frames]`

2. **Audio Decoder (HiFi-GAN):** Generates audio from the latents
   - Input: `[batch, 16, frames]` → Output: `[batch, samples]`
   - Upsampling factor: 1764x (80 ms per frame at 22050 Hz)

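The upsampling factor is exactly the ratio of sample rate to codec frame rate, which also explains the output size in the Quick Start example:

```python
# Derive the per-frame figures from the codec configuration.
sample_rate = 22050  # Hz, fixed output rate
frame_rate = 12.5    # codec frames per second

samples_per_frame = sample_rate / frame_rate  # 1764.0 samples per frame
ms_per_frame = 1000 / frame_rate              # 80.0 ms of audio per frame
total_samples = 10 * samples_per_frame        # 17640.0, as in the Quick Start example
print(samples_per_frame, ms_per_frame, total_samples)
```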

## Export Details

- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding

## Use Cases

- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Faster-than-real-time performance on GPU (RTF 0.44)
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for the PyTorch decoder

## Requirements

### GPU (Recommended)

- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`

### CPU

- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`

## Inputs

- **tokens** (int64): Codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)

- **tokens_len** (int64): Number of valid frames per batch item
  - Shape: `[batch_size]`

## Outputs

- **audio** (float32): Generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz

- **audio_len** (int64): Number of valid audio samples per batch item
  - Shape: `[batch_size]`

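A small pre-flight check against this I/O contract can catch shape and range mistakes before they reach the ONNX session. The helper below is hypothetical (not shipped with the model), written against the shapes and ranges documented above:

```python
import numpy as np

def validate_tokens(tokens, tokens_len):
    """Check decoder inputs against the documented I/O contract.

    Hypothetical pre-flight helper based on this card, not part of the model.
    """
    assert tokens.dtype == np.int64 and tokens_len.dtype == np.int64, "inputs must be int64"
    assert tokens.ndim == 3 and tokens.shape[1] == 4, "expected [batch, 4, num_frames]"
    assert tokens_len.shape == (tokens.shape[0],), "one length per batch item"
    assert tokens.min() >= 0 and tokens.max() <= 499, "FSQ indices must be in [0, 499]"
    assert (tokens_len <= tokens.shape[2]).all(), "lengths cannot exceed num_frames"

# Valid example: passes silently
validate_tokens(
    np.random.randint(0, 500, (1, 4, 10), dtype=np.int64),
    np.array([10], dtype=np.int64),
)
```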

## Accuracy

Compared to the PyTorch reference implementation:

- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000
- **Relative Error:** 0.0006%

Audio quality is virtually identical to the PyTorch version.
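
Metrics of this kind can be computed as follows (a minimal sketch with NumPy, not the exact evaluation script used for the numbers above):

```python
import numpy as np

def compare_waveforms(ref, test):
    """Return (MAE, Pearson correlation, relative error in percent)
    between two waveforms. Hypothetical helper for illustration."""
    mae = float(np.mean(np.abs(ref - test)))
    corr = float(np.corrcoef(ref, test)[0, 1])
    rel_err = float(np.linalg.norm(ref - test) / np.linalg.norm(ref) * 100)
    return mae, corr, rel_err

# Identical waveforms give the degenerate best case: MAE 0, correlation 1
wave_a = np.sin(np.linspace(0, 2 * np.pi, 1000))
mae, corr, rel_err = compare_waveforms(wave_a, wave_a.copy())
```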

## Limitations

- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support

## License

Apache 2.0 (same as the source model)

## Links

- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)

## Acknowledgments

- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system