---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---
# NanoCodec Decoder - ONNX
ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.
This model provides roughly **2.6x faster inference** on GPU than the PyTorch version for KaniTTS and similar TTS systems.
## Model Details
- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (Opset 14)
- **Input:** Token indices [batch, 4, num_frames]
- **Output:** Audio waveform [batch, samples] @ 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of full model)
## Performance
| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | ~1.2-1.5x faster |
**Real-Time Factor (RTF):** ~0.44 on GPU (about 35 ms to decode an 80 ms frame, so audio is generated faster than playback!)
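The RTF figure can be reproduced from the table, assuming the listed times are per 80 ms frame (12.5 frames per second, as in the source model's name). The helper name below is illustrative only; this is a sketch of the arithmetic, not the benchmark script used for this card.

```python
FRAME_RATE_HZ = 12.5  # frames per second, from the source model name


def real_time_factor(decode_ms_per_frame: float,
                     frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Decode time divided by the audio duration that one frame represents."""
    frame_duration_ms = 1000.0 / frame_rate_hz  # 80 ms at 12.5 fps
    return decode_ms_per_frame / frame_duration_ms


# Decode times taken from the performance table above.
for name, ms in [("PyTorch + GPU", 92.0), ("ONNX + GPU", 35.0)]:
    print(f"{name}: RTF = {real_time_factor(ms):.2f}")
```

An RTF below 1.0 means decoding finishes before the decoded audio would finish playing.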
## Quick Start
### Installation
```bash
pip install onnxruntime-gpu numpy
```
For CPU-only:
```bash
pip install onnxruntime numpy
```
### Usage
```python
import numpy as np
import onnxruntime as ort

# Load model (providers are tried in priority order)
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Prepare input
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len},
)
audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples
```
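To listen to the result, the float32 output can be written as a 16-bit mono WAV file. The sketch below uses only NumPy and the standard-library `wave` module; `save_wav` is a hypothetical helper, not part of this repository, and the demo uses a synthetic tone so that it runs without the model.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, audio, sample_rate=22050):
    """Write a mono float waveform in [-1, 1] as 16-bit PCM; return the sample count."""
    mono = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Clip before scaling so out-of-range values cannot overflow int16.
    pcm = (np.clip(mono, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return pcm.size


# Demo with a one-second 440 Hz tone standing in for decoder output.
t = np.arange(22050) / 22050.0
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
out_path = os.path.join(tempfile.gettempdir(), "nanocodec_demo.wav")
print(save_wav(out_path, tone), "samples written to", out_path)
```

With real decoder output, something like `save_wav("output.wav", audio[0, :audio_len[0]])` would write the first item of the batch.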
### Integration with KaniTTS
```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda",
)

# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns int16 numpy array
```
## Model Architecture
The decoder consists of two stages:
1. **Dequantization (FSQ):** Converts token indices to a latent representation
   - Input: [batch, 4, frames] → Output: [batch, 16, frames]
2. **Audio Decoder (HiFiGAN):** Generates audio from latents
   - Input: [batch, 16, frames] → Output: [batch, samples]
   - Upsampling factor: 1764x (80 ms per frame at 22050 Hz)
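The upsampling factor follows from the sample rate and the frame rate: 22050 Hz / 12.5 fps = 1764 samples per frame. A small sketch of that arithmetic, with illustrative helper names and the assumption that the output length is exactly frames × 1764 (consistent with the 10-frame example in Usage):

```python
SAMPLE_RATE_HZ = 22050
FRAME_RATE_HZ = 12.5  # from the source model name

# 22050 / 12.5 is exactly 1764 samples per frame, i.e. 80 ms of audio.
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def expected_samples(num_frames: int) -> int:
    """Number of audio samples expected for a given number of codec frames."""
    return num_frames * SAMPLES_PER_FRAME


def duration_seconds(num_frames: int) -> float:
    """Audio duration represented by a given number of codec frames."""
    return num_frames / FRAME_RATE_HZ


print(expected_samples(10), "samples for 10 frames")  # 17640
print(duration_seconds(10), "seconds")
```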
## Export Details
- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding
## Use Cases
- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Faster-than-real-time decoding on GPU (RTF ~0.44)
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for PyTorch decoder
## Requirements
### GPU (Recommended)
- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`
### CPU
- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`
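To confirm which execution providers the installed ONNX Runtime build exposes, `onnxruntime.get_available_providers()` can be queried before creating a session. The sketch below wraps that in a hypothetical `pick_providers` helper that keeps the GPU-first order used in the Usage example; the helper itself is not part of this repository.

```python
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")
    return chosen


try:
    import onnxruntime as ort

    # Lists the providers compiled into the installed onnxruntime package.
    print(pick_providers(ort.get_available_providers()))
except ImportError:
    print("onnxruntime is not installed")
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.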
## Inputs
- **tokens** (int64): Codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)
- **tokens_len** (int64): Number of frames
  - Shape: `[batch_size]`
  - Value: Number of frames in the sequence
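Checking inputs against this contract before calling `session.run` can make failures easier to diagnose than a runtime error from inside the graph. The function below is a minimal sketch: `check_inputs` is a hypothetical helper, and the default `codebook_size=500` simply mirrors the `[0, 499]` range stated above, so adjust it if your tokens use a different range.

```python
import numpy as np


def check_inputs(tokens, tokens_len, num_codebooks=4, codebook_size=500):
    """Raise if the arrays do not match the documented input shapes, dtypes, and range."""
    if tokens.dtype != np.int64 or tokens_len.dtype != np.int64:
        raise TypeError("tokens and tokens_len must be int64")
    if tokens.ndim != 3 or tokens.shape[1] != num_codebooks:
        raise ValueError(f"tokens must have shape [batch, {num_codebooks}, frames]")
    if tokens_len.shape != (tokens.shape[0],):
        raise ValueError("tokens_len must have shape [batch]")
    if tokens.size and (tokens.min() < 0 or tokens.max() >= codebook_size):
        raise ValueError(f"token indices must lie in [0, {codebook_size - 1}]")
    if np.any(tokens_len < 0) or np.any(tokens_len > tokens.shape[2]):
        raise ValueError("tokens_len must be between 0 and num_frames")


tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
tokens_len = np.array([10], dtype=np.int64)
check_inputs(tokens, tokens_len)
print("inputs look valid")
```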
## Outputs
- **audio** (float32): Generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz
- **audio_len** (int64): Audio length
  - Shape: `[batch_size]`
  - Value: Number of audio samples
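When a batch holds sequences of different lengths, `audio_len` can be used to cut each row down to its own length. A short sketch, assuming the rows of `audio` are padded at the end up to the longest item; `trim_batch` is an illustrative name, not an API of this model.

```python
import numpy as np


def trim_batch(audio, audio_len):
    """Split a [batch, samples] array into per-item waveforms cut to audio_len."""
    return [row[: int(n)] for row, n in zip(audio, audio_len)]


# Stand-in for decoder output: two items, the second one shorter.
audio = np.zeros((2, 17640), dtype=np.float32)
audio_len = np.array([17640, 8820], dtype=np.int64)
for i, clip in enumerate(trim_batch(audio, audio_len)):
    print(f"item {i}: {clip.shape[0]} samples")
```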
## Accuracy
Compared to the PyTorch reference implementation:
- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000 (perfect)
- **Relative Error:** 0.0006%
Audio quality is virtually identical to the PyTorch version.
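A comparison of this kind can be sketched as follows. This card does not state exactly how its metrics were computed, so the definitions here (sample-wise mean absolute error and Pearson correlation over the overlapping samples) are assumptions, and `compare_waveforms` is a hypothetical helper. The demo uses synthetic signals so that it runs without either model.

```python
import numpy as np


def compare_waveforms(reference, candidate):
    """Return (mean absolute error, Pearson correlation) for two waveforms."""
    ref = np.asarray(reference, dtype=np.float64).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float64).reshape(-1)
    n = min(ref.size, cand.size)  # compare only the overlapping samples
    ref, cand = ref[:n], cand[:n]
    mae = float(np.mean(np.abs(ref - cand)))
    corr = float(np.corrcoef(ref, cand)[0, 1])
    return mae, corr


# Synthetic stand-ins: a tone and a slightly perturbed copy of it.
rng = np.random.default_rng(0)
reference = np.sin(2 * np.pi * 440.0 * np.arange(22050) / 22050.0)
candidate = reference + rng.normal(0.0, 1e-3, reference.shape)
mae, corr = compare_waveforms(reference, candidate)
print(f"MAE = {mae:.6f}, correlation = {corr:.6f}")
```

In practice, `reference` would be the PyTorch decoder output and `candidate` the ONNX output for the same tokens.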
## Limitations
- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support
## License
Apache 2.0 (same as source model)
## Links
- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)
## Acknowledgments
- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system