Prasanna05
/

nano-codec-decoder-onnx

+---
+language:
+- en
+license: apache-2.0
+tags:
+- audio
+- text-to-speech
+- tts
+- onnx
+- decoder
+- codec
+pipeline_tag: text-to-speech
+---
+# NanoCodec Decoder - ONNX
+ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.
+This model provides **roughly 2.6x faster GPU inference** than the PyTorch version for KaniTTS and similar TTS systems.
+## Model Details
+- **Model Type:** Audio Codec Decoder
+- **Format:** ONNX (Opset 14)
+- **Input:** Token indices [batch, 4, num_frames]
+- **Output:** Audio waveform [batch, samples] @ 22050 Hz
+- **Size:** 122 MB
+- **Parameters:** ~31.5M (decoder only, 15.8% of full model)
+## Performance
+| Configuration | Decode Time/Frame | Speedup |
+|---------------|-------------------|---------|
+| PyTorch + GPU | ~92 ms | Baseline |
+| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
+| ONNX + CPU | ~60-80 ms | ~1.2-1.5x faster |
+**Real-Time Factor (RTF):** 0.44 on GPU (about 35 ms to decode each 80 ms frame, so audio is generated faster than playback)
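The RTF figure can be reproduced from the table by dividing decode time by the duration of audio produced. The following is a minimal sketch, assuming the 80 ms frame duration given under Model Architecture; `real_time_factor` is an illustrative helper and not part of this repository.

```python
FRAME_DURATION_MS = 80.0  # one codec frame at 12.5 fps (assumption taken from this card)


def real_time_factor(decode_ms_per_frame: float, frame_ms: float = FRAME_DURATION_MS) -> float:
    """Return decode time divided by audio duration.

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    return decode_ms_per_frame / frame_ms


if __name__ == "__main__":
    # Decode times are the approximate figures from the table above.
    for name, ms in [("PyTorch + GPU", 92.0), ("ONNX + GPU", 35.0)]:
        print(f"{name}: RTF = {real_time_factor(ms):.2f}")
```

An RTF above 1.0, as the PyTorch row yields here, means single-frame decoding cannot keep up with playback at that latency.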
+## Quick Start
+### Installation
+```bash
+pip install onnxruntime-gpu numpy
+```
+For CPU-only:
+```bash
+pip install onnxruntime numpy
+```
+### Usage
+```python
+import numpy as np
+import onnxruntime as ort
+# Load model
+session = ort.InferenceSession(
+    "nano_codec_decoder.onnx",
+    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
+)
+# Prepare input
+tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
+tokens_len = np.array([10], dtype=np.int64)
+# Run inference
+outputs = session.run(
+    None,
+    {"tokens": tokens, "tokens_len": tokens_len}
+)
+audio, audio_len = outputs  # float32 waveform and per-item sample counts
+print(f"Generated audio: {audio.shape}")  # expected (1, 17640): 10 frames x 1764 samples
+```
+### Integration with KaniTTS
+```python
+from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized
+# Initialize decoder
+decoder = ONNXKaniTTSDecoderOptimized(
+    onnx_model_path="nano_codec_decoder.onnx",
+    device="cuda"
+)
+# Decode frame (4 codec tokens)
+codes = [100, 200, 300, 400]
+audio = decoder.decode_frame(codes)  # Returns int16 numpy array
+```
+## Model Architecture
+The decoder consists of two stages:
+1. **Dequantization (FSQ):** Converts token indices to latent representation
+   - Input: [batch, 4, frames] → Output: [batch, 16, frames]
+2. **Audio Decoder (HiFiGAN):** Generates audio from latents
+   - Input: [batch, 16, frames] → Output: [batch, samples]
+   - Upsampling factor: ~1764x (one 80 ms frame at 22050 Hz is 1764 samples)
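The frame-to-sample arithmetic above can be sketched as follows. This assumes the decoder emits exactly 1764 samples per frame with no padding or trimming, which this card does not confirm; the helper names are illustrative.

```python
SAMPLE_RATE_HZ = 22050
FRAMES_PER_SECOND = 12.5  # taken from the source model name (12.5fps)

# 22050 / 12.5 = 1764 samples per codec frame
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAMES_PER_SECOND)


def expected_num_samples(num_frames: int) -> int:
    """Number of output samples for a given number of codec frames (assumes no padding)."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return num_frames * SAMPLES_PER_FRAME


def duration_seconds(num_frames: int) -> float:
    """Playback duration of the decoded audio in seconds."""
    return num_frames / FRAMES_PER_SECOND


if __name__ == "__main__":
    frames = 10
    print(expected_num_samples(frames), "samples,", duration_seconds(frames), "seconds")
```

For the 10-frame input in the Usage example this gives 17640 samples, matching the shape noted there.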
+## Export Details
+- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
+- **Export Method:** PyTorch → ONNX (legacy exporter)
+- **Opset Version:** 14
+- **Dynamic Axes:** Frame dimension and audio samples
+- **Optimizations:** Graph optimization enabled, constant folding
+## Use Cases
+- **Text-to-Speech Systems:** Fast neural codec decoding
+- **Real-time Audio Generation:** Faster-than-real-time decoding on GPU
+- **Streaming TTS:** Low-latency frame-by-frame decoding
+- **KaniTTS Integration:** Drop-in replacement for PyTorch decoder
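For streaming use, codes that arrive one frame at a time have to be arranged into the `[batch, 4, num_frames]` layout the model expects. A minimal sketch, assuming each frame arrives as a list of four codes as in the KaniTTS example; `pack_frames` is a hypothetical helper, not an API of this repository or of KaniTTS.

```python
import numpy as np

NUM_CODEBOOKS = 4


def pack_frames(frames):
    """Stack per-frame code lists into (tokens, tokens_len) for the decoder.

    frames: sequence of length num_frames, each item holding 4 codebook indices.
    Returns tokens with shape [1, 4, num_frames] and tokens_len with shape [1],
    both int64.
    """
    arr = np.asarray(frames, dtype=np.int64)
    if arr.ndim != 2 or arr.shape[1] != NUM_CODEBOOKS:
        raise ValueError(f"expected a [num_frames, {NUM_CODEBOOKS}] sequence of codes")
    # Transpose to [4, num_frames], then add the batch dimension in front.
    tokens = np.ascontiguousarray(arr.T[np.newaxis, :, :])
    tokens_len = np.array([arr.shape[0]], dtype=np.int64)
    return tokens, tokens_len


if __name__ == "__main__":
    tokens, tokens_len = pack_frames([[100, 200, 300, 400], [101, 201, 301, 401]])
    print(tokens.shape, tokens_len)
```

The returned pair can be passed to `session.run` under the input names `tokens` and `tokens_len` shown in the Usage section. Decoding a few frames per call instead of one trades a little latency for fewer runtime invocations.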
+## Requirements
+### GPU (Recommended)
+- CUDA 11.8+ or 12.x
+- cuDNN 8.x or 9.x
+- ONNX Runtime GPU: `pip install onnxruntime-gpu`
+### CPU
+- Any modern CPU
+- ONNX Runtime: `pip install onnxruntime`
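Which of the two setups is active can be checked at runtime before creating a session. The sketch below keeps the selection logic in a plain function so it works without ONNX Runtime installed; `pick_providers` is an illustrative helper, while `onnxruntime.get_available_providers()` is the library call that reports what the installed build supports.

```python
PREFERRED_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Return the preferred providers that are actually available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed")
    else:
        providers = pick_providers(ort.get_available_providers())
        print("Using providers:", providers)
```

If `CUDAExecutionProvider` is missing from the result on a GPU machine, the usual causes are a CPU-only `onnxruntime` install or mismatched CUDA/cuDNN versions.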
+## Inputs
+- **tokens** (int64): Codec token indices
+  - Shape: `[batch_size, 4, num_frames]`
+  - Range: `[0, 499]` (FSQ codebook indices)
+- **tokens_len** (int64): Number of frames
+  - Shape: `[batch_size]`
+  - Value: Number of frames in the sequence
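The input contract above can be checked before calling the model, which tends to give clearer errors than a failure inside the runtime. A minimal sketch, assuming the dtypes, shapes, and `[0, 499]` range listed in this card; `validate_inputs` is a hypothetical helper.

```python
import numpy as np

NUM_CODEBOOKS = 4
MAX_TOKEN = 499  # upper bound quoted in this card


def validate_inputs(tokens: np.ndarray, tokens_len: np.ndarray) -> None:
    """Raise TypeError or ValueError if the arrays do not match the documented inputs."""
    if tokens.dtype != np.int64 or tokens_len.dtype != np.int64:
        raise TypeError("tokens and tokens_len must both be int64")
    if tokens.ndim != 3 or tokens.shape[1] != NUM_CODEBOOKS:
        raise ValueError(f"tokens must have shape [batch, {NUM_CODEBOOKS}, num_frames]")
    if tokens_len.shape != (tokens.shape[0],):
        raise ValueError("tokens_len must have shape [batch]")
    if tokens.size and (tokens.min() < 0 or tokens.max() > MAX_TOKEN):
        raise ValueError(f"token indices must lie in [0, {MAX_TOKEN}]")
    if (tokens_len < 0).any() or (tokens_len > tokens.shape[2]).any():
        raise ValueError("tokens_len entries must lie in [0, num_frames]")


if __name__ == "__main__":
    tokens = np.random.randint(0, MAX_TOKEN + 1, (1, NUM_CODEBOOKS, 10), dtype=np.int64)
    validate_inputs(tokens, np.array([10], dtype=np.int64))
    print("inputs look valid")
```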
+## Outputs
+- **audio** (float32): Generated audio waveform
+  - Shape: `[batch_size, num_samples]`
+  - Range: `[-1.0, 1.0]`
+  - Sample rate: 22050 Hz
+- **audio_len** (int64): Audio length
+  - Shape: `[batch_size]`
+  - Value: Number of audio samples
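To listen to the result, the float32 output can be converted to 16-bit PCM and written with Python's standard `wave` module. A minimal sketch, assuming the `[-1.0, 1.0]` range and 22050 Hz mono format described above; `float_to_int16` and `save_wav` are illustrative names.

```python
import wave

import numpy as np

SAMPLE_RATE_HZ = 22050


def float_to_int16(audio: np.ndarray) -> np.ndarray:
    """Clip to [-1, 1] and scale to little-endian 16-bit PCM."""
    clipped = np.clip(audio, -1.0, 1.0)
    return np.round(clipped * 32767.0).astype("<i2")


def save_wav(target, audio: np.ndarray, audio_len: np.ndarray,
             sample_rate: int = SAMPLE_RATE_HZ) -> None:
    """Write the first batch item to a mono 16-bit WAV.

    target: a file path or a writable, seekable binary file object.
    audio: [batch, num_samples] float32; audio_len: [batch] valid sample counts.
    """
    pcm = float_to_int16(audio[0, : int(audio_len[0])])
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample (int16)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
```

With the `audio` and `audio_len` arrays from the Usage example, `save_wav("out.wav", audio, audio_len)` writes only the valid samples of the first batch item.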
+## Accuracy
+Compared to the PyTorch reference implementation:
+- **Mean Absolute Error:** 0.0087
+- **Correlation:** 1.000000 (to six decimal places)
+- **Relative Error:** 0.0006%
+Audio quality is virtually identical to the PyTorch version.
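Metrics of this kind can be computed for any pair of waveforms decoded from the same tokens. The sketch below covers mean absolute error and Pearson correlation only; this card does not define how its relative error was calculated, so that metric is left out. `compare_waveforms` is an illustrative helper, and the demo uses synthetic signals, not model output.

```python
import numpy as np


def compare_waveforms(reference, candidate) -> dict:
    """Return mean absolute error and Pearson correlation between two waveforms."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("waveforms must contain the same number of samples")
    mae = float(np.mean(np.abs(ref - cand)))
    correlation = float(np.corrcoef(ref, cand)[0, 1])
    return {"mae": mae, "correlation": correlation}


if __name__ == "__main__":
    # Synthetic stand-ins for the PyTorch and ONNX outputs.
    t = np.linspace(0.0, 1.0, 22050)
    reference = np.sin(2 * np.pi * 220.0 * t)
    candidate = reference + 1e-3 * np.cos(2 * np.pi * 50.0 * t)
    print(compare_waveforms(reference, candidate))
```

A correlation that rounds to 1.0 alongside a nonzero mean absolute error is expected: correlation is insensitive to small offsets and gain differences, while the absolute error is not.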
+## Limitations
+- Fixed sample rate (22050 Hz)
+- Single-channel (mono) audio only
+- Requires valid FSQ token indices (0-499 range)
+- Best performance on NVIDIA GPUs with CUDA support
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{nano-codec-decoder-onnx,
+  author = {Hariprasath28},
+  title = {NanoCodec Decoder - ONNX},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/Hariprasath28/nano-codec-decoder-onnx}
+}
+```
+Original NeMo NanoCodec:
+```bibtex
+@misc{nemo-nano-codec,
+  author = {NVIDIA},
+  title = {NeMo NanoCodec},
+  year = {2024},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps}
+}
+```
+## License
+Apache 2.0 (same as source model)
+## Links
+- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
+- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
+- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)
+## Acknowledgments
+- NVIDIA NeMo team for the original NanoCodec
+- ONNX Runtime team for the inference engine
+- KaniTTS team for the TTS system