---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---

# NanoCodec Decoder - ONNX

ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.

This model provides **about 2.6x faster GPU inference** than the PyTorch version for KaniTTS and similar TTS systems.

## Model Details

- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (Opset 14)
- **Input:** Token indices [batch, 4, num_frames]
- **Output:** Audio waveform [batch, samples] @ 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of full model)

## Performance

| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | ~1.2-1.5x faster |

**Real-Time Factor (RTF):** 0.44 on GPU (~35 ms to decode each 80 ms frame, so audio is generated faster than playback!)

## Quick Start

### Installation

```bash
pip install onnxruntime-gpu numpy
```

For CPU-only:

```bash
pip install onnxruntime numpy
```

### Usage

```python
import numpy as np
import onnxruntime as ort

# Load model (falls back to CPU if the CUDA provider is unavailable)
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Prepare input
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)

audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples
```

### Integration with KaniTTS

```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)

# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns int16 numpy array
```

## Model Architecture

The decoder consists of two stages:

1. **Dequantization (FSQ):** Converts token indices to a latent representation
   - Input: [batch, 4, frames] → Output: [batch, 16, frames]
2. **Audio Decoder (HiFiGAN):** Generates audio from latents
   - Input: [batch, 16, frames] → Output: [batch, samples]
   - Upsampling factor: ~1764x (80 ms per frame at 22050 Hz)

## Export Details

- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding

## Use Cases

- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Faster-than-real-time decoding on GPU
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for the PyTorch decoder

## Requirements

### GPU (Recommended)

- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`

### CPU

- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`

## Inputs

- **tokens** (int64): Codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)
- **tokens_len** (int64): Number of frames
  - Shape: `[batch_size]`
  - Value: Number of frames in the sequence

## Outputs

- **audio** (float32): Generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz
- **audio_len** (int64): Audio length
  - Shape: `[batch_size]`
  - Value: Number of audio samples

## Accuracy

Compared to the PyTorch reference implementation:

- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000 (perfect)
- **Relative Error:** 0.0006%

Audio quality is virtually identical to the PyTorch version.
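
A comparison of this kind could be reproduced along the following lines. This is a minimal sketch, not the script used to produce the figures above: `compare_waveforms` is a hypothetical helper, and the relative-error definition (error energy divided by reference energy, in percent) is an assumption, since this card does not state how that number was computed.

```python
import numpy as np


def compare_waveforms(reference, candidate):
    """Compare two waveforms and return simple agreement metrics.

    `reference` would be the PyTorch decoder output and `candidate` the
    ONNX decoder output for the same tokens (hypothetical helper).
    """
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()

    # Trim to the common length in case the outputs differ by a few samples.
    n = min(ref.size, cand.size)
    ref, cand = ref[:n], cand[:n]

    diff = ref - cand
    mae = float(np.mean(np.abs(diff)))
    # Pearson correlation; undefined (NaN) if either signal is constant.
    correlation = float(np.corrcoef(ref, cand)[0, 1])
    # ASSUMPTION: relative error = error energy / reference energy, in percent.
    relative_error_pct = float(np.sum(diff ** 2) / np.sum(ref ** 2) * 100.0)

    return {
        "mae": mae,
        "correlation": correlation,
        "relative_error_pct": relative_error_pct,
    }


if __name__ == "__main__":
    # Synthetic stand-ins for the two decoder outputs (not real model audio).
    rng = np.random.default_rng(0)
    t = np.arange(17640) / 22050.0
    reference = 0.5 * np.sin(2 * np.pi * 220.0 * t)
    candidate = reference + rng.normal(0.0, 1e-4, size=reference.shape)
    print(compare_waveforms(reference, candidate))
```

In practice, `reference` and `candidate` would be the float32 `audio` arrays produced by the two decoders from identical `tokens`, ideally averaged over several real utterances instead of random tokens.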

## Limitations

- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support

## License

Apache 2.0 (same as source model)

## Links

- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)

## Acknowledgments

- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system