---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---
# NanoCodec Decoder - ONNX
ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.
This model provides roughly **2.6x faster inference** compared to the PyTorch version for KaniTTS and similar TTS systems (see the benchmarks below).
## Model Details
- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (Opset 14)
- **Input:** Token indices [batch, 4, num_frames]
- **Output:** Audio waveform [batch, samples] @ 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of full model)
## Performance
| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |
**Real-Time Factor (RTF):** 0.44x on GPU (generates audio faster than playback!)
## Quick Start
### Installation
```bash
pip install onnxruntime-gpu numpy
```
For CPU-only:
```bash
pip install onnxruntime numpy
```
### Usage
```python
import numpy as np
import onnxruntime as ort
# Load model
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Prepare input: token indices in the FSQ codebook range [0, 499]
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len},
)
audio, audio_len = outputs
print(f"Generated audio: {audio.shape}") # [1, 17640] samples
```
### Integration with KaniTTS
```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized
# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda",
)
# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes) # Returns int16 numpy array
```
## Model Architecture
The decoder consists of two stages:
1. **Dequantization (FSQ):** Converts token indices to latent representation
- Input: [batch, 4, frames] → Output: [batch, 16, frames]
2. **Audio Decoder (HiFiGAN):** Generates audio from latents
- Input: [batch, 16, frames] → Output: [batch, samples]
- Upsampling factor: ~1764x (80ms per frame at 22050 Hz)
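The frame-to-sample arithmetic above can be checked directly: at 12.5 frames per second and a 22050 Hz output rate, each frame expands to 22050 / 12.5 = 1764 samples, i.e. 80 ms of audio. A quick sketch:

```python
# Frame/sample arithmetic for a 22050 Hz, 12.5 fps codec.
SAMPLE_RATE = 22050
FRAME_RATE = 12.5  # frames per second

samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)  # upsampling factor
frame_duration_ms = 1000 * samples_per_frame / SAMPLE_RATE

print(samples_per_frame)       # 1764
print(frame_duration_ms)       # 80.0
print(10 * samples_per_frame)  # 17640 -- matches the Quick Start example
```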
## Export Details
- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding
## Use Cases
- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Sub-realtime performance on GPU
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for PyTorch decoder
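For streaming use, one simple approach is to slice an incoming `[1, 4, T]` token tensor into per-frame `[1, 4, 1]` chunks and feed them to the session one at a time. A minimal sketch of the slicing step (the helper name is mine, and the ONNX session call is omitted):

```python
import numpy as np

def frame_chunks(tokens: np.ndarray):
    """Yield [batch, 4, 1] slices from a [batch, 4, num_frames] token tensor."""
    _, _, num_frames = tokens.shape
    for t in range(num_frames):
        yield tokens[:, :, t:t + 1]  # slice keeps the frame axis

tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
chunks = list(frame_chunks(tokens))
print(len(chunks), chunks[0].shape)  # 10 (1, 4, 1)
```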
## Requirements
### GPU (Recommended)
- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`
### CPU
- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`
## Inputs
- **tokens** (int64): Codec token indices
- Shape: `[batch_size, 4, num_frames]`
- Range: `[0, 499]` (FSQ codebook indices)
- **tokens_len** (int64): Number of frames
- Shape: `[batch_size]`
- Value: Number of frames in the sequence
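It can help to validate inputs against this contract before calling the session. A small sketch of such a check (the helper name is mine, not part of the model API):

```python
import numpy as np

def validate_inputs(tokens: np.ndarray, tokens_len: np.ndarray) -> None:
    # Checks mirror the input spec above: int64 dtypes, [batch, 4, frames]
    # shape, FSQ indices in [0, 499], one length entry per batch item.
    assert tokens.dtype == np.int64 and tokens_len.dtype == np.int64
    assert tokens.ndim == 3 and tokens.shape[1] == 4, "expected [batch, 4, num_frames]"
    assert tokens.min() >= 0 and tokens.max() <= 499, "FSQ indices must be in [0, 499]"
    assert tokens_len.shape == (tokens.shape[0],), "one length per batch item"

tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
validate_inputs(tokens, np.array([10], dtype=np.int64))  # raises on bad input
```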
## Outputs
- **audio** (float32): Generated audio waveform
- Shape: `[batch_size, num_samples]`
- Range: `[-1.0, 1.0]`
- Sample rate: 22050 Hz
- **audio_len** (int64): Audio length
- Shape: `[batch_size]`
- Value: Number of audio samples
## Accuracy
Compared to PyTorch reference implementation:
- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000 (perfect)
- **Relative Error:** 0.0006%
Audio quality is virtually identical to the PyTorch version.
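These metrics are straightforward to reproduce for any pair of waveforms. A sketch comparing a reference array with a slightly perturbed copy (synthetic data, not the actual model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.standard_normal(22050).astype(np.float32)  # stand-in for PyTorch output
candidate = reference + 1e-4 * rng.standard_normal(22050).astype(np.float32)

# Mean absolute error and Pearson correlation, as reported above.
mae = float(np.mean(np.abs(candidate - reference)))
corr = float(np.corrcoef(candidate, reference)[0, 1])

print(f"MAE: {mae:.6f}, correlation: {corr:.6f}")
```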
## Limitations
- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support
## License
Apache 2.0 (same as source model)
## Links
- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)
## Acknowledgments
- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system