---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---

# NanoCodec Decoder - ONNX

ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.

This model provides roughly **2.6x faster inference** on GPU compared to the PyTorch version for KaniTTS and similar TTS systems.

## Model Details

- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (opset 14)
- **Input:** Token indices `[batch, 4, num_frames]`
- **Output:** Audio waveform `[batch, samples]` at 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only, 15.8% of the full model)

## Performance

| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |

**Real-Time Factor (RTF):** 0.44x on GPU, i.e. audio is generated faster than it plays back.
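
The RTF figure follows directly from the table above; a quick sanity check using the table's numbers (no measurement is performed here):

```python
# Derive the RTF above from the codec's 12.5 fps frame rate
# and the measured ~35 ms GPU decode time from the table.
decode_ms_per_frame = 35.0        # ONNX + GPU, per the table above
audio_ms_per_frame = 1000 / 12.5  # 12.5 fps codec -> 80 ms of audio per frame

rtf = decode_ms_per_frame / audio_ms_per_frame
print(f"RTF = {rtf:.2f}")  # RTF = 0.44 (below 1.0 means faster than playback)
```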

## Quick Start

### Installation

```bash
pip install onnxruntime-gpu numpy
```

For CPU-only:

```bash
pip install onnxruntime numpy
```

### Usage

```python
import numpy as np
import onnxruntime as ort

# Load the model, preferring CUDA and falling back to CPU
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Prepare input: one index per codebook, each in [0, 499]
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)

audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples (10 frames x 1764)
```
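
The card does not prescribe how to persist the decoder output; one common approach, sketched here with only NumPy and the standard-library `wave` module, is to convert the float32 waveform in `[-1.0, 1.0]` to 16-bit PCM:

```python
import wave

import numpy as np

def save_wav(path, audio, sample_rate=22050):
    """Write a mono float32 waveform in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper, not shipped with the model.
    """
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono, matching the decoder output
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)  # 22050 Hz, fixed for this codec
        f.writeframes(pcm.tobytes())

# Example: one second of silence at the codec's sample rate
save_wav("output.wav", np.zeros(22050, dtype=np.float32))
```

For the session output above, `audio` has shape `[batch, samples]`, so pass `audio[0]` to write the first batch item.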

### Integration with KaniTTS

```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize the decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)

# Decode a single frame (4 codec tokens, one per codebook)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns an int16 numpy array
```

## Model Architecture

The decoder consists of two stages:

1. **Dequantization (FSQ):** Converts token indices to a latent representation
   - Input: `[batch, 4, frames]` → Output: `[batch, 16, frames]`

2. **Audio Decoder (HiFi-GAN):** Generates audio from the latents
   - Input: `[batch, 16, frames]` → Output: `[batch, samples]`
   - Upsampling factor: 1764x (80 ms per frame at 22050 Hz)

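The upsampling factor is exactly the ratio of sample rate to codec frame rate, which also explains the output size in the Quick Start example:

```python
# Derive the per-frame figures from the codec configuration.
sample_rate = 22050  # Hz, fixed output rate
frame_rate = 12.5    # codec frames per second

samples_per_frame = sample_rate / frame_rate  # 1764.0 samples per frame
ms_per_frame = 1000 / frame_rate              # 80.0 ms of audio per frame
total_samples = 10 * samples_per_frame        # 17640.0, as in the Quick Start example
print(samples_per_frame, ms_per_frame, total_samples)
```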

## Export Details

- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** Frame dimension and audio samples
- **Optimizations:** Graph optimization enabled, constant folding

## Use Cases

- **Text-to-Speech Systems:** Fast neural codec decoding
- **Real-time Audio Generation:** Faster-than-real-time performance on GPU (RTF 0.44)
- **Streaming TTS:** Low-latency frame-by-frame decoding
- **KaniTTS Integration:** Drop-in replacement for the PyTorch decoder

## Requirements

### GPU (Recommended)

- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`

### CPU

- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`

## Inputs

- **tokens** (int64): Codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)

- **tokens_len** (int64): Number of valid frames per batch item
  - Shape: `[batch_size]`

## Outputs

- **audio** (float32): Generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz

- **audio_len** (int64): Number of valid audio samples per batch item
  - Shape: `[batch_size]`

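A small pre-flight check against this I/O contract can catch shape and range mistakes before they reach the ONNX session. The helper below is hypothetical (not shipped with the model), written against the shapes and ranges documented above:

```python
import numpy as np

def validate_tokens(tokens, tokens_len):
    """Check decoder inputs against the documented I/O contract.

    Hypothetical pre-flight helper based on this card, not part of the model.
    """
    assert tokens.dtype == np.int64 and tokens_len.dtype == np.int64, "inputs must be int64"
    assert tokens.ndim == 3 and tokens.shape[1] == 4, "expected [batch, 4, num_frames]"
    assert tokens_len.shape == (tokens.shape[0],), "one length per batch item"
    assert tokens.min() >= 0 and tokens.max() <= 499, "FSQ indices must be in [0, 499]"
    assert (tokens_len <= tokens.shape[2]).all(), "lengths cannot exceed num_frames"

# Valid example: passes silently
validate_tokens(
    np.random.randint(0, 500, (1, 4, 10), dtype=np.int64),
    np.array([10], dtype=np.int64),
)
```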

## Accuracy

Compared to the PyTorch reference implementation:

- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000
- **Relative Error:** 0.0006%

Audio quality is virtually identical to the PyTorch version.
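
Metrics of this kind can be computed as follows (a minimal sketch with NumPy, not the exact evaluation script used for the numbers above):

```python
import numpy as np

def compare_waveforms(ref, test):
    """Return (MAE, Pearson correlation, relative error in percent)
    between two waveforms. Hypothetical helper for illustration."""
    mae = float(np.mean(np.abs(ref - test)))
    corr = float(np.corrcoef(ref, test)[0, 1])
    rel_err = float(np.linalg.norm(ref - test) / np.linalg.norm(ref) * 100)
    return mae, corr, rel_err

# Identical waveforms give the degenerate best case: MAE 0, correlation 1
wave_a = np.sin(np.linspace(0, 2 * np.pi, 1000))
mae, corr, rel_err = compare_waveforms(wave_a, wave_a.copy())
```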

## Limitations

- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support

## License

Apache 2.0 (same as the source model)

## Links

- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)

## Acknowledgments

- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system