---
language:
- en
license: apache-2.0
tags:
- audio
- text-to-speech
- tts
- onnx
- decoder
- codec
pipeline_tag: text-to-speech
---

# NanoCodec Decoder - ONNX

ONNX-optimized decoder for the [NeMo NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) audio codec.

This model provides roughly **2.5x faster inference** than the PyTorch version for KaniTTS and similar TTS systems.

## Model Details

- **Model Type:** Audio Codec Decoder
- **Format:** ONNX (opset 14)
- **Input:** Token indices `[batch, 4, num_frames]`
- **Output:** Audio waveform `[batch, samples]` at 22050 Hz
- **Size:** 122 MB
- **Parameters:** ~31.5M (decoder only; 15.8% of the full model)

## Performance

| Configuration | Decode Time/Frame | Speedup |
|---------------|-------------------|---------|
| PyTorch + GPU | ~92 ms | Baseline |
| **ONNX + GPU** | **~35 ms** | **2.6x faster** ✨ |
| ONNX + CPU | ~60-80 ms | 1.2x faster |

**Real-Time Factor (RTF):** 0.44x on GPU (generates audio faster than playback)
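
The RTF figure follows directly from the per-frame numbers above; a quick sanity check, assuming the ~35 ms ONNX+GPU decode time from the table:

```python
# RTF = time spent decoding / duration of audio produced.
decode_ms_per_frame = 35.0  # ONNX + GPU decode time from the table above
frame_ms = 80.0             # each frame covers 80 ms of audio (12.5 fps codec)

rtf = decode_ms_per_frame / frame_ms
print(f"RTF: {rtf:.2f}")  # RTF: 0.44 -- values below 1.0 are faster than real time
```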

## Quick Start

### Installation

```bash
pip install onnxruntime-gpu numpy
```

For CPU-only:

```bash
pip install onnxruntime numpy
```

### Usage

```python
import numpy as np
import onnxruntime as ort

# Load the model; falls back to CPU if CUDA is unavailable
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Prepare input: [batch, codebooks, frames], FSQ indices in [0, 499]
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)
tokens_len = np.array([10], dtype=np.int64)

# Run inference
audio, audio_len = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len},
)

print(f"Generated audio: {audio.shape}")  # [1, 17640] samples
```
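
To listen to the result, the float32 output can be written to a 16-bit WAV file with the standard library. `save_wav` below is a hypothetical helper, not part of this repo:

```python
import wave

import numpy as np

def save_wav(audio: np.ndarray, path: str, sample_rate: int = 22050) -> None:
    """Write a float32 waveform in [-1.0, 1.0] to a mono 16-bit WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono output
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

save_wav(np.zeros(22050, dtype=np.float32), "output.wav")  # one second of silence
```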

### Integration with KaniTTS

```python
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize the decoder wrapper
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda",
)

# Decode one frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # returns an int16 numpy array
```

## Model Architecture

The decoder consists of two stages:

1. **Dequantization (FSQ):** converts token indices to a latent representation
   - Input: `[batch, 4, frames]` → Output: `[batch, 16, frames]`

2. **Audio Decoder (HiFi-GAN):** generates audio from the latents
   - Input: `[batch, 16, frames]` → Output: `[batch, samples]`
   - Upsampling factor: 1764x (80 ms of audio per frame at 22050 Hz)
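
The shapes in the Quick Start example follow from this upsampling factor; a small arithmetic check, using the sample rate and frame rate stated above:

```python
sample_rate = 22050  # Hz
frame_rate = 12.5    # codec frames per second

samples_per_frame = sample_rate / frame_rate  # 1764.0 samples = 80 ms of audio

num_frames = 10
num_samples = int(num_frames * samples_per_frame)
print(num_samples)  # 17640, matching the [1, 17640] output in the Quick Start
```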

## Export Details

- **Source Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **Export Method:** PyTorch → ONNX (legacy exporter)
- **Opset Version:** 14
- **Dynamic Axes:** frame dimension and audio samples
- **Optimizations:** graph optimization and constant folding enabled

## Use Cases

- **Text-to-Speech Systems:** fast neural codec decoding
- **Real-Time Audio Generation:** faster-than-real-time performance on GPU (RTF 0.44x)
- **Streaming TTS:** low-latency frame-by-frame decoding
- **KaniTTS Integration:** drop-in replacement for the PyTorch decoder

## Requirements

### GPU (Recommended)

- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU: `pip install onnxruntime-gpu`

### CPU

- Any modern CPU
- ONNX Runtime: `pip install onnxruntime`

## Inputs

- **tokens** (int64): codec token indices
  - Shape: `[batch_size, 4, num_frames]`
  - Range: `[0, 499]` (FSQ codebook indices)

- **tokens_len** (int64): number of frames
  - Shape: `[batch_size]`
  - Value: number of frames in each sequence
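
A small guard against malformed inputs can save debugging time. `validate_tokens` is a hypothetical helper that checks the contract above before calling the model:

```python
import numpy as np

def validate_tokens(tokens: np.ndarray) -> None:
    """Raise ValueError if a token tensor violates the decoder's input contract."""
    if tokens.dtype != np.int64:
        raise ValueError(f"tokens must be int64, got {tokens.dtype}")
    if tokens.ndim != 3 or tokens.shape[1] != 4:
        raise ValueError(f"expected shape [batch, 4, frames], got {tokens.shape}")
    if tokens.min() < 0 or tokens.max() > 499:
        raise ValueError("FSQ codebook indices must be in [0, 499]")

validate_tokens(np.random.randint(0, 500, (1, 4, 10), dtype=np.int64))  # passes
```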

## Outputs

- **audio** (float32): generated audio waveform
  - Shape: `[batch_size, num_samples]`
  - Range: `[-1.0, 1.0]`
  - Sample rate: 22050 Hz

- **audio_len** (int64): audio length
  - Shape: `[batch_size]`
  - Value: number of valid audio samples per item
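
When batching sequences of different lengths, `audio_len` marks how much of each padded row is valid; a sketch of trimming the batch, with dummy arrays standing in for real decoder output:

```python
import numpy as np

# Dummy decoder outputs: a padded batch and per-item valid lengths
audio = np.zeros((2, 17640), dtype=np.float32)
audio_len = np.array([17640, 8820], dtype=np.int64)

# Keep only the valid prefix of each row
clips = [audio[i, : int(n)] for i, n in enumerate(audio_len)]
print([c.shape for c in clips])  # [(17640,), (8820,)]
```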

## Accuracy

Compared to the PyTorch reference implementation:

- **Mean Absolute Error:** 0.0087
- **Correlation:** 1.000000
- **Relative Error:** 0.0006%

Audio quality is virtually identical to the PyTorch version.
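
These metrics can be reproduced with a few lines of NumPy, given matching PyTorch and ONNX waveforms (`ref` and the perturbed copy below are illustrative arrays, not real model output):

```python
import numpy as np

def compare_waveforms(ref: np.ndarray, out: np.ndarray) -> tuple[float, float]:
    """Return (mean absolute error, Pearson correlation) between two waveforms."""
    mae = float(np.mean(np.abs(ref - out)))
    corr = float(np.corrcoef(ref, out)[0, 1])
    return mae, corr

ref = np.sin(np.linspace(0, 10, 1000)).astype(np.float32)
mae, corr = compare_waveforms(ref, ref + 1e-4)
print(f"MAE={mae:.6f}  corr={corr:.6f}")
```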

## Limitations

- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (range 0-499)
- Best performance on NVIDIA GPUs with CUDA support

## Citation

If you use this model, please cite:

```bibtex
@misc{nano-codec-decoder-onnx,
  author    = {Hariprasath28},
  title     = {NanoCodec Decoder - ONNX},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Hariprasath28/nano-codec-decoder-onnx}
}
```

Original NeMo NanoCodec:

```bibtex
@misc{nemo-nano-codec,
  author    = {NVIDIA},
  title     = {NeMo NanoCodec},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps}
}
```

## License

Apache 2.0 (same as the source model)

## Links

- **Original Model:** [nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps)
- **KaniTTS:** [nineninesix/kani-tts-400m-en](https://huggingface.co/nineninesix/kani-tts-400m-en)
- **ONNX Runtime:** [onnxruntime.ai](https://onnxruntime.ai/)

## Acknowledgments

- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system