ONNX export of the DistillNeuCodec encoder for lightweight voice cloning inference.
This is an ONNX-optimized encoder that produces speech codes compatible with the NeuTTS voice cloning pipeline. The encoder extracts acoustic and semantic features from reference audio to enable zero-shot voice cloning.
This ONNX export produces output codes identical to those of the original PyTorch model on every audio file tested:
| Test File | Duration | Codes | Match |
|---|---|---|---|
| dave.wav | 7.45s | 373 | ✅ 100% |
| jo.wav | 13.06s | 654 | ✅ 100% |
| nellie.wav | 7.33s | 367 | ✅ 100% |
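A parity check of this kind can be sketched as a position-by-position comparison of two code sequences. This is a minimal illustration, not the script used to produce the table: `code_match_rate` is a hypothetical helper, and the code values below are made up.

```python
def code_match_rate(reference_codes, candidate_codes):
    """Return the percentage of positions where two code sequences agree.

    Sequences of different length count as a full mismatch (0.0), because a
    different number of frames already means the outputs are not identical.
    """
    if len(reference_codes) != len(candidate_codes):
        return 0.0
    if not reference_codes:
        return 100.0
    matches = sum(1 for r, c in zip(reference_codes, candidate_codes) if r == c)
    return 100.0 * matches / len(reference_codes)


# Hypothetical code sequences, for illustration only.
pytorch_codes = [101, 7, 42, 42, 9]
onnx_codes = [101, 7, 42, 42, 9]
print(f"{code_match_rate(pytorch_codes, onnx_codes):.0f}% match")
```

In practice the two lists would come from running the same reference audio through the PyTorch encoder and through the ONNX session.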
```python
import numpy as np
import soundfile as sf
import onnxruntime

# Load model
sess = onnxruntime.InferenceSession("onnx/distill_neucodec_encoder.onnx")

# Load audio (must be 16kHz)
audio, sr = sf.read("reference.wav")
assert sr == 16000, f"Audio must be 16kHz, got {sr}Hz"

# IMPORTANT: Pre-pad to multiple of 320 samples
T = len(audio)
pad_for_wav = 320 - (T % 320)
audio = np.pad(audio, (0, pad_for_wav))

# Run inference
audio_input = audio[np.newaxis, np.newaxis, :].astype(np.float32)
codes = sess.run(None, {"audio": audio_input})[0].flatten().tolist()
print(f"Generated {len(codes)} codes")
```
| Name | Shape | Type | Description |
|---|---|---|---|
| Input: audio | [1, 1, T] | float32 | 16kHz audio, T must be divisible by 320 |
| Output: codes | [1, 1, F] | int32 | Speech codes, F ≈ T/320 |
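The input contract in the table can be wrapped in a small helper that turns a loaded waveform into a `[1, 1, T]` float32 array. This is a sketch under stated assumptions: `prepare_input` is a hypothetical name, and the mono downmix for multi-channel input is an assumption of this example, not something the model card specifies.

```python
import numpy as np

FRAME = 320  # samples per code frame, per the table above


def prepare_input(audio):
    """Shape a 16 kHz waveform into a [1, 1, T] float32 array, T divisible by 320."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumption: (samples, channels) layout, as soundfile.read returns for
        # multi-channel files. Averaging the channels yields a mono signal.
        audio = audio.mean(axis=1)
    if audio.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got {audio.ndim} dimensions")
    # Same padding formula as the usage example: always pads between 1 and 320 samples.
    pad = FRAME - (audio.shape[0] % FRAME)
    audio = np.pad(audio, (0, pad))
    return audio[np.newaxis, np.newaxis, :]


audio_input = prepare_input(np.zeros(1000))
print(audio_input.shape, audio_input.dtype)
```

The returned array can be passed as the `audio` input of the inference session. Sample-rate checking is left to the caller, as in the usage example.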
⚠️ Important: Input audio length must be padded to a multiple of 320 samples before inference:

```python
T = len(audio)
pad_for_wav = 320 - (T % 320)
audio = np.pad(audio, (0, pad_for_wav))
```
This matches the behavior of the original PyTorch model's _prepare_audio() function.
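The arithmetic behind this padding rule can be checked without loading any audio. The sketch below assumes one code per 320-sample frame of the padded input, and it derives sample counts from the rounded durations in the results table, so those counts are approximations. `padded_length` and `expected_num_codes` are hypothetical helper names.

```python
FRAME = 320          # samples per code frame
SAMPLE_RATE = 16000  # Hz, the required input rate


def padded_length(num_samples):
    """Length after applying pad_for_wav = 320 - (T % 320)."""
    pad = FRAME - (num_samples % FRAME)
    return num_samples + pad


def expected_num_codes(num_samples):
    # Assumption: the encoder emits one code per 320 padded samples.
    return padded_length(num_samples) // FRAME


# Durations as listed in the results table (rounded to 10 ms).
for name, seconds in [("dave.wav", 7.45), ("jo.wav", 13.06), ("nellie.wav", 7.33)]:
    n = round(seconds * SAMPLE_RATE)
    print(name, n, expected_num_codes(n))
```

Under these assumptions the loop gives 373, 654 and 367 codes, the same counts as the results table. Note that the formula never pads by zero: when `T` is already a multiple of 320 (as 13.06 s × 16000 = 208960 would be), it appends a full extra frame of 320 samples.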
```
onnx/
├── distill_neucodec_encoder.onnx       # ONNX model
└── distill_neucodec_encoder.onnx.data  # External weights
```
```
onnxruntime>=1.16.0
soundfile
numpy
```
Apache 2.0 - same as the base model.
Base model: neuphonic/distill-neucodec