Question on tokenizer input sampling rate & channels

by Breeze-Zero - opened Oct 5, 2025

Hi,
When testing the demo, I noticed that audio with different sampling rates (e.g., 16 kHz, 44.1 kHz, 48 kHz), in both mono and stereo, runs without errors. Is a fixed sampling rate or channel count expected for best reconstruction quality? If not, does the pipeline resample/downmix internally, and do different SR/channel settings affect the results?
Thanks!

yongjielv

inclusionAI org Oct 7, 2025

Hi, thanks for the great question. Here’s a breakdown of the behavior:

Optimal Input: The model is trained on 16 kHz mono data. This is the recommended input format for the best results.
Different Sampling Rates: test.py has no resampling option and assumes the input is already 16 kHz mono. If your audio uses a different sampling rate, resample it to 16 kHz first.
Different Channels: For stereo/multi-channel inputs, each channel is treated as a separate item in a batch. The model processes them independently and outputs a corresponding multi-channel file.
Conclusion: To get the most accurate reconstruction, please use a 16 kHz mono input file.
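
As a rough illustration of the preprocessing recommended above, here is a minimal Python sketch that downmixes to mono and resamples to 16 kHz before the audio is passed to the tokenizer. It is not part of test.py: the helper name to_16k_mono, the choice of NumPy and SciPy's resample_poly, the (samples, channels) array layout, and downmixing by averaging the channels are all assumptions made for this example.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # rate the model was trained on, per the reply above


def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix (samples,) or (samples, channels) audio to mono at 16 kHz.

    Hypothetical helper for illustration only.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Average the channels; assumes a (samples, channels) layout.
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError("expected 1-D or 2-D audio")
    if sr != TARGET_SR:
        g = gcd(TARGET_SR, sr)
        # Polyphase resampling applies an anti-aliasing filter internally.
        audio = resample_poly(audio, TARGET_SR // g, sr // g)
    return audio.astype(np.float32)


# One second of a 440 Hz stereo tone at 48 kHz as stand-in input.
sr = 48_000
t = np.arange(sr) / sr
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)

mono_16k = to_16k_mono(stereo, sr)
print(mono_16k.shape, mono_16k.dtype)
```

The converted array can then be written to a 16 kHz mono file with whatever audio I/O library you already use and given to the demo as input. If you would rather keep the channels separate, skip the averaging step and resample each channel on its own.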
