Question on tokenizer input sampling rate & channels

#2
by Breeze-Zero - opened

Hi,
When testing the demo, I noticed that audio with different sampling rates (e.g., 16 kHz, 44.1 kHz, 48 kHz), in both mono and stereo, runs without errors. Is a fixed sampling rate or channel count expected for best reconstruction quality? If not, does the pipeline resample or downmix internally, and do different sampling-rate/channel settings affect the results?
Thanks!

inclusionAI org

Hi, thanks for the great question. Here’s a breakdown of the behavior:

  • Optimal Input: The model is trained on 16 kHz mono data. This is the recommended input format for the best results.
  • Different Sampling Rates: test.py has no explicit resampling option and assumes 16 kHz mono input by default. If your input is not 16 kHz, resample it first.
  • Different Channels: For stereo/multi-channel inputs, each channel is treated as a separate item in a batch. The model processes them independently and outputs a corresponding multi-channel file.
Conclusion: To get the most accurate reconstruction, please use a 16 kHz mono input file.
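
For anyone who needs to prepare files by hand, here is a minimal sketch of the preprocessing described above, using NumPy and SciPy. It is not taken from test.py: the helper name `to_16k_mono` is hypothetical, channel averaging is just one common way to downmix, and loading or saving the audio file is left out. It assumes the audio is already in memory as a float array shaped (samples,) or (samples, channels).

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate the reply above recommends


def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Hypothetical helper: downmix to mono, then resample to 16 kHz.

    Assumes `audio` is shaped (samples,) or (samples, channels).
    """
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 2:
        # Average the channels (an assumption; other downmixes are possible).
        audio = audio.mean(axis=1)
    if sr == TARGET_SR:
        return audio
    # Reduce the rate ratio to small integers for the polyphase resampler,
    # which applies an anti-aliasing low-pass filter internally.
    g = gcd(TARGET_SR, sr)
    return resample_poly(audio, TARGET_SR // g, sr // g)


# Example: one second of 48 kHz stereo (440 Hz left, 660 Hz right).
t = np.arange(48_000) / 48_000
stereo = np.stack(
    [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)], axis=1
)
mono_16k = to_16k_mono(stereo, 48_000)
print(mono_16k.shape)
```

If you want to keep the channels separate instead of downmixing, you could skip the averaging step and resample each channel on its own, since the model is said to process channels independently.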
