Question on tokenizer input sampling rate & channels
#2, opened by Breeze-Zero
Hi,
When testing the demo, I noticed that audio at different sampling rates (e.g., 16 kHz, 44.1 kHz, 48 kHz), in both mono and stereo, runs without errors. Is a fixed sampling rate or channel count expected for best reconstruction quality? If not, does the pipeline resample or downmix internally, and do different sampling-rate or channel settings affect the results?
Thanks!
Hi, thanks for the great question. Here’s a breakdown of the behavior:
- Optimal Input: The model is trained on 16 kHz mono data. This is the recommended input format for the best results.
- Different Sampling Rates: test.py has no explicit resampling option and assumes 16 kHz mono input by default. If your input is not 16 kHz mono, resample it first.
- Different Channels: For stereo/multi-channel inputs, each channel is treated as a separate item in a batch. The model processes them independently and outputs a corresponding multi-channel file.
Conclusion: To get the most accurate reconstruction, please use a 16 kHz mono input file.
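If it helps, here is a minimal sketch of preparing a file before passing it to the tokenizer: downmix to mono, then resample to 16 kHz. The function names (`to_mono`, `resample_fft`, `prepare_for_tokenizer`) are hypothetical and not part of this repo, and the `(samples, channels)` array layout is an assumption. The resampler is a simple FFT-based one built on NumPy only; a dedicated resampler from an audio library would likely be a better choice in practice.

```python
import numpy as np

TARGET_SR = 16_000  # rate the model was trained on, per the reply above


def to_mono(audio):
    """Average all channels. Assumes shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 1:
        return audio  # already mono
    return audio.mean(axis=1)


def resample_fft(mono, orig_sr, target_sr=TARGET_SR):
    """Band-limited resampling by cropping or zero-padding the spectrum.

    Treats the clip as periodic, so edges of non-periodic audio may ring.
    """
    if orig_sr == target_sr:
        return mono
    n_in = mono.shape[0]
    n_out = int(round(n_in * target_sr / orig_sr))
    spectrum = np.fft.rfft(mono)
    # irfft crops (downsampling) or zero-pads (upsampling) the spectrum to fit n_out
    resampled = np.fft.irfft(spectrum, n=n_out)
    # irfft normalizes by n_out instead of n_in, so rescale to keep the amplitude
    return resampled * (n_out / n_in)


def prepare_for_tokenizer(audio, orig_sr):
    """Downmix first, then resample, so only one channel is resampled."""
    return resample_fft(to_mono(audio), orig_sr)


if __name__ == "__main__":
    # One second of a 440 Hz tone in 48 kHz stereo as stand-in input
    t = np.arange(48_000) / 48_000
    tone = np.sin(2 * np.pi * 440 * t)
    stereo = np.stack([tone, tone], axis=1)
    prepared = prepare_for_tokenizer(stereo, 48_000)
    print(prepared.shape)
```

Downmixing before calling the model also avoids the per-channel batch behavior described above, so the output is a single mono reconstruction rather than a multi-channel file.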