Future Suggestions

#1
by yukiarimo - opened

Bruh, 16 kHz isn’t high-fidelity. High fidelity is 48 kHz. So misleading

inclusionAI org

Thank you for your attention to our work. You've raised a very important point, and we appreciate you highlighting it.

You are absolutely right that in the audio community, "high-fidelity" (Hi-Fi) is standardly associated with sample rates of 44.1 kHz or 48 kHz. Our use of the term was meant to describe the reconstruction fidelity of the tokenizer—that is, its ability to faithfully and accurately reconstruct the original 16 kHz input signal, which it does with high quality. We'll be more precise in the model card, using terms like "high-quality reconstruction" to avoid confusion.
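As a side note on the sample-rate point: by the Nyquist theorem, a signal sampled at rate fs can only represent frequencies up to fs/2, which is why 16 kHz audio (capped at 8 kHz of content) falls well short of the ~20 kHz range of human hearing that 44.1/48 kHz covers. A minimal sketch (illustrative only, not part of our codebase):

```python
# Sketch: why 16 kHz sampling is not "hi-fi" in the usual sense.
# The Nyquist limit says a sample rate fs can only capture
# frequency content up to fs / 2.

def nyquist(fs_hz: int) -> float:
    """Highest representable frequency (Hz) for sample rate fs_hz."""
    return fs_hz / 2

for fs in (16_000, 44_100, 48_000):
    print(f"{fs:>6} Hz sampling -> content up to {nyquist(fs) / 1000:.2f} kHz")
```

At 16 kHz the ceiling is 8 kHz, while 48 kHz reaches 24 kHz, beyond the upper limit of human hearing.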

We are currently training higher sample rate versions of the model.
Thanks again for your engagement and for helping us improve the project. Stay tuned!

Best regards,
Ming team

Okay! A little request/suggestion from me for the next model:

  • 2B and 4B sizes would be great, as dense models. Please make dense ones too, not only MoE!
  • I tried the base models, but they aren’t great. So, make an instruct version, plus an additional variant trained only on instruct multimodal data, without RL and DPO
  • Both audio and images+video in a single model. Audio at 48 kHz; please do not use voice cloning, so it will be easier to run full SFT and cause catastrophic forgetting. For vision, use SigLIP with native resolution (or anything else, but custom resolution is important). Video support could probably come from feeding multiple images at 1 fps, I guess.
  • Text and audio output at most. No image or other generation, please!
  • VERY IMPORTANT: NO DIFFUSION!!!

Thanks!

yukiarimo changed discussion title from Losers to Future Suggestions
inclusionAI org

Thank you for sharing your detailed feedback. You've raised several important points regarding model architecture and training preferences.
We will review these suggestions as we finalize our roadmap for future versions.
Really appreciate you sharing this with us. Stay tuned for what's coming next!
