---
license: apache-2.0
---

A [SoundStream](https://arxiv.org/abs/2107.03312) decoder to reconstruct audio from a mel-spectrogram.

## Overview

This model is a SoundStream decoder that inverts mel-spectrograms computed with the specific hyperparameters defined in the example below. It was trained on music data and used in [Multi-instrument Music Synthesis with Spectrogram Diffusion](https://arxiv.org/abs/2206.05408) (ISMIR 2022).

A typical use case is to simplify music generation by predicting mel-spectrograms (instead of a raw waveform) and then using this model to reconstruct the audio.

If you use it, please consider citing:

```bibtex
@article{zeghidour2021soundstream,
  title={Soundstream: An end-to-end neural audio codec},
  author={Zeghidour, Neil and Luebs, Alejandro and Omran, Ahmed and Skoglund, Jan and Tagliasacchi, Marco},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={30},
  pages={495--507},
  year={2021},
  publisher={IEEE}
}
```

## Example Use

```python
import numpy as np
from diffusers import OnnxRuntimeModel

# Mel-spectrogram hyperparameters this decoder expects.
SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 320
WIN_LENGTH = 640
N_MEL_CHANNELS = 128
MEL_FMIN = 0.0
MEL_FMAX = SAMPLE_RATE // 2
CLIP_VALUE_MIN = 1e-5
CLIP_VALUE_MAX = 1e8

# Placeholder: a mel-spectrogram (NumPy array) computed with the settings above.
mel = ...

# Load the ONNX decoder from the Hugging Face Hub.
melgan = OnnxRuntimeModel.from_pretrained("kashif/soundstream_mel_decoder")

# Cast the input to float32 and reconstruct the audio.
audio = melgan(input_features=mel.astype(np.float32))
```
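
The example leaves `mel = ...` unspecified. As a rough illustration of how the listed hyperparameters fit together, the following NumPy-only sketch computes a clipped log-mel spectrogram from a mono waveform. Several details are assumptions rather than something this card states: the HTK-style mel scale, the symmetric Hann window, the use of magnitude (not power) spectra, the natural logarithm, and the `(frames, mel_channels)` output layout. The helper names `hz_to_mel`, `mel_filterbank`, and `log_mel_spectrogram` are hypothetical.

```python
import numpy as np

# Hyperparameters copied from the example above.
SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 320
WIN_LENGTH = 640
N_MEL_CHANNELS = 128
MEL_FMIN = 0.0
MEL_FMAX = SAMPLE_RATE // 2
CLIP_VALUE_MIN = 1e-5
CLIP_VALUE_MAX = 1e8


def hz_to_mel(hz):
    """HTK-style mel scale (assumption: the card does not name the scale)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_filterbank(n_mels, n_fft, sample_rate, fmin, fmax):
    """Triangular filters in the mel domain, shape (n_fft // 2 + 1, n_mels)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fft_mels = hz_to_mel(fft_freqs)[:, None]
    edges = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_mels - lower) / (center - lower)
    falling = (upper - fft_mels) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel_spectrogram(audio):
    """Return a float32 log-mel spectrogram of shape (frames, N_MEL_CHANNELS)."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.shape[0] < WIN_LENGTH:
        # Zero-pad very short inputs so that at least one frame exists.
        audio = np.pad(audio, (0, WIN_LENGTH - audio.shape[0]))
    n_frames = 1 + (audio.shape[0] - WIN_LENGTH) // HOP_LENGTH
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(WIN_LENGTH)[None, :] + HOP_LENGTH * np.arange(n_frames)[:, None]
    # Symmetric Hann window (assumption: the card does not name the window).
    frames = audio[idx] * np.hanning(WIN_LENGTH)
    # rfft zero-pads each 640-sample frame to N_FFT points.
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=-1))
    mel = magnitude @ mel_filterbank(
        N_MEL_CHANNELS, N_FFT, SAMPLE_RATE, MEL_FMIN, MEL_FMAX
    )
    # Clip before the log so silent bins stay finite.
    mel = np.clip(mel, CLIP_VALUE_MIN, CLIP_VALUE_MAX)
    return np.log(mel).astype(np.float32)
```

A one-second waveform at 16 kHz yields 49 frames of 128 mel channels with this framing. The card does not state the decoder's expected input layout (for example, whether a leading batch axis is required), so it is worth inspecting the ONNX model's input signature before passing the result as `input_features`.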