| --- |
| license: mit |
| tags: |
| - audio |
| - vocoder |
| - speech |
| - cvnn |
| - istft |
| - pytorch |
| pipeline_tag: audio-to-audio |
| --- |
| |
| # ComVo: Complex-Valued Neural Vocoder for Waveform Generation |
|
|
| **[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation** |
| Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee |
|
|
| - [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va) |
| - [Audio Samples](https://hs-oh-prml.github.io/ComVo/) |
| - [Code Repository](https://github.com/hs-oh-prml/ComVo) |
|
|
| --- |
|
|
| ## Overview |
|
|
| ComVo is an iSTFT-based neural vocoder for waveform generation. |
| It models complex-valued spectrograms and synthesizes waveforms via the inverse short-time Fourier transform (iSTFT). |
|
|
| Conventional iSTFT-based vocoders typically process real and imaginary components separately. |
| ComVo instead operates in the complex domain, allowing the model to capture structural relationships between magnitude and phase more effectively. |
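
To make the synthesis step concrete, here is a minimal pure-Python sketch of an STFT/iSTFT round trip. It is a naive reference, not ComVo's implementation: the function names, the periodic Hann window, and the two-sided DFT are assumptions chosen for readability. In a vocoder the complex frames would come from the generator; here they come from a forward STFT so the reconstruction can be checked.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_fft) for i in range(n_fft)]


def stft(signal, n_fft, hop):
    """Naive STFT: window each frame, then take a full (two-sided) DFT."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        segment = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append([
            sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n_fft) for t in range(n_fft))
            for k in range(n_fft)
        ])
    return frames


def istft(frames, n_fft, hop):
    """Naive iSTFT: inverse DFT per frame, then windowed overlap-add."""
    window = hann(n_fft)
    length = hop * (len(frames) - 1) + n_fft
    output = [0.0] * length
    norm = [0.0] * length
    for index, spectrum in enumerate(frames):
        for t in range(n_fft):
            sample = sum(
                spectrum[k] * cmath.exp(2j * math.pi * k * t / n_fft) for k in range(n_fft)
            ).real / n_fft
            output[index * hop + t] += sample * window[t]
            norm[index * hop + t] += window[t] ** 2
    # Divide by the summed squared window; samples with no window energy stay at zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(output, norm)]
```

Each frame is a list of complex numbers, so magnitude and phase travel together through the inverse transform. A practical implementation would use an FFT-based routine rather than these quadratic-time loops.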
|
|
| --- |
|
|
|
|
| ## Method |
|
|
| ComVo is built on the following components: |
|
|
| - **Complex-domain modeling** |
| The generator and discriminator operate on complex-valued representations. |
|
|
| - **Adversarial training in the complex domain** |
| The discriminator provides feedback directly on complex spectrograms. |
|
|
| - **Phase quantization** |
| Phase values are discretized to regularize learning and guide phase transformation. |
|
|
| - **Block-matrix computation** |
| A structured computation scheme that reduces redundant operations. |
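
As a rough illustration of the phase quantization item above, the sketch below snaps the phase of a complex value to the nearest of a fixed number of evenly spaced angles while keeping its magnitude. The function name, the default of 8 levels, and the nearest-angle rounding rule are assumptions for illustration; the paper defines the scheme actually used in ComVo, including how gradients are handled during training.

```python
import cmath
import math


def quantize_phase(z, levels=8):
    """Snap the phase of z to the nearest multiple of 2*pi/levels, keeping |z|.

    `levels` is a hypothetical parameter, not a documented ComVo setting.
    """
    if levels < 1:
        raise ValueError("levels must be a positive integer")
    step = 2.0 * math.pi / levels
    magnitude, phase = cmath.polar(z)
    snapped = round(phase / step) * step
    return cmath.rect(magnitude, snapped)
```

For example, with 8 levels the allowed phases are multiples of 45 degrees, so a value with a phase of about 17 degrees is mapped onto the positive real axis with its magnitude unchanged.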
|
|
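
The block-matrix item in the Method list can be related to a standard identity: a complex linear map `(A + iB)(x + iy)` equals the real block product `[[A, -B], [B, A]] @ [x; y]`, so one real matrix product yields both the real and imaginary outputs. The sketch below demonstrates that identity with plain Python lists. Whether ComVo's scheme takes exactly this form is an assumption, and all function names are hypothetical.

```python
def block_matrix(a, b):
    """Build the real block matrix [[A, -B], [B, A]] from real parts A and imaginary parts B."""
    top = [row_a + [-value for value in row_b] for row_a, row_b in zip(a, b)]
    bottom = [row_b + row_a for row_a, row_b in zip(a, b)]
    return top + bottom


def real_matvec(matrix, vector):
    """Plain real-valued matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def complex_matvec_via_block(a, b, x, y):
    """Compute (A + iB)(x + iy) with a single real block product.

    Returns (real_part, imag_part) as two lists.
    """
    stacked = real_matvec(block_matrix(a, b), x + y)  # x + y concatenates the two lists
    rows = len(a)
    return stacked[:rows], stacked[rows:]
```

The result matches Python's built-in complex arithmetic applied row by row, which is a convenient way to check the block layout and its signs.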
| --- |
|
|
|
|
| ## Model Details |
|
|
| - **Architecture**: GAN-based neural vocoder |
| - **Representation**: Complex spectrogram |
| - **Sampling rate**: 24 kHz |
| - **Framework**: PyTorch ≥ 2.0 |
|
|
| --- |
|
|
|
|
| ## Usage |
|
|
| ### Installation |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Inference |
|
|
| ```bash |
| python infer.py \ |
| -c configs/configs.yaml \ |
| --ckpt /path/to/comvo.ckpt \ |
| --wavfile /path/to/input.wav \ |
| --out_dir ./results |
| ``` |
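
To process several files, the command above can be generated once per input. The helper below is a small sketch that only builds argument lists from the flags shown in the command. The helper name is hypothetical, and calling `infer.py` once per `.wav` file is an assumption, since batch options are not documented here.

```python
from pathlib import Path


def build_infer_commands(wav_dir, ckpt, out_dir="./results", config="configs/configs.yaml"):
    """Return one infer.py argument list per .wav file in wav_dir, in sorted order."""
    commands = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        commands.append([
            "python", "infer.py",
            "-c", str(config),
            "--ckpt", str(ckpt),
            "--wavfile", str(wav_path),
            "--out_dir", str(out_dir),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the repository root.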
|
|
| ### Training |
|
|
| ```bash |
| python train.py -c configs/configs.yaml |
| ``` |
| Configuration details are specified in `configs/configs.yaml`. |
|
|
| ## Pretrained Model |
|
|
| A pretrained checkpoint is provided for inference. |
|
|
| - Checkpoint: https://works.do/xM2ttS4 |
| - Configuration: `configs/configs.yaml` |
| - Sampling rate: 24 kHz |
|
|
| Please ensure that the configuration file matches the checkpoint when running inference. |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - The model is trained for 24 kHz audio and may not generalize to other sampling rates |
| - GPU is recommended for efficient inference and training |
| - Complex-valued operations may not be fully supported in all deployment environments |
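
Because of the sampling-rate limitation above, it can help to check an input file before running inference. The sketch below reads the rate from the header with Python's standard `wave` module, which handles uncompressed PCM WAV files only. The function name and the constant are hypothetical and not part of the ComVo code.

```python
import wave

EXPECTED_SAMPLE_RATE = 24_000  # Hz, the rate stated in this model card


def has_expected_sample_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the PCM WAV file at `path` has the expected sampling rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate() == expected
```

Files that fail the check would need to be resampled to 24 kHz with an audio tool of your choice before being passed to the model.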
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{ |
| oh2026toward, |
| title={Toward Complex-Valued Neural Networks for Waveform Generation}, |
| author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee}, |
| booktitle={International Conference on Learning Representations (ICLR)}, |
| year={2026}, |
| url={https://openreview.net/forum?id=U4GXPqm3Va} |
| } |
| ``` |
|
|
| ## Additional Resources |
|
|
| For additional details, please refer to the paper and the project page with audio samples. |