---
license: mit
tags:
- audio
- vocoder
- speech
- cvnn
- istft
- pytorch
pipeline_tag: audio-to-audio
---

# ComVo: Complex-Valued Neural Vocoder for Waveform Generation

**[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation**

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)
- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)
- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)

---

## Overview

ComVo is a neural vocoder for waveform generation based on iSTFT. It models complex-valued spectrograms and synthesizes waveforms via inverse short-time Fourier transform.

Conventional iSTFT-based vocoders typically process real and imaginary components separately. ComVo instead operates in the complex domain, allowing the model to capture structural relationships between magnitude and phase more effectively.

---

## Method

ComVo is built on the following components:

- **Complex-domain modeling**
  The generator and discriminator operate on complex-valued representations.
- **Adversarial training in the complex domain**
  The discriminator provides feedback directly on complex spectrograms.
- **Phase quantization**
  Phase values are discretized to regularize learning and guide phase transformation.
- **Block-matrix computation**
  A structured computation scheme that reduces redundant operations.

---

## Model Details

- **Architecture**: GAN-based neural vocoder
- **Representation**: Complex spectrogram
- **Sampling rate**: 24 kHz
- **Framework**: PyTorch ≥ 2.0

---

## Usage

### Installation

```bash
pip install -r requirements.txt
```

### Inference

```bash
python infer.py \
  -c configs/configs.yaml \
  --ckpt /path/to/comvo.ckpt \
  --wavfile /path/to/input.wav \
  --out_dir ./results
```

### Training

```bash
python train.py -c configs/configs.yaml
```

Configuration details are specified in `configs/configs.yaml`.

## Pretrained Model

A pretrained checkpoint is provided for inference.

- Checkpoint: https://works.do/xM2ttS4
- Configuration: `configs/configs.yaml`
- Sampling rate: 24 kHz

Please ensure that the configuration file matches the checkpoint when running inference.

---

## Limitations

- The model is trained for 24 kHz audio and may not generalize to other sampling rates.
- A GPU is recommended for efficient inference and training.
- Complex-valued operations may not be fully supported in all deployment environments.

---

## Citation

```bibtex
@inproceedings{oh2026toward,
  title={Toward Complex-Valued Neural Networks for Waveform Generation},
  author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=U4GXPqm3Va}
}
```

## Acknowledgements

For additional details, please refer to the paper and the project page with audio samples.
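
## Illustrative Sketch: Phase Quantization

The Method section states that phase values are discretized. The snippet below is a minimal sketch of that general idea, not the ComVo implementation: it assumes uniform quantization of the phase angle into a fixed number of levels while leaving the magnitude untouched. The function name `quantize_phase` and the `levels` parameter are hypothetical and do not come from the repository; the scheme used in the paper may differ.

```python
import cmath
import math


def quantize_phase(z: complex, levels: int = 8) -> complex:
    """Snap the phase of z to the nearest of `levels` uniformly spaced angles.

    Hypothetical helper for illustration only; uniform spacing is an
    assumption, not the scheme described in the ComVo paper.
    """
    if levels < 1:
        raise ValueError("levels must be a positive integer")
    step = 2 * math.pi / levels           # angular distance between levels
    magnitude, phase = cmath.polar(z)     # phase lies in [-pi, pi]
    snapped = round(phase / step) * step  # nearest multiple of `step`
    return cmath.rect(magnitude, snapped)  # magnitude is preserved


if __name__ == "__main__":
    # A single complex spectrogram bin with magnitude 1.5 and phase 1.4 rad.
    spec_bin = cmath.rect(1.5, 1.4)
    quantized = quantize_phase(spec_bin, levels=4)
    print("magnitude:", abs(quantized))
    print("phase (rad):", cmath.phase(quantized))
```

With `levels=4`, every bin is mapped to one of four phases spaced π/2 apart, so the magnitude carries over unchanged and only the angle is coarsened. In an actual model the same operation would be applied element-wise to a complex tensor.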