+---
+license: mit
+tags:
+- audio
+- vocoder
+- speech
+- cvnn
+- istft
+- pytorch
+pipeline_tag: audio-to-audio
+---
+# ComVo: Complex-Valued Neural Vocoder for Waveform Generation
+**[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation**
+Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
+- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)
+- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)
+- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)
+---
+## Overview
+ComVo is a neural vocoder for waveform generation based on iSTFT.
+It models complex-valued spectrograms and synthesizes waveforms via inverse short-time Fourier transform.
+Conventional iSTFT-based vocoders typically process real and imaginary components separately.
+ComVo instead operates directly in the complex domain, treating each time-frequency bin as a single complex value so that the model can capture the structural relationship between magnitude and phase.
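To make the magnitude/phase relationship concrete, the sketch below treats one STFT bin as a single complex number and converts between its rectangular (real, imaginary) and polar (magnitude, phase) views. It uses only the Python standard library, and the value `0.6 + 0.8j` is an arbitrary illustration, not a model output.

```python
import cmath

# One STFT bin as a single complex number (arbitrary illustrative value).
z = complex(0.6, 0.8)

# Polar view: magnitude and phase are both functions of the SAME real and
# imaginary parts, so handling the two parts as independent channels hides
# the coupling between them.
magnitude = abs(z)        # sqrt(re^2 + im^2)
phase = cmath.phase(z)    # atan2(im, re), in radians

# Converting back to rectangular form recovers the original bin.
z_back = cmath.rect(magnitude, phase)
```

Both views carry the same information; the difference is only whether the model sees the bin as one complex value or as two unrelated real values.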
+---
+## Method
+ComVo is built on the following components:
+- **Complex-domain modeling**
+  The generator and discriminator operate on complex-valued representations.
+- **Adversarial training in the complex domain**
+  The discriminator provides feedback directly on complex spectrograms.
+- **Phase quantization**
+  Phase values are discretized to regularize learning and guide phase transformation.
+- **Block-matrix computation**
+  A structured computation scheme that reduces redundant operations.
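As a rough illustration of the last two components, the sketch below shows one plausible form of each, using only the Python standard library. It is not the paper's implementation: the uniform 8-level quantizer, the helper names, and the choice of the standard real block form `[[A, -B], [B, A]]` for a complex matrix `W = A + iB` are all assumptions made for illustration.

```python
import cmath
import math


def quantize_phase(z: complex, levels: int = 8) -> complex:
    """Snap the phase of z to the nearest of `levels` uniform angles, keeping |z|.

    The number of levels is a hypothetical choice for this sketch.
    """
    step = 2.0 * math.pi / levels
    snapped = round(cmath.phase(z) / step) * step
    return cmath.rect(abs(z), snapped)


def real_block_matrix(W):
    """Build the real matrix [[A, -B], [B, A]] from complex W = A + iB."""
    top = [[w.real for w in row] + [-w.imag for w in row] for row in W]
    bottom = [[w.imag for w in row] + [w.real for w in row] for row in W]
    return top + bottom


def block_matvec(W, v):
    """Compute the complex product W @ v with a single real matrix-vector product."""
    M = real_block_matrix(W)
    stacked = [e.real for e in v] + [e.imag for e in v]  # [x; y] for v = x + iy
    out = [sum(m * s for m, s in zip(row, stacked)) for row in M]
    n = len(W)  # first n entries are real parts, last n are imaginary parts
    return [complex(out[i], out[n + i]) for i in range(n)]
```

The block form expresses a complex linear map as one real matrix product instead of four separate real products that are combined afterwards, which is one way a structured scheme can avoid redundant operations.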
+---
+## Model Details
+- **Architecture**: GAN-based neural vocoder
+- **Representation**: Complex spectrogram
+- **Sampling rate**: 24 kHz
+- **Framework**: PyTorch ≥ 2.0
+---
+## Usage
+### Installation
+```bash
+pip install -r requirements.txt
+```
+### Inference
+```bash
+python infer.py \
+  -c configs/configs.yaml \
+  --ckpt /path/to/comvo.ckpt \
+  --wavfile /path/to/input.wav \
+  --out_dir ./results
+```
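To apply the same command to a folder of recordings, a small wrapper can build the argument list for each file. The sketch below uses only the flags shown above; the helper names `build_infer_command` and `run_directory` and the directory layout are hypothetical, and it assumes the script is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_infer_command(wav_path, ckpt, config="configs/configs.yaml", out_dir="./results"):
    """Return the infer.py argument list for one input file."""
    return [
        sys.executable, "infer.py",
        "-c", str(config),
        "--ckpt", str(ckpt),
        "--wavfile", str(wav_path),
        "--out_dir", str(out_dir),
    ]


def run_directory(wav_dir, ckpt, **kwargs):
    """Invoke infer.py once per .wav file found in wav_dir."""
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        # check=True stops the loop as soon as one invocation fails.
        subprocess.run(build_infer_command(wav_path, ckpt, **kwargs), check=True)
```

For example, `run_directory("/path/to/wavs", "/path/to/comvo.ckpt")` would process every `.wav` file in that folder in sorted order.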
+### Training
+```bash
+python train.py -c configs/configs.yaml
+```
+Configuration details are specified in `configs/configs.yaml`.
+## Pretrained Model
+A pretrained checkpoint is provided for inference.
+- Checkpoint: https://works.do/xM2ttS4
+- Configuration: `configs/configs.yaml`
+- Sampling rate: 24 kHz
+Please ensure that the configuration file matches the checkpoint when running inference.
+---
+## Limitations
+- The model is trained for 24 kHz audio and may not generalize to other sampling rates
+- A GPU is recommended for efficient inference and training
+- Complex-valued operations may not be fully supported in all deployment environments
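Because of the 24 kHz constraint, it can help to check an input file's sampling rate before running inference. The sketch below does this with the standard-library `wave` module, which reads PCM WAV files only; the helper name `has_expected_sample_rate` is hypothetical and is not part of the repository.

```python
import wave

EXPECTED_SAMPLE_RATE = 24_000  # Hz, the rate stated for the pretrained model


def has_expected_sample_rate(wav_path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the PCM WAV file at wav_path is sampled at `expected` Hz."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate() == expected
```

Files that fail the check would need to be resampled to 24 kHz with an audio tool of your choice before being passed to the model.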
+---
+## Citation
+```bibtex
+@inproceedings{
+  oh2026toward,
+  title={Toward Complex-Valued Neural Networks for Waveform Generation},
+  author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
+  booktitle={International Conference on Learning Representations (ICLR)},
+  year={2026},
+  url={https://openreview.net/forum?id=U4GXPqm3Va}
+}
+```
+## Additional Resources
+For additional details, please refer to the paper and the project page with audio samples.