---
license: mit
tags:
- audio
- vocoder
- speech
- cvnn
- istft
- pytorch
pipeline_tag: audio-to-audio
---
# ComVo: Complex-Valued Neural Vocoder for Waveform Generation
**[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation**
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)
- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)
- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)
---
## Overview
ComVo is an iSTFT-based neural vocoder for waveform generation.
It models complex-valued spectrograms directly and synthesizes waveforms via the inverse short-time Fourier transform (iSTFT).
Conventional iSTFT-based vocoders typically process the real and imaginary components as separate real-valued channels.
ComVo instead operates in the complex domain, which allows the model to capture the structural relationship between magnitude and phase more effectively.
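The core idea of iSTFT-based synthesis can be illustrated with a minimal sketch (not ComVo's actual pipeline): a model predicts a complex spectrogram, and the waveform is recovered with `torch.istft`. Here we simply invert a ground-truth spectrogram to show the round trip.

```python
import torch

# Illustrative sketch of iSTFT-based synthesis; in ComVo the complex
# spectrogram would be predicted by the network rather than computed here.
n_fft, hop = 1024, 256
wav = torch.randn(1, 24000)  # 1 second of audio at 24 kHz
window = torch.hann_window(n_fft)

# Forward STFT yields a complex-valued spectrogram (the representation ComVo models).
spec = torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)

# Inverse STFT recovers the waveform from the complex spectrogram.
recon = torch.istft(spec, n_fft, hop_length=hop, window=window, length=wav.shape[-1])
print(torch.allclose(wav, recon, atol=1e-4))  # near-perfect reconstruction
```

Because the Hann window with a quarter-window hop satisfies the constant-overlap-add condition, the round trip is lossless up to floating-point precision; a vocoder's task reduces to predicting a good complex spectrogram.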
---
## Method
ComVo is built on the following components:
- **Complex-domain modeling**
The generator and discriminator operate on complex-valued representations.
- **Adversarial training in the complex domain**
The discriminator provides feedback directly on complex spectrograms.
- **Phase quantization**
Phase values are discretized to regularize learning and guide phase transformation.
- **Block-matrix computation**
A structured computation scheme that reduces redundant operations.
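The block-matrix idea can be sketched as follows: a complex matrix product (A + iB)(u + iv) is algebraically equivalent to a real block-matrix product [[A, -B], [B, A]] @ [u; v]. This is a standard illustration of the general technique; the paper's exact computation scheme may differ.

```python
import torch

# Complex matmul expressed as a real block-matrix product (illustrative only).
A = torch.randn(3, 3)   # real part of the weights
B = torch.randn(3, 3)   # imaginary part of the weights
u = torch.randn(3)      # real part of the input
v = torch.randn(3)      # imaginary part of the input

# Direct complex computation: (A + iB)(u + iv)
W = torch.complex(A, B)
x = torch.complex(u, v)
y = W @ x

# Equivalent real block-matrix form: [[A, -B], [B, A]] @ [u; v]
block = torch.cat([torch.cat([A, -B], dim=1),
                   torch.cat([B,  A], dim=1)], dim=0)
y_block = block @ torch.cat([u, v])

print(torch.allclose(y.real, y_block[:3], atol=1e-5))  # True
print(torch.allclose(y.imag, y_block[3:], atol=1e-5))  # True
```

Structuring the computation this way exposes the shared sub-blocks (A and B each appear twice), which is what makes it possible to avoid redundant operations.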
---
## Model Details
- **Architecture**: GAN-based neural vocoder
- **Representation**: Complex spectrogram
- **Sampling rate**: 24 kHz
- **Framework**: PyTorch ≥ 2.0
---
## Usage
### Installation
```bash
pip install -r requirements.txt
```
### Inference
```bash
python infer.py \
  -c configs/configs.yaml \
  --ckpt /path/to/comvo.ckpt \
  --wavfile /path/to/input.wav \
  --out_dir ./results
```
### Training
```bash
python train.py -c configs/configs.yaml
```
Configuration details are specified in `configs/configs.yaml`.
## Pretrained Model
A pretrained checkpoint is provided for inference.
- Checkpoint: https://works.do/xM2ttS4
- Configuration: `configs/configs.yaml`
- Sampling rate: 24 kHz
Please ensure that the configuration file matches the checkpoint when running inference.
---
## Limitations
- The model is trained for 24 kHz audio and may not generalize to other sampling rates
- GPU is recommended for efficient inference and training
- Complex-valued operations may not be fully supported in all deployment environments
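Since complex-tensor support varies across backends, a quick runtime probe can confirm that the complex iSTFT path works before deploying. This is a sanity check written for this note, not part of the ComVo codebase.

```python
import torch

def supports_complex_istft() -> bool:
    """Probe whether this runtime supports the complex-tensor iSTFT path
    an iSTFT-based vocoder relies on (deployment sanity check)."""
    try:
        spec = torch.randn(513, 10, dtype=torch.complex64)  # (freq_bins, frames)
        torch.istft(spec, n_fft=1024, hop_length=256,
                    window=torch.hann_window(1024))
        return True
    except (RuntimeError, NotImplementedError):
        return False

print(supports_complex_istft())
```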
---
## Citation
```bibtex
@inproceedings{oh2026toward,
  title     = {Toward Complex-Valued Neural Networks for Waveform Generation},
  author    = {Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=U4GXPqm3Va}
}
```
## Additional Information
For additional details, please refer to the paper and the project page with audio samples.