---
license: mit
tags:
- audio
- vocoder
- speech
- cvnn
- istft
- pytorch
pipeline_tag: audio-to-audio
---
# ComVo: Complex-Valued Neural Vocoder for Waveform Generation
**[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation**
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)
- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)
- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)
---
## Overview
ComVo is an iSTFT-based neural vocoder for waveform generation.
It models complex-valued spectrograms and synthesizes waveforms via the inverse short-time Fourier transform (iSTFT).
Conventional iSTFT-based vocoders typically process real and imaginary components separately.
ComVo instead operates in the complex domain, allowing the model to capture structural relationships between magnitude and phase more effectively.
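As a minimal illustration of the iSTFT synthesis step (a toy signal round-trip, not the ComVo model itself; `n_fft` and `hop` are illustrative values, not ComVo's configuration):

```python
import torch

# A complex spectrogram is converted back to a waveform with the inverse
# STFT -- the final synthesis step of iSTFT-based vocoders.
n_fft, hop = 1024, 256
wav = torch.randn(24000)  # 1 second of audio at 24 kHz

window = torch.hann_window(n_fft)
spec = torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)
# spec has shape (n_fft // 2 + 1, n_frames) and dtype complex64

recon = torch.istft(spec, n_fft, hop_length=hop, window=window, length=wav.numel())
assert torch.allclose(wav, recon, atol=1e-4)  # near-perfect reconstruction
```

In ComVo, the model predicts such a complex spectrogram directly, and the iSTFT is the only step needed to obtain the waveform.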
---
## Method
ComVo is built on the following components:
- **Complex-domain modeling**
The generator and discriminator operate on complex-valued representations.
- **Adversarial training in the complex domain**
The discriminator provides feedback directly on complex spectrograms.
- **Phase quantization**
Phase values are discretized to regularize learning and guide phase transformation.
- **Block-matrix computation**
A structured computation scheme that reduces redundant operations.
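The phase-quantization idea above can be sketched as follows. This is a hedged illustration only: `quantize_phase` and the bin count `n_bins` are hypothetical names and values, not taken from the ComVo implementation.

```python
import math
import torch

def quantize_phase(spec: torch.Tensor, n_bins: int = 32) -> torch.Tensor:
    """Snap the phase of a complex spectrogram to n_bins discrete levels,
    leaving the magnitude unchanged."""
    mag = spec.abs()
    phase = spec.angle()          # continuous phase in (-pi, pi]
    step = 2 * math.pi / n_bins
    q_phase = torch.round(phase / step) * step  # nearest discrete level
    return torch.polar(mag, q_phase)            # rebuild complex values

spec = torch.randn(513, 100, dtype=torch.complex64)
q = quantize_phase(spec)
assert torch.allclose(q.abs(), spec.abs(), atol=1e-4)  # magnitude preserved
```

Discretizing the phase this way restricts it to a fixed codebook of angles, which is one plausible reading of how quantization can regularize phase learning.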
---
## Model Details
- **Architecture**: GAN-based neural vocoder
- **Representation**: Complex spectrogram
- **Sampling rate**: 24 kHz
- **Framework**: PyTorch ≥ 2.0
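The block-matrix computation mentioned in the Method section can be sketched in PyTorch. A complex linear map `W = A + iB` applied to `z = x + iy` equals a single real matrix product with the block matrix `[[A, -B], [B, A]]`; shapes and names here are illustrative, not ComVo's actual layers.

```python
import torch

A = torch.randn(4, 4)   # real part of the complex weight
B = torch.randn(4, 4)   # imaginary part of the complex weight
z = torch.randn(4, dtype=torch.complex64)

# Reference: native complex matrix-vector product
W = torch.complex(A, B)
ref = W @ z

# Block-matrix form over stacked real/imaginary parts:
# (A + iB)(x + iy) = (Ax - By) + i(Bx + Ay)
block = torch.cat([torch.cat([A, -B], dim=1),
                   torch.cat([B, A], dim=1)], dim=0)
out = block @ torch.cat([z.real, z.imag])
recon = torch.complex(out[:4], out[4:])

assert torch.allclose(ref, recon, atol=1e-5)  # both forms agree
```

Structuring the computation this way lets complex operations reuse ordinary real-valued kernels, which is one way redundant operations can be reduced.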
---
## Usage
### Installation
```bash
pip install -r requirements.txt
```
### Inference
```bash
python infer.py \
  -c configs/configs.yaml \
  --ckpt /path/to/comvo.ckpt \
  --wavfile /path/to/input.wav \
  --out_dir ./results
```
### Training
```bash
python train.py -c configs/configs.yaml
```
Configuration details are specified in `configs/configs.yaml`.
## Pretrained Model
A pretrained checkpoint is provided for inference.
- Checkpoint: https://works.do/xM2ttS4
- Configuration: `configs/configs.yaml`
- Sampling rate: 24 kHz
Please ensure that the configuration file matches the checkpoint when running inference.
---
## Limitations
- The model is trained for 24 kHz audio and may not generalize to other sampling rates
- GPU is recommended for efficient inference and training
- Complex-valued operations may not be fully supported in all deployment environments
---
## Citation
```bibtex
@inproceedings{oh2026toward,
title={Toward Complex-Valued Neural Networks for Waveform Generation},
author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://openreview.net/forum?id=U4GXPqm3Va}
}
```
## Acknowledgements
For additional details, please refer to the paper and the project page with audio samples.