---
license: mit
tags:
- audio
- vocoder
- speech
- cvnn
- istft
- pytorch
pipeline_tag: audio-to-audio
---

# ComVo: Complex-Valued Neural Vocoder for Waveform Generation

**[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation**  
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)  
- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)  
- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)

---

## Overview

ComVo is an iSTFT-based neural vocoder for waveform generation.  
It models complex-valued spectrograms and synthesizes waveforms via the inverse short-time Fourier transform (iSTFT).

Conventional iSTFT-based vocoders typically process real and imaginary components separately.  
ComVo instead operates in the complex domain, allowing the model to capture structural relationships between magnitude and phase more effectively.
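
The synthesis path can be sketched as follows: the model predicts a complex spectrogram, and a single iSTFT call turns it into a waveform. This is a minimal illustration using `torch.stft`/`torch.istft`; the STFT parameters here (`n_fft=1024`, `hop=256`) are assumptions for the sketch, not the values from `configs/configs.yaml`.

```python
import torch

# Assumed STFT settings for illustration; ComVo's actual values live
# in configs/configs.yaml.
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)

# Stand-in for the model output: a complex-valued spectrogram obtained
# here by analyzing one second of 24 kHz noise.
waveform = torch.randn(24000)
spec = torch.stft(waveform, n_fft, hop_length=hop, window=window,
                  return_complex=True)  # shape: (freq_bins, frames)

# The vocoder predicts such a complex spectrogram; iSTFT converts it
# back to a waveform in one deterministic step.
recon = torch.istft(spec, n_fft, hop_length=hop, window=window,
                    length=waveform.shape[-1])
```

Because magnitude and phase live together in the complex values, no separate phase-reconstruction stage (e.g. Griffin-Lim) is needed.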

---


## Method

ComVo is built on the following components:

- **Complex-domain modeling**  
  The generator and discriminator operate on complex-valued representations.

- **Adversarial training in the complex domain**  
  The discriminator provides feedback directly on complex spectrograms.

- **Phase quantization**  
  Phase values are discretized to regularize learning and guide phase transformation.

- **Block-matrix computation**  
  A structured computation scheme that reduces redundant operations.
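
As a hypothetical illustration of the phase-quantization idea (the paper's exact scheme may differ), phase angles can be snapped to a fixed number of discrete levels over the unit circle while magnitudes are kept intact:

```python
import torch

def quantize_phase(spec: torch.Tensor, levels: int = 16) -> torch.Tensor:
    """Snap the phase of a complex spectrogram to `levels` discrete bins.

    Illustrative only; the number of levels and the uniform binning are
    assumptions, not the configuration used by ComVo.
    """
    mag = spec.abs()
    phase = spec.angle()                        # angles in (-pi, pi]
    step = 2 * torch.pi / levels
    q_phase = torch.round(phase / step) * step  # nearest quantization level
    return torch.polar(mag, q_phase)            # rebuild complex values

# Example: quantize the phase of a random complex spectrogram.
spec = torch.randn(513, 100, dtype=torch.complex64)
q = quantize_phase(spec, levels=16)
```

Discretizing phase this way constrains the target space the model must learn, which is the stated motivation for using it as a regularizer.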

---


## Model Details

- **Architecture**: GAN-based neural vocoder  
- **Representation**: Complex spectrogram  
- **Sampling rate**: 24 kHz  
- **Framework**: PyTorch ≥ 2.0  

---


## Usage

### Installation

```bash
pip install -r requirements.txt
```

### Inference

```bash
python infer.py \
  -c configs/configs.yaml \
  --ckpt /path/to/comvo.ckpt \
  --wavfile /path/to/input.wav \
  --out_dir ./results
```

### Training

```bash
python train.py -c configs/configs.yaml
```
Configuration details are specified in `configs/configs.yaml`.

## Pretrained Model

A pretrained checkpoint is provided for inference.

- Checkpoint: https://works.do/xM2ttS4  
- Configuration: `configs/configs.yaml`  
- Sampling rate: 24 kHz  

Please ensure that the configuration file matches the checkpoint when running inference.

---

## Limitations

- The model is trained for 24 kHz audio and may not generalize to other sampling rates  
- GPU is recommended for efficient inference and training  
- Complex-valued operations may not be fully supported in all deployment environments  

---

## Citation

```bibtex
@inproceedings{oh2026toward,
  title={Toward Complex-Valued Neural Networks for Waveform Generation},
  author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://openreview.net/forum?id=U4GXPqm3Va}
}
```

## Acknowledgements

For additional details, please refer to the paper and the project page with audio samples.