---
tags:
- audio
- vocoder
- pytorch
- neural-audio
- complex-valued
library_name: pytorch
---

# ComVo: Complex-Valued Neural Vocoder

## Model description

ComVo is a complex-valued neural vocoder that generates waveforms via the inverse short-time Fourier transform (iSTFT).  
Unlike conventional real-valued vocoders that process real and imaginary parts separately, ComVo operates directly in the complex domain using native complex arithmetic.

This enables:
- Structured modeling of complex spectrograms
- Adversarial training in the complex domain
- Improved waveform synthesis quality
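
To illustrate what native complex arithmetic offers over treating real and imaginary parts as two unrelated channels, the sketch below multiplies one spectrogram bin by one complex weight using Python's built-in `complex` type. It is not ComVo code: `apply_complex_weight` and the numeric values are illustrative assumptions. The point is only that a single complex weight scales magnitude and rotates phase together.

```python
import cmath
import math

def apply_complex_weight(z: complex, gain: float, rotation: float) -> complex:
    """Multiply bin z by the complex weight gain * exp(1j * rotation)."""
    w = cmath.rect(gain, rotation)  # polar form -> rectangular complex number
    return w * z

z = complex(1.0, 1.0)  # one STFT bin (illustrative value, not real data)
out = apply_complex_weight(z, gain=2.0, rotation=math.pi / 2)

# The magnitude ratio is close to the gain (2.0) and the phase
# difference is close to the rotation (pi / 2).
print(abs(out) / abs(z))
print(cmath.phase(out) - cmath.phase(z))
```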

The model also introduces:
- Phase quantization for structured phase modeling
- Block-matrix computation for improved training efficiency
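
Both ideas can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation. `quantize_phase` snaps a phase to a uniform grid of `levels` angles while keeping the magnitude; the actual quantization scheme used by ComVo is not described in this card. `complex_matvec_as_block` relies on the standard identity that the complex product `(A + iB)(x + iy)` equals the real block matrix `[[A, -B], [B, A]]` applied to the stacked vector `[x; y]`. All function names here are hypothetical.

```python
import cmath
import math

def quantize_phase(z: complex, levels: int) -> complex:
    """Keep |z| and snap the phase of z to the nearest of `levels` uniform angles."""
    step = 2 * math.pi / levels
    snapped = round(cmath.phase(z) / step) * step
    return cmath.rect(abs(z), snapped)

def complex_matvec(W, v):
    """Reference: matrix-vector product using native complex numbers."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def complex_matvec_as_block(W, v):
    """Same product computed with real numbers via [[A, -B], [B, A]] @ [x; y]."""
    A = [[w.real for w in row] for row in W]
    B = [[w.imag for w in row] for row in W]
    top = [ra + [-b for b in rb] for ra, rb in zip(A, B)]   # rows of [A | -B]
    bottom = [rb + ra for ra, rb in zip(A, B)]              # rows of [B |  A]
    stacked = [x.real for x in v] + [x.imag for x in v]     # [x; y]
    out = [sum(m * s for m, s in zip(row, stacked)) for row in top + bottom]
    rows = len(W)
    return [complex(out[i], out[rows + i]) for i in range(rows)]

W = [[1 + 2j, 0.5 - 1j], [-1j, 3 + 0j]]   # illustrative weights
v = [2 - 1j, 1 + 1j]                      # illustrative input
native = complex_matvec(W, v)
blocked = complex_matvec_as_block(W, v)

q = quantize_phase(cmath.rect(2.0, 1.0), levels=8)  # phase 1.0 rad snaps to pi / 4
```

The block form does the same arithmetic as the native product but as one real matrix multiply, which is the kind of reformulation that can map well onto real-valued GPU kernels.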

## Paper

**Toward Complex-Valued Neural Networks for Waveform Generation**  
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee  
ICLR 2026  

https://openreview.net/forum?id=U4GXPqm3Va

## Intended use

This model is designed for:
- Neural vocoding
- Speech synthesis pipelines (e.g., TTS)
- Audio waveform reconstruction from spectral features

### Input
- Raw waveform tensor of shape `[1, T]`, or features produced by `model.build_feature_extractor()`

### Output
- Generated waveform at 24 kHz

## Usage

### Load model

```python
from hf_model import ComVoHF

model = ComVoHF.from_pretrained("hsoh/ComVo-base")
model.eval()
```

### Inference from waveform

```python
# wav: waveform tensor of shape [1, T] (see Input)
audio = model.from_waveform(wav)
```

### Inference from features

```python
# Build the feature extractor, compute features from wav, then vocode them
features = model.build_feature_extractor()(wav)
audio = model(features)
```
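
Once a waveform has been generated, a common next step is writing it to disk at the model's 24 kHz output rate. The sketch below uses only the Python standard library and assumes the samples are floats in [-1, 1], which this card does not state. `save_wav_24k` is a hypothetical helper, not part of the ComVo code. If `audio` is a PyTorch tensor of shape `[1, T]`, it would first need converting to a flat list, for example with `audio.squeeze().tolist()`.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # output rate stated in this model card

def save_wav_24k(path: str, samples: list) -> None:
    """Write mono float samples (assumed in [-1, 1]) as 16-bit PCM at 24 kHz."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]      # guard against overflow
    ints = [int(round(s * 32767)) for s in clipped]          # float -> int16 range
    pcm = struct.pack(f"<{len(ints)}h", *ints)               # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm)

# Stand-in for model output: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "comvo_demo.wav")
save_wav_24k(out_path, tone)
```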

## Model details

| Model | Parameters | Sampling rate |
| ----- | ---------- | ------------- |
| Base  | 13.28M     | 24 kHz        |
| Large | 114.56M    | 24 kHz        |

## Evaluation

| Model | UTMOS ↑ | PESQ (wb) ↑ | PESQ (nb) ↑ | MRSTFT ↓ |
| ----- | ------- | ----------- | ----------- | -------- |
| Base  | 3.6744  | 3.8219      | 4.0727      | 0.8580   |
| Large | 3.7618  | 3.9993      | 4.1639      | 0.8227   |

## Resources

Paper: https://openreview.net/forum?id=U4GXPqm3Va

Demo: https://hs-oh-prml.github.io/ComVo/

Code: https://github.com/hs-oh-prml/ComVo

## Citation
```bibtex
@inproceedings{
  oh2026toward,
  title={Toward Complex-Valued Neural Networks for Waveform Generation},
  author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
  booktitle={ICLR},
  year={2026}
}
```