	---
	license: mit
	tags:
	- audio
	- vocoder
	- speech
	- cvnn
	- istft
	- pytorch
	pipeline_tag: audio-to-audio
	---

	# ComVo: Complex-Valued Neural Vocoder for Waveform Generation

	[ICLR 2026] Toward Complex-Valued Neural Networks for Waveform Generation

	Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

	- 📄 [OpenReview Paper](https://openreview.net/forum?id=U4GXPqm3Va)
	- 🔊 [Audio Samples](https://hs-oh-prml.github.io/ComVo/)
	- 💻 [Code Repository](https://github.com/hs-oh-prml/ComVo)

	---

	## Overview

	ComVo is a neural vocoder for waveform generation based on iSTFT.
	It models complex-valued spectrograms and synthesizes waveforms via inverse short-time Fourier transform.

	Conventional iSTFT-based vocoders typically process real and imaginary components separately.
	ComVo instead operates in the complex domain, allowing the model to capture structural relationships between magnitude and phase more effectively.
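To make the complex-spectrogram view concrete, the following minimal sketch runs an STFT/iSTFT round trip with PyTorch's built-in functions. It is not ComVo's code: the `n_fft` and `hop_length` values and the sine test signal are assumptions chosen for illustration, and the real settings are whatever the configuration file specifies.

```python
import math
import torch

# Illustrative STFT settings (assumed); ComVo's actual values live in its config file.
n_fft, hop_length = 1024, 256
sample_rate = 24000
window = torch.hann_window(n_fft)

# One second of a 440 Hz sine as a stand-in for speech.
t = torch.arange(sample_rate) / sample_rate
audio = torch.sin(2 * math.pi * 440.0 * t)

# Complex spectrogram with shape (n_fft // 2 + 1, frames).
spec = torch.stft(audio, n_fft, hop_length=hop_length, window=window,
                  return_complex=True)

# Magnitude and phase are two views of the same complex value.
magnitude, phase = spec.abs(), spec.angle()
rebuilt = torch.polar(magnitude, phase)

# The inverse STFT maps the complex spectrogram back to a waveform.
recon = torch.istft(rebuilt, n_fft, hop_length=hop_length, window=window,
                    length=audio.numel())
max_err = (recon - audio).abs().max().item()
```

Each bin in `spec` is a single complex number, so magnitude and phase are coupled by construction. A model that predicts real and imaginary parts as independent channels has to learn that coupling on its own.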

	---


	## Method

	ComVo is built on the following components:

	- Complex-domain modeling: the generator and discriminator both operate on complex-valued representations.

	- Adversarial training in the complex domain: the discriminator provides feedback directly on complex spectrograms.

	- Phase quantization: phase values are discretized to regularize learning and guide phase transformation.

	- Block-matrix computation: a structured computation scheme reduces redundant operations in complex-valued layers.
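Two of the components above can be illustrated generically. The sketch below is not the authors' implementation, and the helper names `complex_matmul_block` and `quantize_phase` are hypothetical. It shows the standard identity that a complex matrix product with weight A + iB equals one real product with the block matrix [[A, -B], [B, A]], and a simple uniform phase quantizer that keeps magnitudes unchanged. The number of quantization levels is an assumption.

```python
import math
import torch

def complex_matmul_block(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Compute weight @ x for complex tensors with a single real matmul."""
    a, b = weight.real, weight.imag
    # Real block matrix [[A, -B], [B, A]] of shape (2 * out, 2 * in).
    block = torch.cat([torch.cat([a, -b], dim=1),
                       torch.cat([b, a], dim=1)], dim=0)
    # Stack real and imaginary parts of the input along the feature axis.
    stacked = torch.cat([x.real, x.imag], dim=0)
    out = block @ stacked
    real, imag = out.chunk(2, dim=0)
    return torch.complex(real, imag)

def quantize_phase(z: torch.Tensor, levels: int) -> torch.Tensor:
    """Snap the phase of z to `levels` uniform steps, keeping the magnitude."""
    step = 2 * math.pi / levels
    snapped = torch.round(z.angle() / step) * step
    return torch.polar(z.abs(), snapped)

torch.manual_seed(0)
weight = torch.randn(4, 3, dtype=torch.complex64)
x = torch.randn(3, 5, dtype=torch.complex64)

y_block = complex_matmul_block(weight, x)   # block-matrix route
y_ref = weight @ x                          # native complex matmul
z_q = quantize_phase(x, levels=8)           # 8 levels is an arbitrary choice
```

The block form replaces four separate real products (A·u, B·v, B·u, A·v) with one larger real product, which is one plausible reading of "reducing redundant operations". The paper is the reference for the scheme ComVo actually uses.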

	---


	## Model Details

	- Architecture: GAN-based neural vocoder
	- Representation: Complex spectrogram
	- Sampling rate: 24 kHz
	- Framework: PyTorch ≥ 2.0

	---


	## Usage

	### Installation

	```bash
	pip install -r requirements.txt
	```

	### Inference

	```bash
	python infer.py \
	-c configs/configs.yaml \
	--ckpt /path/to/comvo.ckpt \
	--wavfile /path/to/input.wav \
	--out_dir ./results
	```
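To vocode several files, the command above can be wrapped in a short script. This is a minimal sketch that assumes it runs from the repository root and that `infer.py` accepts exactly the flags shown above. The helpers `build_infer_cmd` and `run_batch` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

def build_infer_cmd(wavfile, ckpt, config="configs/configs.yaml", out_dir="./results"):
    """Return the argv list for one infer.py call, using the flags shown above."""
    return [sys.executable, "infer.py",
            "-c", str(config),
            "--ckpt", str(ckpt),
            "--wavfile", str(wavfile),
            "--out_dir", str(out_dir)]

def run_batch(wav_dir, ckpt, **kwargs):
    """Run infer.py once per .wav file in wav_dir (hypothetical helper)."""
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        # check=True stops the batch on the first failing file.
        subprocess.run(build_infer_cmd(wav, ckpt, **kwargs), check=True)
```

For example, `run_batch("/path/to/wavs", "/path/to/comvo.ckpt")` would process every `.wav` file in that directory in sorted order.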

	### Training

	```bash
	python train.py -c configs/configs.yaml
	```
	Configuration details are specified in `configs/configs.yaml`.

	## Pretrained Model

	A pretrained checkpoint is provided for inference.

	- Checkpoint: https://works.do/xM2ttS4
	- Configuration: `configs/configs.yaml`
	- Sampling rate: 24 kHz

	Please ensure that the configuration file matches the checkpoint when running inference.
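Because the checkpoint targets 24 kHz audio, it can help to check an input file's sampling rate before running inference. The sketch below uses only Python's standard `wave` module, so it covers plain PCM WAV files only. The helper names are hypothetical, and whether `infer.py` resamples internally is not stated here.

```python
import wave

EXPECTED_SR = 24000  # sampling rate stated in this model card

def wav_sample_rate(path: str) -> int:
    """Read the sampling rate from a PCM WAV header."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

def check_sample_rate(path: str, expected: int = EXPECTED_SR) -> None:
    """Raise ValueError if the file's sampling rate differs from `expected`."""
    sr = wav_sample_rate(path)
    if sr != expected:
        raise ValueError(
            f"{path}: sampling rate is {sr} Hz, expected {expected} Hz; resample first"
        )
```

Calling `check_sample_rate("/path/to/input.wav")` returns silently for a 24 kHz file and raises `ValueError` otherwise.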

	---

	## Limitations

	- The model is trained on 24 kHz audio and may not generalize to other sampling rates.
	- A GPU is recommended for efficient inference and training.
	- Complex-valued operations may not be fully supported in all deployment environments.
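Where a runtime lacks complex dtypes, one common workaround is to carry real and imaginary parts in a trailing dimension of size 2. The sketch below shows PyTorch's built-in conversions. The tensor shape is an arbitrary example, and whether a given export target handles ComVo's layers in this form is an assumption that would need testing.

```python
import torch

# Example complex spectrogram batch: (batch, frequency bins, frames).
spec = torch.randn(2, 513, 10, dtype=torch.complex64)

# Real view with a trailing dimension of size 2: [..., 0] is real, [..., 1] is imaginary.
as_real = torch.view_as_real(spec)

# Convert back to a complex tensor without copying data.
restored = torch.view_as_complex(as_real)
```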

	---

	## Citation

	```bibtex
	@inproceedings{
	oh2026toward,
	title={Toward Complex-Valued Neural Networks for Waveform Generation},
	author={Hyung-Seok Oh and Deok-Hyeon Cho and Seung-Bin Kim and Seong-Whan Lee},
	booktitle={International Conference on Learning Representations (ICLR)},
	year={2026},
	url={https://openreview.net/forum?id=U4GXPqm3Va}
	}
	```

	## Additional Information

	For additional details, please refer to the paper and the project page with audio samples.