BigVGAN Vocoder for Teochew TTS
Model Description
This model is a vocoder for Teochew speech synthesis based on the BigVGAN-v2 architecture, designed to reconstruct high-quality audio waveforms from mel-spectrograms. It was fine-tuned from pre-trained weights using NVIDIA's BigVGAN training code.
Pre-trained Weights
- Base model: nvidia/bigvgan_v2_22khz_80band_256x
Training Data
- Dataset: teochew-extLa
- Filter criteria: DNSMOS > 2.8
- Data size: ~283,000 audio clips
- Total duration: 555+ hours
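The filtering step above can be sketched as follows. This is a minimal illustration, not code from the training pipeline: the metadata layout and the assumption that the overall (OVRL) DNSMOS score is the filtered quantity are ours, since the card does not specify which DNSMOS component was thresholded.

```python
# Hypothetical sketch of DNSMOS-based data filtering: keep only clips
# whose (assumed) OVRL score exceeds the 2.8 threshold from the card.
def filter_by_dnsmos(clips, threshold=2.8):
    """Return the subset of clips scoring above the DNSMOS threshold."""
    return [c for c in clips if c["dnsmos_ovrl"] > threshold]

clips = [
    {"path": "a.wav", "dnsmos_ovrl": 3.1},
    {"path": "b.wav", "dnsmos_ovrl": 2.5},  # rejected: below threshold
    {"path": "c.wav", "dnsmos_ovrl": 2.9},
]
kept = filter_by_dnsmos(clips)
print([c["path"] for c in kept])  # → ['a.wav', 'c.wav']
```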
Training Configuration
Hardware:
- 4 × V100 GPUs
- Batch size: 4
Training iterations:
- Iterations: 400,000 fine-tuning steps (~5-6 epochs; training was slow). Note that checkpoint names such as g_05330000 continue the iteration counter of the pretrained weights rather than starting from zero.
Training duration: ~10 days
Hyperparameters:
- Learning rate: 0.0001
- Adam β1: 0.8
- Adam β2: 0.99
- LR decay: 0.9999996
- Gradient clip norm: 500
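The exponential LR decay factor looks negligible, but it compounds over the run. A quick arithmetic check of the effective learning rate after the ~400,000 fine-tuning iterations, assuming the decay is applied once per iteration (which the very-close-to-1 factor suggests):

```python
import math

lr0 = 1e-4         # initial learning rate from the card
gamma = 0.9999996  # LR decay factor, assumed per-iteration
steps = 400_000    # fine-tuning iterations

lr_final = lr0 * gamma ** steps
print(f"{lr_final:.4e}")  # → 8.5214e-05, i.e. the LR drops only ~15% over the run
```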
Model Architecture
Generator config:
{
"upsample_rates": [4, 4, 2, 2, 2, 2],
"upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
"upsample_initial_channel": 1536,
"resblock": "1",
"resblock_kernel_sizes": [3, 7, 11],
"resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
"activation": "snakebeta",
"snake_logscale": true,
"use_tanh_at_final": false,
"use_bias_at_final": false
}
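A consistency check worth noting: the generator's cumulative upsampling factor must equal the mel hop size (256 samples per frame), which is also where the "256x" in the base model's name comes from:

```python
from math import prod

# Generator upsampling stages and mel hop size, from the configs above.
upsample_rates = [4, 4, 2, 2, 2, 2]
hop_size = 256

# Each mel frame must expand into exactly hop_size waveform samples.
assert prod(upsample_rates) == hop_size  # 4*4*2*2*2*2 = 256
print(prod(upsample_rates))  # → 256
```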
Discriminator config:
{
"use_cqtd_instead_of_mrd": true,
"cqtd_filters": 128,
"cqtd_max_filters": 1024,
"cqtd_filters_scale": 1,
"cqtd_dilations": [1, 2, 4],
"cqtd_hop_lengths": [512, 256, 256],
"cqtd_n_octaves": [9, 9, 9],
"cqtd_bins_per_octaves": [24, 36, 48],
"mpd_reshapes": [2, 3, 5, 7, 11],
"use_spectral_norm": false,
"discriminator_channel_mult": 1
}
Mel-spectrogram config:
{
"sampling_rate": 22050,
"n_fft": 1024,
"num_mels": 80,
"hop_size": 256,
"win_size": 1024,
"fmin": 0,
"fmax": null,
"segment_size": 65536,
"use_multiscale_melloss": true,
"lambda_melloss": 15
}
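With these settings, each training segment of 65,536 samples corresponds to exactly 256 mel frames of about 11.6 ms each, roughly 3 seconds of audio per segment. A quick sanity check on the config values (pure arithmetic, not code from the repo):

```python
# Values from the mel-spectrogram config above.
sampling_rate = 22050
hop_size = 256
segment_size = 65536

frames_per_segment = segment_size // hop_size        # mel frames per segment
frame_duration_ms = 1000 * hop_size / sampling_rate  # ms of audio per frame
segment_duration_s = segment_size / sampling_rate    # seconds per segment

print(frames_per_segment)            # → 256
print(round(frame_duration_ms, 2))   # → 11.61
print(round(segment_duration_s, 2))  # → 2.97
```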
Usage
1. Command Line Interface
cd bigvgan_standalone
# Single-file reconstruction
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav
# Batch processing
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input a.wav b.wav c.wav --output_dir results/
# Reconstruction from a mel-spectrogram
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --mel mel.npy --output out.wav
# Processing from a file list
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --file_list test.txt --dataset_dir /data/ --output_dir results/ --max_samples 100
# CUDA kernel acceleration
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav --use_cuda_kernel
2. Importing as a Python Package
from bigvgan_standalone.bigvgan_api import BigVGANVocoder

# Load the vocoder (config_dir contains config.json)
vocoder = BigVGANVocoder(
    config_or_dir="exp/bigvgan_teochew_22khz",
    checkpoint_path="g_05330000"  # filename; automatically joined with the config dir
)

# Single-file reconstruction
vocoder.reconstruct_wav("input.wav", "output.wav")

# Batch reconstruction
vocoder.reconstruct_wav_batch(
    ["a.wav", "b.wav"],
    output_dir="results/"
)

# From a file list
vocoder.reconstruct_from_file_list(
    "filelists/teochew/test.txt",
    output_dir="results/",
    dataset_dir="/data/teochew_extla/",
    max_samples=100
)

# From a mel-spectrogram file
vocoder.mel_file_to_wav("mel.npy", "output.wav")

# Low-level tensor operations
audio = vocoder.load_wav("input.wav")  # auto-resample + normalize
mel = vocoder.compute_mel(audio)       # [1, 80, T]
recon = vocoder.mel_to_wav(mel)        # int16 numpy array
3. Using with Gradio / a Hugging Face Space
import gradio as gr
from bigvgan_standalone.bigvgan_api import BigVGANVocoder

vocoder = BigVGANVocoder("exp/bigvgan_teochew_22khz", "g_05330000")

demo = gr.Interface(
    fn=vocoder.gradio_reconstruct,
    inputs=gr.Audio(type="numpy"),
    outputs=gr.Audio(type="numpy"),
    title="BigVGAN Vocoder Demo"
)
demo.launch()
Evaluation Results
DNSMOS metrics for different checkpoints:
| Checkpoint | SIG | BAK | OVRL | Notes |
|---|---|---|---|---|
| g_05330000 | 3.4721 | 3.8557 | 3.1030 | Best |
| g_05350000 | 3.4650 | 3.8570 | 3.0989 | - |
| g_05190000 | 3.4709 | 3.8464 | 3.0971 | - |
| g_05390000 | 3.4673 | 3.8522 | 3.0968 | - |
| g_05280000 | 3.4732 | 3.8447 | 3.0959 | - |
| g_05110000 | 3.4654 | 3.8473 | 3.0926 | - |
| g_05210000 | 3.4674 | 3.8418 | 3.0924 | - |
| g_05140000 | 3.4693 | 3.8424 | 3.0923 | - |
| ... | ... | ... | ... | ... |
Full evaluation results (34 checkpoints tested):
The best performance was achieved at g_05330000 (iteration 5,330,000), with an OVRL score of 3.1030, very close to the ground-truth score of 3.1040.
Recommended: checkpoint g_05330000, which achieves the best OVRL (overall quality) score.
Performance Comparison
Compared with a HiFi-GAN trained on the same dataset:
| Model | DNSMOS OVRL | Training Time | Notes |
|---|---|---|---|
| BigVGAN (this) | 3.1030 | ~10 days | Fine-tuned from pretrained |
| HiFi-GAN | 3.0724 | ~8 days | Trained from scratch |
| Ground Truth | 3.1040 | - | - |
BigVGAN slightly outperforms HiFi-GAN in audio quality (+0.0306 OVRL), nearly matching the original audio.
Model tree for panlr/BigVGAN_22khz_teochew
Base model: nvidia/bigvgan_v2_22khz_80band_256x