BigVGAN Vocoder for Teochew TTS

Model Description

This model is a vocoder for Teochew text-to-speech built on the BigVGAN-v2 architecture, designed to reconstruct high-quality audio waveforms from mel-spectrograms. It was fine-tuned from pre-trained weights using NVIDIA's BigVGAN training code.

Pre-trained Weights

Training Data

  • Dataset: teochew-extLa
  • Filter criterion: DNSMOS > 2.8
  • Data size: ~283,000 audio clips
  • Total duration: 555+ hours

Training Configuration

Hardware:

  • 4 × V100 GPUs
  • Batch size: 4

Training iterations:

  • Iterations: 400,000 (roughly 5-6 epochs; training progressed slowly on this setup)

Training duration: ~10 days

Hyperparameters:

  • Learning rate: 0.0001
  • Adam β1: 0.8
  • Adam β2: 0.99
  • LR decay: 0.9999996
  • Gradient clip norm: 500
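These values follow a HiFi-GAN-style exponential learning-rate schedule. As a rough sanity check, assuming the decay factor is applied once per optimizer step (an assumption; the training code may instead step the scheduler per epoch), the effective learning rate after N iterations is lr0 · 0.9999996^N:

```python
# Effective learning rate under per-step exponential decay (assumption:
# the trainer may instead step the scheduler once per epoch).
def lr_at_step(lr0: float, gamma: float, step: int) -> float:
    """Learning rate after `step` iterations of ExponentialLR-style decay."""
    return lr0 * gamma ** step

lr0, gamma = 1e-4, 0.9999996
for n in (0, 100_000, 400_000):
    print(f"step {n:>7}: lr = {lr_at_step(lr0, gamma, n):.3e}")
# At 400,000 steps the rate has decayed to about 8.52e-05.
```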

Model Architecture

Generator config:

{
    "upsample_rates": [4, 4, 2, 2, 2, 2],
    "upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
    "upsample_initial_channel": 1536,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    
    "activation": "snakebeta",
    "snake_logscale": true,
    "use_tanh_at_final": false,
    "use_bias_at_final": false
}
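A structural invariant worth noting: in HiFi-GAN/BigVGAN-style generators, the product of `upsample_rates` must equal the mel `hop_size` (256 here), so each mel frame expands into exactly one hop of audio. A quick check:

```python
import math

# The generator's total upsampling factor is the product of upsample_rates;
# it must match the mel hop_size so one mel frame maps to one hop of audio.
upsample_rates = [4, 4, 2, 2, 2, 2]
hop_size = 256

total_upsampling = math.prod(upsample_rates)
assert total_upsampling == hop_size  # 4*4*2*2*2*2 = 256

# A mel input of shape [1, 80, T] therefore yields about T * 256 samples.
print(f"{total_upsampling=}")
```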

Discriminator config:

{
    "use_cqtd_instead_of_mrd": true,
    "cqtd_filters": 128,
    "cqtd_max_filters": 1024,
    "cqtd_filters_scale": 1,
    "cqtd_dilations": [1, 2, 4],
    "cqtd_hop_lengths": [512, 256, 256],
    "cqtd_n_octaves": [9, 9, 9],
    "cqtd_bins_per_octaves": [24, 36, 48],
    
    "mpd_reshapes": [2, 3, 5, 7, 11],
    "use_spectral_norm": false,
    "discriminator_channel_mult": 1
}
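Each CQT sub-discriminator analyzes 9 octaves at a different frequency resolution. Under the standard CQT parameterization (total bins = octaves × bins-per-octave; stated here as general background, not taken from this repo's code), the per-resolution bin counts work out as:

```python
# Per-resolution CQT bin counts implied by the config above.
n_octaves = [9, 9, 9]
bins_per_octave = [24, 36, 48]

# Standard CQT sizing: total bins = octaves * bins-per-octave.
total_bins = [o * b for o, b in zip(n_octaves, bins_per_octave)]
print(total_bins)  # [216, 324, 432]
```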

Mel-spectrogram config:

{
    "sampling_rate": 22050,
    "n_fft": 1024,
    "num_mels": 80,
    "hop_size": 256,
    "win_size": 1024,
    "fmin": 0,
    "fmax": null,
    
    "segment_size": 65536,
    "use_multiscale_melloss": true,
    "lambda_melloss": 15
}
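A quick sketch of how these values relate: each training segment of 65,536 samples corresponds to `segment_size / hop_size` mel frames, which is roughly three seconds of audio at 22.05 kHz:

```python
# Relation between segment length, mel frames, and wall-clock duration.
sampling_rate = 22050
hop_size = 256
segment_size = 65536

frames_per_segment = segment_size // hop_size   # mel frames per segment
segment_seconds = segment_size / sampling_rate  # duration in seconds
print(frames_per_segment, round(segment_seconds, 2))  # 256 frames, ~2.97 s
```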

Usage

1. Command-Line Interface

cd bigvgan_standalone

# Single file reconstruction
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav

# Batch processing
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input a.wav b.wav c.wav --output_dir results/

# Reconstruct from a mel-spectrogram
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --mel mel.npy --output out.wav

# Process from a file list
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --file_list test.txt --dataset_dir /data/ --output_dir results/ --max_samples 100

# CUDA kernel acceleration
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav --use_cuda_kernel

2. Importing as a Python Package

from bigvgan_standalone.bigvgan_api import BigVGANVocoder

# Load the vocoder (config_dir contains config.json)
vocoder = BigVGANVocoder(
    config_or_dir="exp/bigvgan_teochew_22khz",
    checkpoint_path="g_05330000"  # filename; joined with config_dir automatically
)

# Single file reconstruction
vocoder.reconstruct_wav("input.wav", "output.wav")

# Batch reconstruction
vocoder.reconstruct_wav_batch(
    ["a.wav", "b.wav"],
    output_dir="results/"
)

# From a file list
vocoder.reconstruct_from_file_list(
    "filelists/teochew/test.txt",
    output_dir="results/",
    dataset_dir="/data/teochew_extla/",
    max_samples=100
)

# From a mel-spectrogram file
vocoder.mel_file_to_wav("mel.npy", "output.wav")

# Low-level tensor operations
audio = vocoder.load_wav("input.wav")    # auto-resample + normalize
mel = vocoder.compute_mel(audio)          # [1, 80, T]
recon = vocoder.mel_to_wav(mel)           # int16 np array
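The comment above notes that `mel_to_wav` returns an int16 array. As a hedged, plain-Python illustration of what such a conversion typically involves (not this API's actual implementation), float samples are clipped to [-1, 1] and scaled to the 16-bit PCM range:

```python
def float_to_int16(samples):
    """Clip float samples to [-1, 1] and scale to 16-bit PCM range."""
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # guard against clipping overflow
        out.append(int(x * 32767))  # scale to int16 full scale
    return out

print(float_to_int16([0.0, 0.5, 1.2, -1.5]))  # [0, 16383, 32767, -32767]
```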

3. Using with Gradio / Hugging Face Spaces

import gradio as gr
from bigvgan_standalone.bigvgan_api import BigVGANVocoder

vocoder = BigVGANVocoder("exp/bigvgan_teochew_22khz", "g_05330000")

demo = gr.Interface(
    fn=vocoder.gradio_reconstruct,
    inputs=gr.Audio(type="numpy"),
    outputs=gr.Audio(type="numpy"),
    title="BigVGAN Vocoder Demo"
)
demo.launch()

Evaluation Results

DNSMOS scores for each checkpoint (SIG = speech quality, BAK = background noise quality, OVRL = overall quality):

Checkpoint    SIG     BAK     OVRL    Notes
g_05330000    3.4721  3.8557  3.1030  Best
g_05350000    3.4650  3.8570  3.0989  -
g_05190000    3.4709  3.8464  3.0971  -
g_05390000    3.4673  3.8522  3.0968  -
g_05280000    3.4732  3.8447  3.0959  -
g_05110000    3.4654  3.8473  3.0926  -
g_05210000    3.4674  3.8418  3.0924  -
g_05140000    3.4693  3.8424  3.0923  -
...           ...     ...     ...     ...

The full evaluation covered 34 checkpoints. The best result was achieved at g_05330000 (iteration 5,330,000), with an OVRL score of 3.1030, very close to the ground-truth score of 3.1040.

Recommended: checkpoint g_05330000, which achieves the best OVRL (overall quality) score.

Performance Comparison

Compared with a HiFi-GAN trained on the same dataset:

Model           DNSMOS OVRL  Training Time  Notes
BigVGAN (this)  3.1030       ~10 days       Fine-tuned from pretrained
HiFi-GAN        3.0724       ~8 days        Trained from scratch
Ground Truth    3.1040       -              -

BigVGAN slightly outperforms HiFi-GAN in audio quality (+0.0306 OVRL), nearly matching the original audio.
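A quick arithmetic check of the reported gaps, using the values from the tables above (variable names are illustrative):

```python
# Scores from the comparison table.
bigvgan, hifigan, ground_truth = 3.1030, 3.0724, 3.1040

gap_vs_hifigan = bigvgan - hifigan   # BigVGAN's lead over HiFi-GAN
gap_vs_gt = ground_truth - bigvgan   # remaining gap to ground truth
print(round(gap_vs_hifigan, 4), round(gap_vs_gt, 4))  # 0.0306 0.001
```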
