BigVGAN Vocoder for Teochew TTS
Model Description
This model is a vocoder for Teochew speech synthesis based on the BigVGAN-v2 architecture, designed to reconstruct high-quality audio waveforms from mel-spectrograms. It was fine-tuned from pre-trained weights using NVIDIA's BigVGAN training code.
Pre-trained Weights
- Base model: nvidia/bigvgan_v2_22khz_80band_256x
Training Data
- Dataset: teochew-extLa
- Filter criteria: DNSMOS > 2.8
- Data size: ~283,000 audio clips
- Total duration: 555+ hours
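The filtering step above can be sketched as follows. This is a minimal illustration, not code from the training pipeline: the metadata layout and the assumption that the overall (OVRL) DNSMOS score is the filtered quantity are ours, since the card does not specify which DNSMOS component was thresholded.

```python
# Hypothetical sketch of DNSMOS-based data filtering: keep only clips
# whose (assumed) OVRL score exceeds the 2.8 threshold from the card.
def filter_by_dnsmos(clips, threshold=2.8):
    """Return the subset of clips scoring above the DNSMOS threshold."""
    return [c for c in clips if c["dnsmos_ovrl"] > threshold]

clips = [
    {"path": "a.wav", "dnsmos_ovrl": 3.1},
    {"path": "b.wav", "dnsmos_ovrl": 2.5},  # rejected: below threshold
    {"path": "c.wav", "dnsmos_ovrl": 2.9},
]
kept = filter_by_dnsmos(clips)
print([c["path"] for c in kept])  # → ['a.wav', 'c.wav']
```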
Training Configuration
Hardware:
- 4 × V100 GPUs
- Batch size: 4
Training iterations:
- Iterations: 400,000 fine-tuning steps (~5-6 epochs; training was slow). Note that checkpoint names such as g_05330000 continue the iteration counter of the pretrained weights rather than starting from zero.
Training duration: ~10 days
Hyperparameters:
- Learning rate: 0.0001
- Adam β1: 0.8
- Adam β2: 0.99
- LR decay: 0.9999996
- Gradient clip norm: 500
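The exponential LR decay factor looks negligible, but it compounds over the run. A quick arithmetic check of the effective learning rate after the ~400,000 fine-tuning iterations, assuming the decay is applied once per iteration (which the very-close-to-1 factor suggests):

```python
import math

lr0 = 1e-4         # initial learning rate from the card
gamma = 0.9999996  # LR decay factor, assumed per-iteration
steps = 400_000    # fine-tuning iterations

lr_final = lr0 * gamma ** steps
print(f"{lr_final:.4e}")  # → 8.5214e-05, i.e. the LR drops only ~15% over the run
```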
Model Architecture
Generator config:
{
"upsample_rates": [4, 4, 2, 2, 2, 2],
"upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
"upsample_initial_channel": 1536,
"resblock": "1",
"resblock_kernel_sizes": [3, 7, 11],
"resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
"activation": "snakebeta",
"snake_logscale": true,
"use_tanh_at_final": false,
"use_bias_at_final": false
}
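A consistency check worth noting: the generator's cumulative upsampling factor must equal the mel hop size (256 samples per frame), which is also where the "256x" in the base model's name comes from:

```python
from math import prod

# Generator upsampling stages and mel hop size, from the configs above.
upsample_rates = [4, 4, 2, 2, 2, 2]
hop_size = 256

# Each mel frame must expand into exactly hop_size waveform samples.
assert prod(upsample_rates) == hop_size  # 4*4*2*2*2*2 = 256
print(prod(upsample_rates))  # → 256
```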
Discriminator config:
{
"use_cqtd_instead_of_mrd": true,
"cqtd_filters": 128,
"cqtd_max_filters": 1024,
"cqtd_filters_scale": 1,
"cqtd_dilations": [1, 2, 4],
"cqtd_hop_lengths": [512, 256, 256],
"cqtd_n_octaves": [9, 9, 9],
"cqtd_bins_per_octaves": [24, 36, 48],
"mpd_reshapes": [2, 3, 5, 7, 11],
"use_spectral_norm": false,
"discriminator_channel_mult": 1
}
Mel-spectrogram config:
{
"sampling_rate": 22050,
"n_fft": 1024,
"num_mels": 80,
"hop_size": 256,
"win_size": 1024,
"fmin": 0,
"fmax": null,
"segment_size": 65536,
"use_multiscale_melloss": true,
"lambda_melloss": 15
}
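With these settings, each training segment of 65,536 samples corresponds to exactly 256 mel frames of about 11.6 ms each, roughly 3 seconds of audio per segment. A quick sanity check on the config values (pure arithmetic, not code from the repo):

```python
# Values from the mel-spectrogram config above.
sampling_rate = 22050
hop_size = 256
segment_size = 65536

frames_per_segment = segment_size // hop_size        # mel frames per segment
frame_duration_ms = 1000 * hop_size / sampling_rate  # ms of audio per frame
segment_duration_s = segment_size / sampling_rate    # seconds per segment

print(frames_per_segment)            # → 256
print(round(frame_duration_ms, 2))   # → 11.61
print(round(segment_duration_s, 2))  # → 2.97
```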
Usage
1. Command Line Interface
cd bigvgan_standalone
# Single-file reconstruction
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav
# Batch processing
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input a.wav b.wav c.wav --output_dir results/
# Reconstruction from a mel-spectrogram
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --mel mel.npy --output out.wav
# Processing from a file list
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --file_list test.txt --dataset_dir /data/ --output_dir results/ --max_samples 100
# CUDA kernel acceleration
python bigvgan_api.py --config_dir exp/bigvgan_teochew_22khz --checkpoint g_05330000 \
    --input audio.wav --output out.wav --use_cuda_kernel
2. Importing as a Python Package
from bigvgan_standalone.bigvgan_api import BigVGANVocoder

# Load the vocoder (config_dir contains config.json)
vocoder = BigVGANVocoder(
    config_or_dir="exp/bigvgan_teochew_22khz",
    checkpoint_path="g_05330000"  # filename; automatically joined with the config dir
)

# Single-file reconstruction
vocoder.reconstruct_wav("input.wav", "output.wav")

# Batch reconstruction
vocoder.reconstruct_wav_batch(
    ["a.wav", "b.wav"],
    output_dir="results/"
)

# From a file list
vocoder.reconstruct_from_file_list(
    "filelists/teochew/test.txt",
    output_dir="results/",
    dataset_dir="/data/teochew_extla/",
    max_samples=100
)

# From a mel-spectrogram file
vocoder.mel_file_to_wav("mel.npy", "output.wav")

# Low-level tensor operations
audio = vocoder.load_wav("input.wav")  # auto-resample + normalize
mel = vocoder.compute_mel(audio)       # [1, 80, T]
recon = vocoder.mel_to_wav(mel)        # int16 numpy array
3. Using with Gradio / a Hugging Face Space
import gradio as gr
from bigvgan_standalone.bigvgan_api import BigVGANVocoder

vocoder = BigVGANVocoder("exp/bigvgan_teochew_22khz", "g_05330000")

demo = gr.Interface(
    fn=vocoder.gradio_reconstruct,
    inputs=gr.Audio(type="numpy"),
    outputs=gr.Audio(type="numpy"),
    title="BigVGAN Vocoder Demo"
)
demo.launch()
Evaluation Results
DNSMOS metrics for different checkpoints:
| Checkpoint | SIG | BAK | OVRL | Notes |
|---|---|---|---|---|
| g_05330000 | 3.4721 | 3.8557 | 3.1030 | Best |
| g_05350000 | 3.4650 | 3.8570 | 3.0989 | - |
| g_05190000 | 3.4709 | 3.8464 | 3.0971 | - |
| g_05390000 | 3.4673 | 3.8522 | 3.0968 | - |
| g_05280000 | 3.4732 | 3.8447 | 3.0959 | - |
| g_05110000 | 3.4654 | 3.8473 | 3.0926 | - |
| g_05210000 | 3.4674 | 3.8418 | 3.0924 | - |
| g_05140000 | 3.4693 | 3.8424 | 3.0923 | - |
| ... | ... | ... | ... | ... |
Full evaluation results (34 checkpoints tested):
The best performance was achieved at g_05330000 (iteration 5,330,000), with an OVRL score of 3.1030, very close to the ground-truth score of 3.1040.
Recommended: checkpoint g_05330000, which achieves the best OVRL (overall quality) score.
Performance Comparison
Compared with a HiFi-GAN trained on the same dataset:
| Model | DNSMOS OVRL | Training Time | Notes |
|---|---|---|---|
| BigVGAN (this) | 3.1030 | ~10 days | Fine-tuned from pretrained |
| HiFi-GAN | 3.0724 | ~8 days | Trained from scratch |
| Ground Truth | 3.1040 | - | - |
BigVGAN slightly outperforms HiFi-GAN in audio quality (+0.0306 OVRL), nearly matching the original audio.
Model tree for panlr/BigVGAN_22khz_teochew
Base model: nvidia/bigvgan_v2_22khz_80band_256x