HiFi-GAN Vocoder for Teochew TTS

Model Description

This is a HiFi-GAN vocoder for Teochew speech synthesis. It reconstructs high-quality audio waveforms from mel-spectrograms and was trained from scratch using the training code in NVIDIA's DeepLearningExamples.

Training Data

  • Dataset: teochew-extLa
  • Filter Criteria: DNSMOS OVRL > 2.8
  • Data Size: ~283,000 audio clips
  • Total Duration: 555+ hours
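
As a quick sanity check on these figures, the average clip length follows from the clip count and the total duration. The sketch below treats both numbers as exact, although they are given only approximately ("~283,000" and "555+"), so the result is a rough estimate.

```python
# Rough average clip length from the approximate figures listed above.
num_clips = 283_000   # "~283,000 audio clips"
total_hours = 555     # "555+ hours"

avg_clip_seconds = total_hours * 3600 / num_clips
print(f"average clip length: about {avg_clip_seconds:.1f} s")
```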

Training Configuration

Hardware:

  • Dual GPU training
  • Batch Size: 64
  • Gradient Accumulation: 4

Hyperparameters:

Epochs: 400
Learning Rate: 0.0003
Learning Rate Decay: 0.9998
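
For intuition about how slowly this learning rate decays, the sketch below assumes the decay factor is applied once per epoch as a plain exponential schedule (lr = initial_lr * decay ** epoch). When the decay is actually applied is not stated here, so both that schedule and the helper name `lr_at_epoch` are assumptions for illustration.

```python
INITIAL_LR = 0.0003
LR_DECAY = 0.9998
EPOCHS = 400

def lr_at_epoch(epoch: int) -> float:
    """Learning rate after `epoch` decay steps (assumed: one step per epoch)."""
    return INITIAL_LR * LR_DECAY ** epoch

# Print the assumed learning rate at the evaluated checkpoints.
for epoch in (0, 100, 210, 300, EPOCHS):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.3e}")
```

Under this assumption the learning rate changes by less than 10% over the full 400 epochs.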

Training Duration: ~8 days

Model Architecture

Generator Config:

{
    "upsample_rates": [8, 8, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4],
    "upsample_initial_channel": 512,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
}
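
The upsampling settings determine how many waveform samples the generator produces per mel frame. The sketch below is plain arithmetic, not the model code. It assumes the standard HiFi-GAN generator layout, in which each transposed convolution uses padding (kernel - rate) // 2 and halves the channel count. Under those assumptions the four stages multiply the length by 8 * 8 * 2 * 2 = 256, which matches the hop_length of 256 in the mel-spectrogram config.

```python
upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]
upsample_initial_channel = 512

def conv_transpose_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel

frames = 100  # hypothetical number of input mel frames
length = frames
channels = upsample_initial_channel
for rate, kernel in zip(upsample_rates, upsample_kernel_sizes):
    padding = (kernel - rate) // 2  # assumed, as in the reference HiFi-GAN generator
    length = conv_transpose_out_len(length, kernel, rate, padding)
    channels //= 2                  # assumed channel halving at each stage
    print(f"x{rate}: length={length}, channels={channels}")

samples_per_frame = length // frames
print("samples per mel frame:", samples_per_frame)
```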

Mel-Spectrogram Config:

{
    "sampling_rate": 22050,
    "filter_length": 1024,
    "num_mels": 80,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 11025.0,
    "max_wav_value": 32768.0
}
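
These values fix the time resolution of the mel-spectrogram. As a small illustration (the helper name `frames_to_seconds` is hypothetical), the sketch below converts between mel frames and seconds and checks that mel_fmax equals the Nyquist frequency for this sampling rate.

```python
sampling_rate = 22050
hop_length = 256
win_length = 1024
mel_fmax = 11025.0

frames_per_second = sampling_rate / hop_length   # about 86.1 frames per second
window_ms = 1000 * win_length / sampling_rate    # about 46.4 ms per analysis window

def frames_to_seconds(num_frames: int) -> float:
    """Duration of the audio corresponding to `num_frames` mel frames."""
    return num_frames * hop_length / sampling_rate

# mel_fmax sits exactly at the Nyquist frequency (half the sampling rate).
assert mel_fmax == sampling_rate / 2

print(f"{frames_per_second:.1f} frames/s, {window_ms:.1f} ms window")
print(f"500 frames correspond to {frames_to_seconds(500):.2f} s of audio")
```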

Usage

1. Command Line Interface

cd hifigan_standalone
python hifigan_api.py --checkpoint path/to/ckpt.pt --input audio.wav --output out.wav

2. Python Package Import

# Add the directory containing hifigan_standalone to PYTHONPATH, or place it in your project
from hifigan_standalone import HiFiGANVocoder

vocoder = HiFiGANVocoder("path/to/checkpoint.pt")
vocoder.reconstruct_wav("input.wav", "output.wav")
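
To process a whole folder, the same call can be wrapped in a loop. The helper below is a hypothetical sketch and not part of hifigan_standalone. It assumes only that the vocoder object exposes reconstruct_wav(input_path, output_path) with string paths, as in the example above.

```python
from pathlib import Path

def reconstruct_directory(vocoder, in_dir: str, out_dir: str) -> list:
    """Run vocoder.reconstruct_wav on every .wav file in in_dir (hypothetical helper).

    Returns the list of output paths, in sorted file-name order.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for wav in sorted(Path(in_dir).glob("*.wav")):
        target = out_path / wav.name
        vocoder.reconstruct_wav(str(wav), str(target))
        written.append(str(target))
    return written
```

With a loaded vocoder this would be called as reconstruct_directory(vocoder, "wavs_in", "wavs_out"), where both directory names are placeholders.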

3. Use with Gradio / HuggingFace Space

import gradio as gr
from hifigan_standalone import HiFiGANVocoder

vocoder = HiFiGANVocoder("ckpt.pt")
demo = gr.Interface(
    fn=vocoder.gradio_reconstruct,
    inputs=gr.Audio(),
    outputs=gr.Audio()
)
demo.launch()

Evaluation Results

DNSMOS scores for different checkpoints:

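| Checkpoint   | SIG    | BAK    | OVRL   | Notes                         |
|--------------|--------|--------|--------|-------------------------------|
| Ground Truth | 3.4699 | 3.8607 | 3.1040 | -                             |
| Epoch 100    | 3.4416 | 3.7775 | 3.0382 | -                             |
| Epoch 210    | 3.4737 | 3.7889 | 3.0724 | Best                          |
| Epoch 300    | 3.4232 | 3.7777 | 3.0250 | Sounds better than Epoch 210  |
| Epoch 400    | 3.3923 | 3.7841 | 3.0050 | Overfitting                   |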

Recommended: the Epoch 210 or Epoch 300 checkpoint.
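
The comparison in the table can be reproduced with a few lines. The sketch below copies the scores from the table and ranks the checkpoints by OVRL. It only restates the reported numbers and says nothing about perceived quality; the notes column records that Epoch 300 sounds better than Epoch 210 despite its lower score.

```python
# DNSMOS scores (SIG, BAK, OVRL) copied from the evaluation table.
scores = {
    "Ground Truth": (3.4699, 3.8607, 3.1040),
    "Epoch 100": (3.4416, 3.7775, 3.0382),
    "Epoch 210": (3.4737, 3.7889, 3.0724),
    "Epoch 300": (3.4232, 3.7777, 3.0250),
    "Epoch 400": (3.3923, 3.7841, 3.0050),
}

gt_ovrl = scores["Ground Truth"][2]
checkpoints = {name: s for name, s in scores.items() if name != "Ground Truth"}

# Rank by OVRL, the third value in each tuple.
best = max(checkpoints, key=lambda name: checkpoints[name][2])

for name, (sig, bak, ovrl) in checkpoints.items():
    print(f"{name}: OVRL gap to ground truth = {gt_ovrl - ovrl:.4f}")
print("highest OVRL:", best)
```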
