HiFi-GAN Vocoder for Teochew TTS

Model Description

This model is a HiFi-GAN vocoder trained for Teochew speech synthesis. It reconstructs high-quality audio waveforms from mel-spectrograms and was trained from scratch using the training code in NVIDIA's DeepLearningExamples repository.

Training Data

Dataset: teochew-extLa
Filter Criteria: DNSMOS OVRL > 2.8
Data Size: ~283,000 audio clips
Total Duration: 555+ hours
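
As a rough sanity check on the figures above, the average clip length can be estimated from the clip count and the total duration. Both inputs are approximate (the card gives "~283,000" and "555+"), so the result is only a ballpark figure.

```python
# Approximate figures from the model card; both are rounded.
num_clips = 283_000
total_hours = 555

# Average clip length in seconds (a lower bound, since the total is "555+").
avg_clip_seconds = total_hours * 3600 / num_clips
print(f"average clip length: about {avg_clip_seconds:.1f} s")
```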

Training Configuration

Hardware:

Dual GPU training
Batch Size: 64
Gradient Accumulation: 4

Hyperparameters:

Epochs: 400
Learning Rate: 0.0003
Learning Rate Decay: 0.9998

Training Duration: ~8 days
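
To get a feel for how slowly the learning rate decays with these settings, the sketch below assumes the decay factor is applied once per epoch as a plain exponential schedule, as in the original HiFi-GAN training script. The card does not state the granularity, so treat this as an assumption; `lr_at_epoch` is an illustrative helper, not part of the training code.

```python
initial_lr = 0.0003
lr_decay = 0.9998
epochs = 400

def lr_at_epoch(epoch: int) -> float:
    """Learning rate after `epoch` decay steps (assumed: one step per epoch)."""
    return initial_lr * lr_decay ** epoch

# Print the learning rate at the evaluated checkpoints.
for epoch in (0, 100, 210, 300, epochs):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.6e}")
```

Under this assumption the learning rate falls by only about 8% over the full 400 epochs.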

Model Architecture

Generator Config:

{
    "upsample_rates": [8, 8, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4],
    "upsample_initial_channel": 512,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
}
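
The generator config implies four transposed-convolution upsampling stages. The sketch below derives the total upsampling factor and the channel width after each stage, assuming the standard HiFi-GAN layout in which every upsampling stage halves the channel count. The variable names are illustrative only.

```python
from math import prod

upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]
upsample_initial_channel = 512

# Number of audio samples generated per input mel frame.
total_upsampling = prod(upsample_rates)

# Assumed: each stage halves the channel count (standard HiFi-GAN generator).
stage_channels = [
    upsample_initial_channel // 2 ** (i + 1)
    for i in range(len(upsample_rates))
]

for rate, kernel, channels in zip(upsample_rates, upsample_kernel_sizes, stage_channels):
    print(f"stride {rate}, kernel {kernel} -> {channels} channels")
print("total upsampling factor:", total_upsampling)
```

The total factor of 256 matches the hop_length in the mel-spectrogram config, so each mel frame expands to exactly one hop of audio samples.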

Mel-Spectrogram Config:

{
    "sampling_rate": 22050,
    "filter_length": 1024,
    "num_mels": 80,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 11025.0,
    "max_wav_value": 32768.0
}
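
A few quantities follow directly from the mel-spectrogram config: the frame rate, the duration of one hop, and the fact that mel_fmax sits at the Nyquist frequency. The helper `num_mel_frames` is a hypothetical illustration that assumes one frame per full hop with no extra padding frames, which may differ slightly from the actual feature extractor.

```python
config = {
    "sampling_rate": 22050,
    "filter_length": 1024,
    "num_mels": 80,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 11025.0,
}

# Mel frames produced per second of audio, and the time step between frames.
frames_per_second = config["sampling_rate"] / config["hop_length"]
hop_ms = 1000 * config["hop_length"] / config["sampling_rate"]

# Highest representable frequency at this sampling rate.
nyquist = config["sampling_rate"] / 2

def num_mel_frames(num_samples: int, hop_length: int = config["hop_length"]) -> int:
    """Approximate frame count, assuming one frame per full hop (no padding)."""
    return num_samples // hop_length

print(f"{frames_per_second:.2f} frames/s, {hop_ms:.2f} ms per hop")
print("mel_fmax equals Nyquist:", config["mel_fmax"] == nyquist)
print("frames in a 1 s clip:", num_mel_frames(config["sampling_rate"]))
```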

Usage

1. Command Line Interface

cd hifigan_standalone
python hifigan_api.py --checkpoint path/to/ckpt.pt --input audio.wav --output out.wav

2. Python Package Import

# Add the directory containing hifigan_standalone to PYTHONPATH, or place it in your project
from hifigan_standalone import HiFiGANVocoder

vocoder = HiFiGANVocoder("path/to/checkpoint.pt")
vocoder.reconstruct_wav("input.wav", "output.wav")
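
For processing many files, a small wrapper around the call shown above may be convenient. This sketch relies only on the `reconstruct_wav(input_path, output_path)` call from the example; the function name `reconstruct_directory` and the directory layout are assumptions, not part of the package.

```python
from pathlib import Path

def reconstruct_directory(vocoder, input_dir, output_dir):
    """Run vocoder.reconstruct_wav on every .wav file in input_dir.

    `vocoder` is any object exposing reconstruct_wav(input_path, output_path),
    such as the HiFiGANVocoder instance created above.
    Returns the list of output paths that were requested.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    # Sort for a deterministic processing order.
    for src in sorted(input_dir.glob("*.wav")):
        dst = output_dir / src.name
        vocoder.reconstruct_wav(str(src), str(dst))
        outputs.append(dst)
    return outputs
```

A call such as `reconstruct_directory(vocoder, "wavs_in", "wavs_out")` would then request one resynthesized file per input clip, keeping the original file names.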

3. Use with Gradio / HuggingFace Space

import gradio as gr
from hifigan_standalone import HiFiGANVocoder

vocoder = HiFiGANVocoder("ckpt.pt")
demo = gr.Interface(
    fn=vocoder.gradio_reconstruct,
    inputs=gr.Audio(),
    outputs=gr.Audio()
)
demo.launch()

Evaluation Results

DNSMOS scores for different checkpoints:

Checkpoint	SIG	BAK	OVRL	Notes
Ground Truth	3.4699	3.8607	3.1040	-
Epoch 100	3.4416	3.7775	3.0382	-
Epoch 210	3.4737	3.7889	3.0724	Best
Epoch 300	3.4232	3.7777	3.0250	Sounds better than Epoch 210
Epoch 400	3.3923	3.7841	3.0050	Overfitting

Recommended: the Epoch 210 or Epoch 300 checkpoint.
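
The gap between each checkpoint and the ground-truth recordings can be read off the table. The sketch below makes that comparison explicit using only the scores listed above; it ranks by DNSMOS OVRL and does not capture the subjective listening impression noted for Epoch 300.

```python
# (SIG, BAK, OVRL) scores copied from the evaluation table.
scores = {
    "Ground Truth": (3.4699, 3.8607, 3.1040),
    "Epoch 100": (3.4416, 3.7775, 3.0382),
    "Epoch 210": (3.4737, 3.7889, 3.0724),
    "Epoch 300": (3.4232, 3.7777, 3.0250),
    "Epoch 400": (3.3923, 3.7841, 3.0050),
}

gt_ovrl = scores["Ground Truth"][2]
checkpoints = {name: s for name, s in scores.items() if name != "Ground Truth"}

# Checkpoint with the highest overall DNSMOS score.
best = max(checkpoints, key=lambda name: checkpoints[name][2])

for name, (sig, bak, ovrl) in checkpoints.items():
    print(f"{name}: OVRL {ovrl:.4f} (gap to ground truth {gt_ovrl - ovrl:.4f})")
print("highest OVRL:", best)
```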
