
Instasamka Vocoder

HiFi-GAN vocoder trained from scratch on voice recordings for text-to-speech synthesis.

Model Details

  • Architecture: HiFi-GAN (generator + discriminator)
  • Training: Trained from scratch (random initialization, not fine-tuned from a pretrained model) for 10,000 epochs
  • Final Metrics:
    • Generator Loss: ~32
    • Mel-Spectrogram Loss: ~0.32
    • Discriminator Loss: ~2.2
    • Final Validation Error: ~0.37

Training Configuration

  • Sample Rate: 22050 Hz
  • FFT Size: 1024
  • Hop Length: 256
  • Win Length: 1024
  • Mel Channels: 80
  • Batch Size: 8
  • Training Epochs: 10,000

Usage

import json
import torch

# Load config (config.json is JSON, so parse it with the json module)
with open('config.json') as f:
    config = json.load(f)

# Load generator (the Generator class comes from the HiFi-GAN model code)
generator = Generator(config).cuda()
state_dict = torch.load('generator_best.pt', map_location='cuda')
generator.load_state_dict(state_dict)
generator.eval()

# Generate audio from an 80-channel mel-spectrogram
with torch.no_grad():
    audio = generator(mel_spectrogram)
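The resulting tensor can then be written to a 16-bit WAV file using only the standard library. In this sketch a random tensor stands in for actual generator output, whose exact shape depends on the Generator implementation:

```python
import wave

import numpy as np
import torch

# Placeholder standing in for generator output: (channels, samples) in [-1, 1]
audio = torch.randn(1, 22050).clamp(-1.0, 1.0)

# Convert float audio to 16-bit PCM
pcm = (audio.squeeze(0).numpy() * 32767).astype(np.int16)

with wave.open('output.wav', 'wb') as f:
    f.setnchannels(1)       # mono
    f.setsampwidth(2)       # 16-bit samples
    f.setframerate(22050)   # match the training sample rate
    f.writeframes(pcm.tobytes())
```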

Training Data

Voice recordings resampled to 22050 Hz mono and split into 15-second chunks.
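The chunking step can be sketched with a small helper (`chunk_audio` is a hypothetical name for illustration, not part of the training code):

```python
import torch

SAMPLE_RATE = 22050
CHUNK_SECONDS = 15
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def chunk_audio(waveform: torch.Tensor) -> list:
    """Split a mono waveform of shape (1, samples) into fixed 15-second
    chunks, dropping any trailing remainder shorter than a full chunk."""
    samples = waveform.shape[-1]
    return [waveform[..., i:i + CHUNK_SAMPLES]
            for i in range(0, samples - CHUNK_SAMPLES + 1, CHUNK_SAMPLES)]


# 40 seconds of dummy mono audio -> two full 15-second chunks
chunks = chunk_audio(torch.randn(1, SAMPLE_RATE * 40))
```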

Limitations

  • Trained on limited dataset (single speaker/style)
  • Mel-error ~0.32 indicates room for improvement
  • May produce artifacts on out-of-distribution inputs (mel-spectrograms unlike the training data)

Citations

@article{kong2020hifigan,
  title={HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  journal={arXiv preprint arXiv:2010.05646},
  year={2020}
}