BigVGAN_T50_48k_CoreML (FP32)

Overview

This is a CoreML decoder based on BigVGAN that converts latent tensors into raw audio waveforms at 48 kHz. It is intended as the final decoding stage in a latent-audio pipeline, running on Core ML for a significant speed-up and better stability compared with ONNX or PyTorch on MPS.

The model was originally designed to decode latents produced by DiffRhythm2. The original PyTorch weights come from: https://huggingface.co/ASLP-lab/DiffRhythm2/blob/main/decoder.bin

While the decoder may work with other compatible latents (untested), DiffRhythm2 is the reference source.

Tested with DiffRhythm2 FP32 and FP16 (see: https://huggingface.co/qpqpqpqpqpqp/DiffRhythm2_fp16)

Simple Usage

import torch, torchaudio
import coremltools as ct

model = ct.models.MLModel("BigVGAN_T50.mlpackage")
audio = model.predict({"latent": latent.cpu().numpy()})["clip_0"]  # shape [1, 1, 480000]
torchaudio.save("output.wav", torch.from_numpy(audio).squeeze(0), 48000, format="wav")

Input

  • Tensor shape: [1, 64, T]
  • Type: float32 or float16 (both accepted, identical output)
  • 64 → latent channel dimension
  • T → temporal latent length
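The expected input can be sanity-checked before calling predict. A minimal sketch, with random data standing in for a real DiffRhythm2 latent:

```python
import numpy as np

# Dummy latent in the layout the model expects: [batch, channels, T].
latent = np.random.randn(1, 64, 50).astype(np.float32)

# Shape and dtype checks matching the spec above.
assert latent.shape[:2] == (1, 64)
assert latent.dtype in (np.float32, np.float16)
print(latent.shape)   # (1, 64, 50)
```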

Base time unit (important)

The CoreML model is exported with a fixed base window of T = 50. The original bigvgan.decode_audio() handles chunking internally, so PyTorch imposes no restriction on T. That dynamic behavior can be reproduced with ONNX, but not with CoreML, hence the fixed window.

PyTorch Internals (from decoder.json):

  • fps = 5
  • T = 50 ≈ 10 seconds of audio
  • 1 latent frame ≈ 0.2 seconds
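These numbers fit together as a simple arithmetic check; `up` below is the per-frame upsampling factor implied by 48 000 Hz at 5 latent frames per second:

```python
# Timing arithmetic implied by decoder.json (fps = 5, sr = 48 kHz).
sr = 48_000          # output sampling rate (Hz)
fps = 5              # latent frames per second
T = 50               # latent frames per CoreML call

up = sr // fps       # audio samples per latent frame
seconds = T / fps    # audio duration per decode call

print(up)            # 9600 samples per frame (0.2 s)
print(seconds)       # 10.0 seconds per call
print(T * up)        # 480000 output samples
```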

In CoreML, this model expects exactly T = 50 per call.

Output

  • Tensor shape: [1, 1, 480000]
  • Sampling rate: 48,000 Hz (fixed by the model)
  • Duration: ~10 seconds per decode call

The output length is fixed and non-configurable.
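Since the output shape is always [1, 1, 480000], flattening it to a 1-D waveform is a one-liner. Illustrative sketch with a zero array standing in for the `clip_0` output:

```python
import numpy as np

# Stand-in for the model output: predict() returns clip_0 of shape [1, 1, 480000].
audio = np.zeros((1, 1, 480_000), dtype=np.float32)

waveform = audio.squeeze()                # -> shape (480000,)
duration_s = waveform.shape[0] / 48_000   # fixed 48 kHz sampling rate
print(duration_s)                         # 10.0
```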

Precision behavior

  • The model is exported as FP32 with a macOS 14 minimum deployment target.
  • Inputs may be float16 or float32; no casting is required.

Decoding longer audio (required)

For latents with T > 50, decoding must be done manually by chunking.

Recommended:

  • Chunk size: T = 50. For this model it is always 50, but the reference script keeps it as a parameter in case the model is re-converted with a different T.
  • Overlap: 5 frames (values of 10–30 produced no audible difference in testing; open a discussion if you find a better setting).
  • Reconstruction: windowed overlap-add

Padding is required for the final chunk if T is not divisible by 50. Extra decoded audio introduced by padding must be trimmed.
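The chunk/pad arithmetic can be sketched independently of the model. `chunk_schedule` below is a hypothetical helper, not part of the package; it returns where each chunk starts and how many repeated frames the final chunk needs:

```python
def chunk_schedule(total_T, T=50, overlap=5):
    """Return (start, pad) pairs: each chunk starts at `start` and needs
    `pad` repeated frames appended to reach exactly T frames."""
    step = T - overlap
    schedule = []
    for start in range(0, total_T, step):
        valid = min(T, total_T - start)       # frames actually available
        schedule.append((start, T - valid))   # pad only the short final chunk
        if start + T >= total_T:
            break
    return schedule

# A 120-frame latent needs three 50-frame calls; only the last is padded.
print(chunk_schedule(120))   # [(0, 0), (45, 0), (90, 20)]
```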

Naive overlap-add causes a noticeable gain increase in the overlapped regions. Zero overlap avoids the gain increase but may introduce small boundary artifacts at chunk seams.
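A windowed crossfade avoids the gain bump because the fade-in and fade-out weights sum to 1 at every sample. A quick check using the same raised-cosine window as the reference implementation below:

```python
import numpy as np

overlap_audio = 5 * 9600   # 5 latent frames of overlap, in audio samples

# Complementary raised-cosine fades.
fade = 0.5 - 0.5 * np.cos(np.linspace(0, np.pi, overlap_audio))
fade_in, fade_out = fade, 1.0 - fade

# The two fades sum to exactly 1 everywhere, so the crossfaded region keeps
# unit gain; naive (unweighted) overlap would sum two full-gain copies instead.
print(np.allclose(fade_in + fade_out, 1.0))   # True
```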

Reference decode_latent_coreml() implementation (CoreML)

import numpy as np

def decode_latent_coreml(mlmodel, latent, T=50, overlap=5):
    """Decode a [1, 64, total_T] latent in T-frame chunks with windowed overlap-add."""
    latent = latent.cpu().numpy()

    step = T - overlap          # latent frames advanced per chunk
    up = 9600                   # audio samples per latent frame (48000 Hz / 5 fps)
    hop_audio = step * up       # audio samples advanced per chunk

    audio_chunks = []

    for i in range(0, latent.shape[2], step):
        chunk = latent[:, :, i:i+T]
        valid_T = chunk.shape[2]

        # Pad the final chunk to exactly T frames by repeating its last frame.
        if valid_T < T:
            pad = chunk[:, :, -1:]
            pad = np.repeat(pad, T - valid_T, axis=2)
            chunk = np.concatenate([chunk, pad], axis=2)

        audio = mlmodel.predict({"latent": chunk})["clip_0"]
        audio_chunks.append(audio.squeeze().astype(np.float32))

        if i + T >= latent.shape[2]:
            break

    L = audio_chunks[0].shape[-1]    # samples per decoded chunk (T * up)
    overlap_audio = L - hop_audio    # samples shared between adjacent chunks

    # Raised-cosine crossfade: fade_in + fade_out == 1, so gain is preserved.
    fade = 0.5 - 0.5 * np.cos(np.linspace(0, np.pi, overlap_audio))
    fade_in = fade
    fade_out = 1.0 - fade

    out_len = hop_audio * (len(audio_chunks) - 1) + L
    out = np.zeros(out_len, dtype=np.float32)

    for i, a in enumerate(audio_chunks):
        start = i * hop_audio
        end = start + L

        if i > 0:
            a[:overlap_audio] *= fade_in
        if i < len(audio_chunks) - 1:
            a[-overlap_audio:] *= fade_out

        out[start:end] += a

    # Trim the extra audio introduced by padding the final chunk.
    true_len = latent.shape[2] * up
    return out[:true_len]

Saving audio (output array)

Using torchaudio:

import torch, torchaudio
torchaudio.save(
    "output.wav",
    torch.from_numpy(audio)[None, :],
    48000
)

Using pedalboard:

from pedalboard.io import AudioFile
with AudioFile("output.wav", "w", 48000, 1) as f:
    f.write(audio[None, :])

Limitations / caveats

  • Fixed input length (T = 50) per CoreML call
  • No internal chunking
  • Padding + trimming is mandatory for arbitrary-length latents
  • No INT8 / quantized variant
  • Overlap handling is entirely user-controlled