VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.

Model Components

Component	File	Description
Acoustic Encoder	`vibevoice_acoustic_encoder.mlpackage`	Encodes audio to acoustic latent (fixed input: 24000 samples at 24 kHz, i.e. 1 second)
Acoustic Decoder	`vibevoice_acoustic_decoder.mlpackage`	Decodes acoustic latent to audio
Semantic Encoder	`vibevoice_semantic_encoder.mlpackage`	Encodes audio to semantic latent (fixed input: 24000 samples at 24 kHz, i.e. 1 second)
Acoustic Connector	`vibevoice_acoustic_connector.mlpackage`	Projects acoustic latent to LLM hidden space
Semantic Connector	`vibevoice_semantic_connector.mlpackage`	Projects semantic latent to LLM hidden space
LLM	`vibevoice_llm.mlpackage`	Qwen2-1.5B-based language model
Diffusion Head	`vibevoice_diffusion_head.mlpackage`	Runs a single diffusion denoising step per call

Requirements

Platform: macOS 14.0+ or iOS 17.0+
Framework: CoreML (Apple Silicon recommended; Intel Macs have no Neural Engine)

Usage

Swift (iOS/macOS)

Use VibeVoicePipeline.swift with a directory containing all .mlpackage files:

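import Foundation

// Directory that holds all seven .mlpackage files listed under Model Components.
let modelDir = URL(fileURLWithPath: "/path/to/models")
// The initializer can throw, so call it from a throwing context or wrap it in do/catch.
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...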

Python (macOS only)

CoreML models can only be loaded and run on macOS, so the Python script does not work on Linux or Windows:

python inference.py --models-dir ./models --text "Hello world"
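
Both entry points expect one directory that contains every component, so it can help to check the directory before starting a run. The following is a minimal sketch, not part of the repository: the file names are copied from the Model Components table, while `REQUIRED_PACKAGES` and `missing_packages` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# File names copied from the Model Components table.
REQUIRED_PACKAGES = [
    "vibevoice_acoustic_encoder.mlpackage",
    "vibevoice_acoustic_decoder.mlpackage",
    "vibevoice_semantic_encoder.mlpackage",
    "vibevoice_acoustic_connector.mlpackage",
    "vibevoice_semantic_connector.mlpackage",
    "vibevoice_llm.mlpackage",
    "vibevoice_diffusion_head.mlpackage",
]


def missing_packages(models_dir):
    """Return the required .mlpackage names that are absent from models_dir.

    Hypothetical helper: it only checks that each entry exists on disk,
    not that the package is valid or loadable.
    """
    root = Path(models_dir)
    return [name for name in REQUIRED_PACKAGES if not (root / name).exists()]
```

Calling `missing_packages("./models")` returns an empty list when all seven packages are present, and otherwise the names that still need to be downloaded.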

Configuration

Audio: 24 kHz, 1 channel; encoder input fixed at 24000 samples (1 second). Trim or pad input before encoding.
Diffusion: 20 steps, cosine beta schedule, v_prediction.
LLM: 1536 hidden size, 28 layers (Qwen2-1.5B-based).
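
The fixed 24000-sample encoder input means every clip has to be trimmed or padded before encoding. The sketch below shows one way to do that on a plain list of mono float samples. Padding with zeros at the end and splitting longer audio into consecutive 1-second windows are assumptions made for illustration; the function names are hypothetical and the actual pipeline may handle this differently.

```python
SAMPLE_RATE = 24_000      # Hz, from the Configuration section
ENCODER_SAMPLES = 24_000  # fixed encoder input length (1 second at 24 kHz)


def fit_to_encoder(samples, length=ENCODER_SAMPLES):
    """Trim or zero-pad a list of mono samples to exactly `length` samples.

    Assumption: padding is silence (0.0) appended at the end.
    """
    clipped = list(samples[:length])
    return clipped + [0.0] * (length - len(clipped))


def split_into_windows(samples, length=ENCODER_SAMPLES):
    """Split audio into consecutive fixed-size windows, zero-padding the last one.

    Assumption: longer clips are processed window by window.
    """
    return [
        fit_to_encoder(samples[start:start + length], length)
        for start in range(0, len(samples), length)
    ]
```

A 50000-sample clip, for example, yields three windows: two full ones and a third holding the remaining 2000 samples followed by silence.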

See vibevoice_config.json and vibevoice_pipeline_config.json for full settings.
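
To make the diffusion settings above more concrete, here is a small sketch of a cosine beta schedule and of how a v-prediction output maps back to a clean sample. It uses the widely used cosine formulation with offset `s = 0.008` and a beta cap of 0.999; whether the config files use exactly these constants, and whether the 20 steps are a subset of a longer training schedule, is not stated here, so treat both as assumptions. All function names are hypothetical.

```python
import math


def cosine_alpha_bar(t, s=0.008):
    """Cosine curve for the cumulative signal fraction at normalized time t in [0, 1]."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(num_steps, max_beta=0.999):
    """Per-step betas derived from the cosine curve, capped at max_beta."""
    return [
        min(1 - cosine_alpha_bar((i + 1) / num_steps) / cosine_alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]


def cumulative_alpha_bars(betas):
    """Running product of (1 - beta), i.e. alpha_bar for each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def x0_from_v(x_t, v, alpha_bar):
    """Recover the clean sample from a v-prediction.

    With x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps and
    v = sqrt(ab) * eps - sqrt(1 - ab) * x0, it follows that
    x0 = sqrt(ab) * x_t - sqrt(1 - ab) * v.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1 - alpha_bar)
    return [a * x - b * vi for x, vi in zip(x_t, v)]
```

With `cosine_betas(20)`, noise is added slowly at first and quickly near the end of the schedule, which is the characteristic shape of the cosine schedule.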

Conversion Notes

Conversion was done with coremltools 9.0. The acoustic and semantic encoders use fixed-length inputs (24000 samples); the LLM was exported via torch.export with custom op registration. See CONVERSION_RESULTS.md for details.

License

Refer to the original VibeVoice model license.
