VibeVoice-Realtime-0.5B CoreML

CoreML conversion of microsoft/VibeVoice-Realtime-0.5B for iOS 18+.

Model Components

Model                 Original size   Quantized size (4-bit)
TTS LLM (Stateful)    829 MB          208 MB
Diffusion Head        80 MB           20 MB
Acoustic Decoder      656 MB          164 MB

Total: 1,565 MB (original) / 392 MB (quantized 4-bit)
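
The totals can be re-derived from the per-component sizes. The short Python sketch below copies the figures from the table and computes the overall reduction; the variable and function names are illustrative only and are not part of this repository.

```python
# Sizes in MB, copied from the table above: (original, 4-bit quantized).
SIZES_MB = {
    "TTS LLM (Stateful)": (829, 208),
    "Diffusion Head": (80, 20),
    "Acoustic Decoder": (656, 164),
}


def totals(sizes):
    """Return (original_total, quantized_total) in MB."""
    original = sum(orig for orig, _ in sizes.values())
    quantized = sum(quant for _, quant in sizes.values())
    return original, quantized


original_mb, quantized_mb = totals(SIZES_MB)
ratio = original_mb / quantized_mb
print(f"{original_mb} MB -> {quantized_mb} MB ({ratio:.2f}x smaller)")
# prints: 1565 MB -> 392 MB (3.99x smaller)
```

Each component shrinks by roughly 4x. That would be consistent with 16-bit weights being reduced to 4 bits, although the original precision is not stated here.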

Requirements

  • iOS 18+ (the TTS LLM is a stateful model, which requires iOS 18 or later)
  • CoreML framework

Files

  • vibevoice_tts_llm_quantized.mlpackage - TTS LLM (recommended)
  • vibevoice_diffusion_head_quantized.mlpackage - Diffusion Head
  • vibevoice_acoustic_decoder_quantized.mlpackage - Acoustic Decoder
  • conversion_report.md - Full conversion details
  • benchmark_results.json - Performance metrics
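
Before wiring the models into an app, it can help to confirm that all three packages were downloaded. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` and the directory passed to it are hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_PACKAGES = [
    "vibevoice_tts_llm_quantized.mlpackage",
    "vibevoice_diffusion_head_quantized.mlpackage",
    "vibevoice_acoustic_decoder_quantized.mlpackage",
]


def missing_packages(model_dir):
    """Return the expected .mlpackage names that are absent from model_dir."""
    root = Path(model_dir)
    # An .mlpackage is a directory bundle, so check exists() rather than is_file().
    return [name for name in EXPECTED_PACKAGES if not (root / name).exists()]
```

For example, `missing_packages("models")` returns an empty list when all three packages are present in a local `models` directory (an assumed location).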

Inference Pipeline

  1. Tokenize text (Qwen2 tokenizer)
  2. TTS LLM -> hidden states
  3. Diffusion Head (20 steps) -> acoustic latents
  4. Acoustic Decoder -> audio waveform (24 kHz)
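
The data flow of the four steps can be sketched as a plain Python function in which every stage is passed in as a callable. This is only an illustration of the ordering: the callables stand in for the tokenizer and the three models, their signatures are hypothetical, and running the 20 denoising steps once per hidden state is an assumption about how the Diffusion Head is conditioned, not something this repository documents.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE_HZ = 24_000  # from step 4 of the pipeline
DIFFUSION_STEPS = 20     # from step 3 of the pipeline


def synthesize(
    text: str,
    tokenize: Callable[[str], List[int]],
    tts_llm: Callable[[List[int]], Sequence],
    initial_noise: Callable[[], object],
    denoise_step: Callable[[object, object, int], object],
    decode: Callable[[List[object]], List[float]],
    steps: int = DIFFUSION_STEPS,
) -> List[float]:
    """Run the four pipeline stages in order (all stages are placeholders)."""
    token_ids = tokenize(text)           # 1. Qwen2 tokenizer
    hidden_states = tts_llm(token_ids)   # 2. TTS LLM -> hidden states
    latents = []
    for hidden in hidden_states:         # 3. Diffusion Head (assumed per-state loop)
        latent = initial_noise()
        for step in range(steps):
            latent = denoise_step(latent, hidden, step)
        latents.append(latent)
    return decode(latents)               # 4. Acoustic Decoder -> waveform


def duration_seconds(waveform, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length of a mono waveform in seconds at the given sample rate."""
    return len(waveform) / sample_rate_hz
```

Passing the stages as arguments keeps the sketch independent of any particular runtime; in an app, each callable would wrap a prediction call on the corresponding model.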

Source

Converted from microsoft/VibeVoice-Realtime-0.5B.
