How to use gafiatulin/vibevoice-asr-coreml with VibeVoice:
import torch, soundfile as sf, librosa, numpy as np
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
# Load voice sample (should be 24kHz mono)
voice, sr = sf.read("path/to/voice_sample.wav")
if voice.ndim > 1: voice = voice.mean(axis=1)
if sr != 24000: voice = librosa.resample(voice, orig_sr=sr, target_sr=24000)
processor = VibeVoiceProcessor.from_pretrained("gafiatulin/vibevoice-asr-coreml")
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "gafiatulin/vibevoice-asr-coreml", torch_dtype=torch.bfloat16
).to("cuda").eval()
model.set_ddpm_inference_steps(5)
inputs = processor(text=["Speaker 0: Hello!\nSpeaker 1: Hi there!"],
                   voice_samples=[[voice]], return_tensors="pt")
audio = model.generate(**inputs, cfg_scale=1.3,
                       tokenizer=processor.tokenizer).speech_outputs[0]
# Cast to float32 first: NumPy has no bfloat16 dtype
sf.write("output.wav", audio.float().cpu().numpy().squeeze(), 24000)

VibeVoice ASR (Qwen2-7B, 8.3B params): CoreML INT8 with a fused LM+head, fused encoder, and fused projector. Supports 50+ languages and 60-minute single-pass transcription.
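As a rough back-of-the-envelope check on what INT8 means for an 8.3B-parameter model, the sketch below estimates weight storage at 8 and 16 bits per weight. It assumes every parameter is stored at exactly the stated bit width, and it ignores quantization scales, tensors left unquantized, and runtime memory such as the KV cache, so treat the numbers as order-of-magnitude only.

```python
# Rough weight-storage estimate (assumption: uniform bit width for all weights).
PARAMS = 8.3e9  # parameter count quoted in the description above


def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


int8_gb = weight_bytes(PARAMS, 8) / 1e9    # one byte per weight
bf16_gb = weight_bytes(PARAMS, 16) / 1e9   # two bytes per weight
print(f"INT8: ~{int8_gb:.1f} GB, 16-bit: ~{bf16_gb:.1f} GB")
```

Under these assumptions, INT8 roughly halves the weight storage relative to a 16-bit checkpoint.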
Add vibevoice-coreml to your Swift package. Models auto-download from this repo on first use.
import VibeVoiceCoreML
let stt = try await SpeechToText()
let result = try await stt.transcribe(audioURL)
print(result.text)
See the GitHub repo for CLI usage, Python pipelines, and conversion scripts.
ct.StateType for stateful models)
.mlmodelc: no on-device compilation needed

fused_encoder.mlmodelc
fused_projector.mlmodelc
lm_decoder_fused_int8.mlmodelc
embed_tokens.bin
tokenizer.json
tokenizer_config.json

License: MIT (same as upstream VibeVoice models from Microsoft)
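If you download the files manually rather than relying on the Swift package's auto-download, a quick completeness check can catch a partial download. The sketch below is only an illustration: it assumes the files listed above sit flat in a single local directory, and `models/vibevoice-asr-coreml` is a hypothetical path, not one defined by this repo.

```python
from pathlib import Path

# File names taken from the list above; compiled .mlmodelc bundles are directories.
EXPECTED_FILES = [
    "fused_encoder.mlmodelc",
    "fused_projector.mlmodelc",
    "lm_decoder_fused_int8.mlmodelc",
    "embed_tokens.bin",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]


# Hypothetical local path; replace with wherever you stored the download.
missing = missing_files("models/vibevoice-asr-coreml")
print("All files present" if not missing else f"Missing: {missing}")
```

`Path.exists()` is used instead of `is_file()` so that both plain files and `.mlmodelc` bundle directories pass the check.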