# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.
## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to acoustic latent (fixed input: 24000 samples @ 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes acoustic latent to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects acoustic latent to LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects semantic latent to LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Single-step diffusion denoising |
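All seven packages must be present before a pipeline can be constructed. A minimal sketch for verifying a model directory up front (file names taken from the table above; the helper name is hypothetical):

```python
from pathlib import Path

# The seven .mlpackage components listed in the table above.
COMPONENT_FILES = [
    "vibevoice_acoustic_encoder.mlpackage",
    "vibevoice_acoustic_decoder.mlpackage",
    "vibevoice_semantic_encoder.mlpackage",
    "vibevoice_acoustic_connector.mlpackage",
    "vibevoice_semantic_connector.mlpackage",
    "vibevoice_llm.mlpackage",
    "vibevoice_diffusion_head.mlpackage",
]

def missing_components(models_dir):
    """Return the component packages not found under models_dir."""
    root = Path(models_dir)
    return [name for name in COMPONENT_FILES if not (root / name).exists()]
```

Running this check before loading gives a clearer error than a failed model load partway through the pipeline.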
## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML; Apple Silicon is recommended for Neural Engine acceleration (Intel Macs fall back to CPU/GPU)
## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```
### Python (macOS only)

CoreML models require macOS to load and run:

```bash
python inference.py --models-dir ./models --text "Hello world"
```
## Configuration

- **Audio**: 24 kHz, 1 channel; encoder input fixed at 24000 samples (1 second). Trim or pad input before encoding.
- **Diffusion**: 20 steps, cosine beta schedule, v_prediction.
- **LLM**: hidden size 1536, 28 layers (Qwen2-1.5B-based).
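The 20-step cosine beta schedule can be reproduced with the standard ᾱ-based construction (Nichol & Dhariwal); this is a generic sketch of the schedule type named above, not the exported model's exact tensor values:

```python
import math

def cosine_betas(num_steps=20, s=0.008, max_beta=0.999):
    """Cosine schedule: betas derived from alpha_bar(t) = cos^2(pi/2 * (t+s)/(1+s)).

    Under v_prediction, the network predicts
    v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0
    rather than the noise or x0 directly.
    """
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for i in range(num_steps):
        t0, t1 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1 - alpha_bar(t1) / alpha_bar(t0), max_beta))
    return betas
```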
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
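Because the encoders accept exactly one second of 24 kHz mono audio, inputs must be trimmed or zero-padded to 24000 samples before encoding; a minimal NumPy sketch (the helper name is hypothetical):

```python
import numpy as np

ENCODER_SAMPLES = 24000  # 1 second at 24 kHz, fixed at conversion time

def fit_to_encoder(audio):
    """Trim or zero-pad a mono float waveform to exactly 24000 samples."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.shape[0] >= ENCODER_SAMPLES:
        return audio[:ENCODER_SAMPLES]
    pad = ENCODER_SAMPLES - audio.shape[0]
    return np.pad(audio, (0, pad))
```

Longer clips can be encoded in consecutive one-second windows of this size.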
## Conversion Notes

Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.
## License

Refer to the original VibeVoice model license.