# VibeVoice 1.5 CoreML
VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.
## Model Components
| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to acoustic latent (fixed input: 24000 samples @ 24kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes acoustic latent to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects acoustic latent to LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects semantic latent to LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Single-step diffusion denoising |
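The table above implies a fixed call order: encoders produce latents, connectors project them into the LLM hidden space, the LLM produces conditioning states, the diffusion head refines an acoustic latent, and the decoder renders audio. A schematic sketch of that flow in plain Python (all function names, return values, and shapes here are placeholders for illustration only; the real CoreML input/output names come from the `.mlpackage` specs):

```python
# Schematic VibeVoice data flow with stand-in functions. Everything below
# is a placeholder illustrating call order, not the actual CoreML I/O.

def encode(audio):     # acoustic + semantic encoders (24000-sample window)
    return {"acoustic": [0.0], "semantic": [0.0]}

def connect(latents):  # connectors project latents into the LLM hidden space
    return latents["acoustic"] + latents["semantic"]

def run_llm(hidden):   # Qwen2-1.5B-based LLM yields conditioning states
    return hidden

def denoise(cond):     # diffusion head refines an acoustic latent stepwise
    return cond

def decode(latent):    # acoustic decoder renders the output waveform
    return latent

waveform = decode(denoise(run_llm(connect(encode([0.0] * 24_000)))))
```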
## Requirements
- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML (Apple Silicon recommended; the Neural Engine accelerates inference)
## Usage
### Swift (iOS/macOS)
Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:
```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```
### Python (macOS only)
CoreML models require macOS to load and run:
```bash
python inference.py --models-dir ./models --text "Hello world"
```
## Configuration
- **Audio**: 24 kHz, 1 channel; encoder input fixed at 24000 samples (1 second). Trim or pad input before encoding.
- **Diffusion**: 20 denoising steps (one pass through the diffusion head per step), cosine beta schedule, v-prediction objective.
- **LLM**: 1536 hidden size, 28 layers (Qwen2-1.5B-based).
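Because the encoder input length is fixed at 24000 samples, audio must be trimmed or zero-padded to exactly one second before encoding. A minimal sketch with NumPy (the function name is illustrative, not part of the shipped pipeline):

```python
import numpy as np

ENCODER_INPUT_SAMPLES = 24_000  # fixed 1-second window at 24 kHz, mono


def fit_to_window(audio: np.ndarray) -> np.ndarray:
    """Trim or zero-pad a mono waveform to the encoder's fixed input length."""
    if audio.shape[0] >= ENCODER_INPUT_SAMPLES:
        return audio[:ENCODER_INPUT_SAMPLES]
    return np.pad(audio, (0, ENCODER_INPUT_SAMPLES - audio.shape[0]))


short = fit_to_window(np.ones(10_000, dtype=np.float32))  # padded with zeros
long = fit_to_window(np.ones(30_000, dtype=np.float32))   # trimmed
```

Longer utterances would be processed as successive one-second windows.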
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
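For reference, a 20-step cosine beta schedule can be computed as below. This follows the standard cosine schedule formulation (squared-cosine alpha-bar with a small offset `s`); the exact offset and clipping values used by VibeVoice are assumptions, so treat this as a sketch, not the shipped schedule. The trailing lines illustrate the v-prediction parameterization numerically:

```python
import math


def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level alpha_bar(t) for a cosine noise schedule."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(num_steps: int = 20, max_beta: float = 0.999) -> list[float]:
    """Per-step betas derived from consecutive alpha_bar ratios, clipped."""
    return [
        min(1 - cosine_alpha_bar((i + 1) / num_steps) / cosine_alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]


betas = cosine_betas()

# v-prediction parameterization: v = sqrt(a)*eps - sqrt(1-a)*x0,
# so x0 is recovered from x_t and v as sqrt(a)*x_t - sqrt(1-a)*v.
a, x0, eps = 0.9, 1.0, -0.5
x_t = math.sqrt(a) * x0 + math.sqrt(1 - a) * eps
v = math.sqrt(a) * eps - math.sqrt(1 - a) * x0
x0_recovered = math.sqrt(a) * x_t - math.sqrt(1 - a) * v
```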
## Conversion Notes
Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.
## License
Refer to the original VibeVoice model license.