# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.

## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to acoustic latent (fixed input: 24000 samples, i.e. 1 s at 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes acoustic latent to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to semantic latent (fixed input: 24000 samples, i.e. 1 s at 24 kHz) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects acoustic latent to LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects semantic latent to LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Diffusion denoising, one step per call |

## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML (Apple Silicon recommended; Intel Macs have no Neural Engine, so models run on CPU/GPU only)

## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```

### Python (macOS only)

CoreML models require macOS to load and run:

```bash
python inference.py --models-dir ./models --text "Hello world"
```
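Before running either pipeline, it can help to confirm that the models directory is complete. The following is a minimal sketch, not part of the shipped scripts: `missing_packages` is a hypothetical helper, and the file names are copied from the Model Components table.

```python
from pathlib import Path

# File names taken from the Model Components table.
EXPECTED_PACKAGES = [
    "vibevoice_acoustic_encoder.mlpackage",
    "vibevoice_acoustic_decoder.mlpackage",
    "vibevoice_semantic_encoder.mlpackage",
    "vibevoice_acoustic_connector.mlpackage",
    "vibevoice_semantic_connector.mlpackage",
    "vibevoice_llm.mlpackage",
    "vibevoice_diffusion_head.mlpackage",
]


def missing_packages(models_dir):
    """Return the expected .mlpackage names that are absent from models_dir."""
    root = Path(models_dir)
    # An .mlpackage is a directory bundle, so check for a directory, not a file.
    return [name for name in EXPECTED_PACKAGES if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_packages("./models")
    if missing:
        print("Missing model packages:", ", ".join(missing))
    else:
        print("All model packages found.")
```

An empty list means all seven packages are present. The check only looks at names and does not validate the package contents.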

## Configuration

- **Audio**: 24 kHz, 1 channel; encoder input fixed at 24000 samples (1 second). Trim or pad input before encoding.
- **Diffusion**: 20 steps, cosine beta schedule, v_prediction.
- **LLM**: 1536 hidden size, 28 layers (Qwen2-1.5B-based).
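Because the encoders accept exactly 24000 samples, input audio has to be brought to that length first. A minimal sketch of one way to do it follows; `fit_to_length` is a hypothetical helper, and keeping the first second and zero-padding the tail are assumptions, since the padding strategy is not specified here.

```python
ENCODER_SAMPLES = 24000  # 1 second at 24 kHz, the fixed encoder input length


def fit_to_length(samples, target=ENCODER_SAMPLES):
    """Trim or zero-pad a mono sample sequence to exactly `target` samples."""
    samples = list(samples)
    if len(samples) >= target:
        # Assumption: keep the first `target` samples and drop the rest.
        return samples[:target]
    # Assumption: pad the tail with silence (zeros).
    return samples + [0.0] * (target - len(samples))
```

For example, `fit_to_length([0.1] * 30000)` returns the first 24000 samples, while a 10000-sample clip comes back with 14000 trailing zeros. Audio at a different sample rate would need resampling to 24 kHz before this step.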

See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
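To illustrate the diffusion settings, here is a sketch of the standard cosine beta schedule and the v-prediction identities, written in plain Python. It shows the underlying math only and is not the pipeline's scheduler: the offset `s=0.008`, the `max_beta` clip, and the function names are assumptions based on the commonly used formulation, and the number of training timesteps is not stated here, so `num_steps` is left as a parameter.

```python
import math


def cosine_alpha_bar(t, s=0.008):
    """Cumulative signal level alpha_bar at continuous time t in [0, 1]."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(num_steps, max_beta=0.999):
    """Discrete betas whose running product follows the cosine alpha_bar curve."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        beta = 1 - cosine_alpha_bar(t2) / cosine_alpha_bar(t1)
        betas.append(min(beta, max_beta))  # clip to avoid beta reaching 1 at the end
    return betas


def v_to_x0_eps(x_t, v, alpha_bar):
    """Recover the clean sample and noise from a v-prediction.

    Uses x_t = a * x0 + b * eps and v = a * eps - b * x0,
    with a = sqrt(alpha_bar) and b = sqrt(1 - alpha_bar).
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1 - alpha_bar)
    x0 = a * x_t - b * v
    eps = b * x_t + a * v
    return x0, eps
```

With v-prediction the network output is neither the noise nor the clean latent directly; a sampler converts it with identities like those in `v_to_x0_eps` at each of the 20 steps.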

## Conversion Notes

Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.

## License

Refer to the original VibeVoice model license.