# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.

## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to an acoustic latent (fixed input: 24000 samples @ 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes an acoustic latent back to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to a semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects the acoustic latent into the LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects the semantic latent into the LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Single-step diffusion denoising |

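Under this split, a generation pass wires the components together roughly as follows. This is a hypothetical NumPy sketch of the dataflow only: the latent width (`LATENT = 64`) and the stand-in linear maps are invented for illustration; only the LLM hidden size (1536) and the 24000-sample encoder input come from this repo's configuration.

```python
import numpy as np

# Assumed sizes -- LATENT is made up; SAMPLES and HIDDEN match this repo.
SAMPLES, LATENT, HIDDEN = 24000, 64, 1536

rng = np.random.default_rng(0)

def stage(d_in, d_out):
    """Stand-in for one CoreML component: a fixed random linear map."""
    w = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    return lambda x: x @ w

acoustic_encoder   = stage(SAMPLES, LATENT)
semantic_encoder   = stage(SAMPLES, LATENT)
acoustic_connector = stage(LATENT, HIDDEN)
semantic_connector = stage(LATENT, HIDDEN)
llm                = stage(HIDDEN, HIDDEN)
diffusion_head     = stage(HIDDEN, LATENT)
acoustic_decoder   = stage(LATENT, SAMPLES)

audio = rng.standard_normal((1, SAMPLES))           # 1 second @ 24 kHz
cond = acoustic_connector(acoustic_encoder(audio)) + \
       semantic_connector(semantic_encoder(audio))  # voice conditioning
hidden = llm(cond)                                  # LLM hidden states
latent = diffusion_head(hidden)                     # hidden -> acoustic latent
out = acoustic_decoder(latent)                      # latent -> waveform
print(out.shape)  # (1, 24000)
```
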
## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML (Apple Silicon recommended for Neural Engine acceleration; Intel Macs fall back to CPU/GPU)

## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```

### Python (macOS only)

Loading and running CoreML models from Python requires macOS:

```bash
python inference.py --models-dir ./models --text "Hello world"
```

## Configuration

- **Audio**: 24 kHz, mono; encoder input fixed at 24000 samples (1 second). Trim or pad audio before encoding.
- **Diffusion**: 20 inference steps, cosine beta schedule, `v_prediction` objective.
- **LLM**: hidden size 1536, 28 layers (Qwen2-1.5B-based).

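Because the encoders accept exactly 24000 samples, audio must be fitted to that length first. A minimal sketch of the trim-or-pad step (the helper name `fit_to_encoder` is ours, not part of this repo):

```python
import numpy as np

ENCODER_SAMPLES = 24000  # fixed encoder input: 1 second @ 24 kHz

def fit_to_encoder(audio: np.ndarray) -> np.ndarray:
    """Trim audio longer than 1 s; zero-pad audio shorter than 1 s."""
    if len(audio) >= ENCODER_SAMPLES:
        return audio[:ENCODER_SAMPLES]
    return np.pad(audio, (0, ENCODER_SAMPLES - len(audio)))

short = fit_to_encoder(np.ones(8000))    # padded to 24000
long_ = fit_to_encoder(np.ones(50000))   # trimmed to 24000
print(short.shape, long_.shape)  # (24000,) (24000,)
```
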
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
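
The diffusion settings above can be sketched numerically. This is a generic cosine alpha-bar schedule in the style of Nichol & Dhariwal plus the standard v-prediction reconstruction, not code from this repo; the 20-step count matches the pipeline config, and the rest is illustrative:

```python
import numpy as np

STEPS = 20  # inference steps from the pipeline config

# Cosine alpha-bar schedule, discretized into per-step betas.
t = np.arange(STEPS + 1) / STEPS
alpha_bar = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
betas = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, 0.999)
alphas_cumprod = np.cumprod(1 - betas)

# With v_prediction, the model outputs v; the clean latent is recovered as
#   x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v
a = alphas_cumprod[5]
x_t, v = 0.7, -0.2  # toy scalars standing in for latent tensors
x0 = np.sqrt(a) * x_t - np.sqrt(1 - a) * v
print(betas[:3], x0)
```
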

## Conversion Notes

Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.

## License

Refer to the original VibeVoice model license.