# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.

## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to an acoustic latent (fixed input: 24000 samples @ 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes an acoustic latent back to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to a semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects the acoustic latent into the LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects the semantic latent into the LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Single-step diffusion denoising |

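Under this split, a generation pass wires the components together roughly as follows. This is a hypothetical NumPy sketch of the dataflow only: the latent width (`LATENT = 64`) and the stand-in linear maps are invented for illustration; only the LLM hidden size (1536) and the 24000-sample encoder input come from this repo's configuration.

```python
import numpy as np

# Assumed sizes -- LATENT is made up; SAMPLES and HIDDEN match this repo.
SAMPLES, LATENT, HIDDEN = 24000, 64, 1536

rng = np.random.default_rng(0)

def stage(d_in, d_out):
    """Stand-in for one CoreML component: a fixed random linear map."""
    w = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    return lambda x: x @ w

acoustic_encoder   = stage(SAMPLES, LATENT)
semantic_encoder   = stage(SAMPLES, LATENT)
acoustic_connector = stage(LATENT, HIDDEN)
semantic_connector = stage(LATENT, HIDDEN)
llm                = stage(HIDDEN, HIDDEN)
diffusion_head     = stage(HIDDEN, LATENT)
acoustic_decoder   = stage(LATENT, SAMPLES)

audio = rng.standard_normal((1, SAMPLES))           # 1 second @ 24 kHz
cond = acoustic_connector(acoustic_encoder(audio)) + \
       semantic_connector(semantic_encoder(audio))  # voice conditioning
hidden = llm(cond)                                  # LLM hidden states
latent = diffusion_head(hidden)                     # hidden -> acoustic latent
out = acoustic_decoder(latent)                      # latent -> waveform
print(out.shape)  # (1, 24000)
```
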
## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML (Apple Silicon recommended for Neural Engine acceleration; Intel Macs fall back to CPU/GPU)

## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```

### Python (macOS only)

Loading and running CoreML models from Python requires macOS:

```bash
python inference.py --models-dir ./models --text "Hello world"
```

## Configuration

- **Audio**: 24 kHz, mono; encoder input fixed at 24000 samples (1 second). Trim or pad audio before encoding.
- **Diffusion**: 20 inference steps, cosine beta schedule, `v_prediction` objective.
- **LLM**: hidden size 1536, 28 layers (Qwen2-1.5B-based).

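Because the encoders accept exactly 24000 samples, audio must be fitted to that length first. A minimal sketch of the trim-or-pad step (the helper name `fit_to_encoder` is ours, not part of this repo):

```python
import numpy as np

ENCODER_SAMPLES = 24000  # fixed encoder input: 1 second @ 24 kHz

def fit_to_encoder(audio: np.ndarray) -> np.ndarray:
    """Trim audio longer than 1 s; zero-pad audio shorter than 1 s."""
    if len(audio) >= ENCODER_SAMPLES:
        return audio[:ENCODER_SAMPLES]
    return np.pad(audio, (0, ENCODER_SAMPLES - len(audio)))

short = fit_to_encoder(np.ones(8000))    # padded to 24000
long_ = fit_to_encoder(np.ones(50000))   # trimmed to 24000
print(short.shape, long_.shape)  # (24000,) (24000,)
```
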
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
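
The diffusion settings above can be sketched numerically. This is a generic cosine alpha-bar schedule in the style of Nichol & Dhariwal plus the standard v-prediction reconstruction, not code from this repo; the 20-step count matches the pipeline config, and the rest is illustrative:

```python
import numpy as np

STEPS = 20  # inference steps from the pipeline config

# Cosine alpha-bar schedule, discretized into per-step betas.
t = np.arange(STEPS + 1) / STEPS
alpha_bar = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
betas = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, 0.999)
alphas_cumprod = np.cumprod(1 - betas)

# With v_prediction, the model outputs v; the clean latent is recovered as
#   x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v
a = alphas_cumprod[5]
x_t, v = 0.7, -0.2  # toy scalars standing in for latent tensors
x0 = np.sqrt(a) * x_t - np.sqrt(1 - a) * v
print(betas[:3], x0)
```
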

## Conversion Notes

Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.

## License

Refer to the original VibeVoice model license.