# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.

## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to acoustic latent (fixed input: 24000 samples @ 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes acoustic latent back to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects acoustic latent into the LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects semantic latent into the LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Performs a single diffusion denoising step (applied iteratively; see Configuration) |

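The table above implies seven `.mlpackage` bundles per model directory. As a quick pre-flight check, a small Python helper (hypothetical, not part of this repo) can report which bundles are missing before any loading is attempted:

```python
from pathlib import Path

# Bundle names taken from the component table above.
EXPECTED_PACKAGES = [
    "vibevoice_acoustic_encoder.mlpackage",
    "vibevoice_acoustic_decoder.mlpackage",
    "vibevoice_semantic_encoder.mlpackage",
    "vibevoice_acoustic_connector.mlpackage",
    "vibevoice_semantic_connector.mlpackage",
    "vibevoice_llm.mlpackage",
    "vibevoice_diffusion_head.mlpackage",
]

def missing_packages(models_dir):
    """Return the expected .mlpackage bundles not present in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_PACKAGES if not (root / name).exists()]
```

Running this against the model directory before constructing the pipeline gives a clearer error than a failed model load.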
## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML (Apple Silicon recommended; the Neural Engine is used where available)

## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run the LLM, run diffusion, then decode...
```

### Python (macOS only)

The Python path requires macOS, since CoreML models can only be loaded and run there:

```bash
python inference.py --models-dir ./models --text "Hello world"
```

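`inference.py` itself is not reproduced here; a minimal argument-parsing sketch consistent with the invocation above (the real script may accept additional flags) could look like:

```python
import argparse

def build_parser():
    # Hypothetical CLI skeleton matching the example invocation shown above.
    p = argparse.ArgumentParser(description="VibeVoice CoreML inference")
    p.add_argument("--models-dir", required=True,
                   help="Directory containing the .mlpackage bundles")
    p.add_argument("--text", required=True,
                   help="Text to synthesize")
    return p

# Example: parse the same arguments as the shell command above.
args = build_parser().parse_args(["--models-dir", "./models", "--text", "Hello world"])
```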
## Configuration

- **Audio**: 24 kHz, mono; encoder input is fixed at 24000 samples (1 second). Trim or pad input before encoding.
- **Diffusion**: 20 steps, cosine beta schedule, `v_prediction`.
- **LLM**: hidden size 1536, 28 layers (Qwen2-1.5B-based).

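The trim-or-pad step mentioned above is straightforward; a minimal sketch (hypothetical helper, operating on a mono sample list) is:

```python
def fit_to_window(samples, target_len=24000):
    """Trim or zero-pad a mono sample list to exactly target_len samples.

    The encoders take a fixed 24000-sample (1 s @ 24 kHz) input, so every
    clip must be cut or padded before encoding; longer audio would be
    processed one window at a time.
    """
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))
```
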
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for the full settings.

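The scheduler itself is not included in this repo; a sketch of a standard cosine beta schedule and the `v_prediction` parameterization consistent with the settings above (assuming the usual Nichol & Dhariwal formulation) is:

```python
import math

def cosine_alpha_bar(t, s=0.008):
    # Cumulative signal level alpha_bar(t) for t in [0, 1]
    # (standard cosine schedule; an assumption, not read from the repo).
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def make_betas(num_steps=20, max_beta=0.999):
    # Discretize alpha_bar into per-step betas:
    #   beta_i = 1 - alpha_bar(t_{i+1}) / alpha_bar(t_i)
    return [
        min(1 - cosine_alpha_bar((i + 1) / num_steps) / cosine_alpha_bar(i / num_steps),
            max_beta)
        for i in range(num_steps)
    ]

def x0_from_v(x_t, v, alpha_bar):
    # v-prediction: the model predicts v; the clean sample is recovered as
    #   x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1 - alpha_bar) * v
```
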
## Conversion Notes

Conversion was done with coremltools 9.0. The acoustic and semantic encoders use fixed-length inputs (24000 samples); the LLM was exported via `torch.export` plus custom op registration. See `CONVERSION_RESULTS.md` for details.

## License

Refer to the original VibeVoice model license.