# VibeVoice 1.5 CoreML

VibeVoice 1.5B text-to-speech model converted to Apple CoreML format for on-device inference on macOS and iOS.
## Model Components

| Component | File | Description |
|-----------|------|-------------|
| Acoustic Encoder | `vibevoice_acoustic_encoder.mlpackage` | Encodes audio to acoustic latent (fixed input: 24000 samples @ 24 kHz) |
| Acoustic Decoder | `vibevoice_acoustic_decoder.mlpackage` | Decodes acoustic latent to audio |
| Semantic Encoder | `vibevoice_semantic_encoder.mlpackage` | Encodes audio to semantic latent (fixed input: 24000 samples) |
| Acoustic Connector | `vibevoice_acoustic_connector.mlpackage` | Projects acoustic latent to LLM hidden space |
| Semantic Connector | `vibevoice_semantic_connector.mlpackage` | Projects semantic latent to LLM hidden space |
| LLM | `vibevoice_llm.mlpackage` | Qwen2-1.5B-based language model |
| Diffusion Head | `vibevoice_diffusion_head.mlpackage` | Single-step diffusion denoising |
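All seven packages must be present before a pipeline can be constructed. A minimal sketch for verifying a model directory up front (file names taken from the table above; the helper name is hypothetical):

```python
from pathlib import Path

# The seven .mlpackage components listed in the table above.
COMPONENT_FILES = [
    "vibevoice_acoustic_encoder.mlpackage",
    "vibevoice_acoustic_decoder.mlpackage",
    "vibevoice_semantic_encoder.mlpackage",
    "vibevoice_acoustic_connector.mlpackage",
    "vibevoice_semantic_connector.mlpackage",
    "vibevoice_llm.mlpackage",
    "vibevoice_diffusion_head.mlpackage",
]

def missing_components(models_dir):
    """Return the component packages not found under models_dir."""
    root = Path(models_dir)
    return [name for name in COMPONENT_FILES if not (root / name).exists()]
```

Running this check before loading gives a clearer error than a failed model load partway through the pipeline.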
## Requirements

- **Platform**: macOS 14.0+ or iOS 17.0+
- **Framework**: CoreML; Apple Silicon is recommended for Neural Engine acceleration (Intel Macs fall back to CPU/GPU)
## Usage

### Swift (iOS/macOS)

Use `VibeVoicePipeline.swift` with a directory containing all `.mlpackage` files:

```swift
let modelDir = URL(fileURLWithPath: "/path/to/models")
let pipeline = try VibeVoicePipeline(modelDirectory: modelDir)
// Encode, run LLM, diffusion, decode...
```
### Python (macOS only)

CoreML models require macOS to load and run:

```bash
python inference.py --models-dir ./models --text "Hello world"
```
## Configuration

- **Audio**: 24 kHz, 1 channel; encoder input fixed at 24000 samples (1 second). Trim or pad input before encoding.
- **Diffusion**: 20 steps, cosine beta schedule, v_prediction.
- **LLM**: hidden size 1536, 28 layers (Qwen2-1.5B-based).
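The 20-step cosine beta schedule can be reproduced with the standard ᾱ-based construction (Nichol & Dhariwal); this is a generic sketch of the schedule type named above, not the exported model's exact tensor values:

```python
import math

def cosine_betas(num_steps=20, s=0.008, max_beta=0.999):
    """Cosine schedule: betas derived from alpha_bar(t) = cos^2(pi/2 * (t+s)/(1+s)).

    Under v_prediction, the network predicts
    v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0
    rather than the noise or x0 directly.
    """
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for i in range(num_steps):
        t0, t1 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1 - alpha_bar(t1) / alpha_bar(t0), max_beta))
    return betas
```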
See `vibevoice_config.json` and `vibevoice_pipeline_config.json` for full settings.
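Because the encoders accept exactly one second of 24 kHz mono audio, inputs must be trimmed or zero-padded to 24000 samples before encoding; a minimal NumPy sketch (the helper name is hypothetical):

```python
import numpy as np

ENCODER_SAMPLES = 24000  # 1 second at 24 kHz, fixed at conversion time

def fit_to_encoder(audio):
    """Trim or zero-pad a mono float waveform to exactly 24000 samples."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.shape[0] >= ENCODER_SAMPLES:
        return audio[:ENCODER_SAMPLES]
    pad = ENCODER_SAMPLES - audio.shape[0]
    return np.pad(audio, (0, pad))
```

Longer clips can be encoded in consecutive one-second windows of this size.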
## Conversion Notes

Conversion was done with coremltools 9.0. Acoustic and semantic encoders use fixed-length (24000 samples) inputs; the LLM was exported via `torch.export` + custom op registration. See `CONVERSION_RESULTS.md` for details.
## License

Refer to the original VibeVoice model license.