Upload folder using huggingface_hub

Files changed (6)

README.md +115 -0
aligner/weights.safetensors +3 -0
decoder/weights.safetensors +3 -0
encoder/weights.safetensors +3 -0
model/config.json +49 -0
model/weights.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,115 @@

+---
+license: llama3.2
+library_name: mlx
+language:
+  - en
+tags:
+  - mlx
+  - tts
+  - text-to-speech
+  - speech-synthesis
+  - tada
+  - apple-silicon
+pipeline_tag: text-to-speech
+base_model: meta-llama/Llama-3.2-1B
+---
+# MLX-TADA-1B
+Pre-converted [MLX](https://github.com/ml-explore/mlx) weights for [TADA](https://github.com/HumeAI/tada) (Text-Acoustic Dual Alignment) speech synthesis on Apple Silicon.
+Built on [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B). English only.
+| Component | File | Size |
+|-----------|------|------|
+| LLM + VibeVoice head | `model/weights.safetensors` | 3.0 GB |
+| Aligner | `aligner/weights.safetensors` | 852 MB |
+| Decoder (DAC) | `decoder/weights.safetensors` | 226 MB |
+| Encoder | `encoder/weights.safetensors` | 178 MB |
+| **Total** | | **~4.3 GB** |
+All weights are stored in bfloat16 safetensors format.
+## Quick Start
+```bash
+git clone https://github.com/HumeAI/tada.git
+cd tada/apple
+uv venv && uv pip install -e .
+```
+### Option A: Download pre-converted weights (this repo)
+```python
+from huggingface_hub import snapshot_download
+snapshot_download("HumeAI/mlx-tada-1b", local_dir="./weights/1b")
+```
+Then run:
+```python
+from mlx_tada import TadaForCausalLM, save_wav
+model = TadaForCausalLM.from_weights("./weights/1b", quantize=4)
+ref = model.load_reference("speaker.wav")
+out = model.generate("Hello, this is a test of TADA speech synthesis.", ref)
+save_wav(out.audio, "output.wav")
+```
+### Option B: Use `from_pretrained` (auto-downloads)
+```python
+from mlx_tada import TadaForCausalLM, save_wav
+model = TadaForCausalLM.from_pretrained("HumeAI/mlx-tada-1b", quantize=4)
+ref = model.load_reference("speaker.wav")
+out = model.generate("Hello, this is a test of TADA speech synthesis.", ref)
+save_wav(out.audio, "output.wav")
+```
+### CLI
+```bash
+uv run python -m mlx_tada.generate \
+  --weights ./weights/1b \
+  --audio speaker.wav \
+  --text "Hello, this is a test of TADA speech synthesis." \
+  --quantize 4 \
+  --output output.wav
+```
+## Hardware Requirements
+| Precision | Memory |
+|-----------|--------|
+| bfloat16 (default) | ~8 GB |
+| 4-bit quantized | ~3 GB |
+Tested on Apple M1 Pro and above. 4-bit quantization is recommended for most devices: it is roughly 10x faster than bfloat16 and uses about 60% less memory (~3 GB versus ~8 GB), with minimal quality loss.
+## Convert Weights Yourself
+If you prefer to convert from the original PyTorch weights (requires [gated Llama access](https://huggingface.co/meta-llama/Llama-3.2-1B)):
+```bash
+cd tada/apple
+uv pip install -e ".[convert]"
+huggingface-cli login
+uv run python -m mlx_tada.convert_1b ./weights/1b
+```
+## Related
+- [TADA GitHub](https://github.com/HumeAI/tada) — source code, PyTorch inference, training
+- [TADA Paper](https://arxiv.org/abs/2602.23068) — arxiv
+- [HumeAI/tada-1b](https://huggingface.co/HumeAI/tada-1b) — PyTorch weights
+- [HumeAI/mlx-tada-3b](https://huggingface.co/HumeAI/mlx-tada-3b) — 3B multilingual MLX weights
+- [HumeAI/tada-codec](https://huggingface.co/HumeAI/tada-codec) — shared encoder, decoder, aligner weights
+## License
+This model is built with [Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-1B) and is released under the [Llama 3.2 Community License Agreement](https://github.com/HumeAI/tada/blob/main/LICENSE).
+> Llama 3.2 is licensed under the Llama 3.2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
+Built with Llama.

aligner/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:af2e603bd1f76bf33dbaf0ebe1d65f7024641d28e2887c703eadb7a3cda1316e
+size 893830649
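
Each `weights.safetensors` entry in this commit is a Git LFS pointer that records the real file's SHA-256 digest (`oid`) and byte size. A downloaded copy can be checked against those two fields. The sketch below is a minimal illustration, not part of the TADA tooling; the helper names `sha256_of` and `matches_pointer` are hypothetical, and the local path assumes the `./weights/1b` download directory used in the README.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB weights never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path: Path, oid: str, size: int) -> bool:
    """True if the file has the byte size and SHA-256 recorded in the LFS pointer."""
    if path.stat().st_size != size:
        return False  # cheap check first; skips hashing on an obvious mismatch
    return sha256_of(path) == oid.removeprefix("sha256:")


# Assumed local path; the oid and size are copied from the aligner pointer above.
aligner = Path("./weights/1b/aligner/weights.safetensors")
if aligner.exists():
    ok = matches_pointer(
        aligner,
        "sha256:af2e603bd1f76bf33dbaf0ebe1d65f7024641d28e2887c703eadb7a3cda1316e",
        893830649,
    )
    print("aligner weights verified" if ok else "aligner weights do NOT match pointer")
```

Comparing the size before hashing keeps the check fast for truncated downloads, which are the most common failure.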

decoder/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:40310d1e93460f2bea9b77b83dfafe11a5ffbf5dc36224b4ca89db20c1776fcb
+size 237407562

encoder/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5732c5a73f42475f620a6a8dba36f404e58fc3b4bae1f1766503a1448e062970
+size 186606332

model/config.json ADDED

	@@ -0,0 +1,49 @@

+{
+  "acoustic_dim": 512,
+  "acoustic_from_nth_hidden_state": -1,
+  "acoustic_mean": 0.0,
+  "acoustic_std": 1.5,
+  "add_semantic_to_condition": 0.0,
+  "architectures": [
+    "TadaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 128000,
+  "bottleneck_dim": null,
+  "context_window": 8,
+  "diffusion_head_type": "vibevoice",
+  "dist_type": "fixed",
+  "dtype": "bfloat16",
+  "eos_token_id": 128001,
+  "head_dim": 64,
+  "head_ffn_ratio": 4.0,
+  "head_layers": 6,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 8192,
+  "latent_dropout": 0.0,
+  "max_position_embeddings": 131072,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 16,
+  "num_key_value_heads": 8,
+  "num_time_classes": 256,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": {
+    "factor": 32.0,
+    "high_freq_factor": 4.0,
+    "low_freq_factor": 1.0,
+    "original_max_position_embeddings": 8192,
+    "rope_type": "llama3"
+  },
+  "rope_theta": 500000.0,
+  "shift_acoustic": 5,
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.3",
+  "use_cache": true,
+  "vocab_size": 128256
+}
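
As a sanity check on the config above, the Llama backbone's parameter count can be derived from its dimensions. The sketch below assumes the standard Llama decoder layout (q/k/v/o projections, a gated three-matrix MLP, two RMSNorm weights per layer, no biases, tied embeddings); it covers only the language-model backbone and ignores the TADA-specific fields such as `head_layers` and `acoustic_dim`. The function name `llama_backbone_params` is hypothetical.

```python
# Values copied from model/config.json above.
config = {
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 64,
    "num_hidden_layers": 16,
    "vocab_size": 128256,
    "tie_word_embeddings": True,
}


def llama_backbone_params(cfg: dict) -> int:
    """Count parameters of a standard Llama decoder stack (no bias terms)."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    attn = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]         # gate, up, down projections
    norms = 2 * h                                  # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else embed
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embed + final_norm + lm_head


# head_dim should equal hidden_size / num_attention_heads.
assert config["hidden_size"] // config["num_attention_heads"] == config["head_dim"]

params = llama_backbone_params(config)
bf16_bytes = params * 2  # bfloat16 stores each parameter in 2 bytes
print(f"backbone parameters: {params:,}")
print(f"bfloat16 size: {bf16_bytes / 2**30:.2f} GiB")
```

This gives 1,235,814,400 backbone parameters, or about 2.30 GiB in bfloat16. The `model/weights.safetensors` file in this commit is larger than that; the difference is presumably the VibeVoice head and other TADA-specific tensors, though the commit itself does not break that down.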

model/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:45b45dcbc3faa9efa11c6aa6a6a84290f26a1c6944b860521526be4dc30d4e4b
+size 3269784687
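
The `size` fields in the four LFS pointers can be cross-checked against the size table in the README. The sketch below parses the pointer text exactly as it appears in this commit (with the leading `+` removed) and converts bytes to binary units; `parse_lfs_pointer` is a hypothetical helper written for this check, not part of any Git LFS library.

```python
# Git LFS pointer bodies copied from this commit (diff "+" markers removed).
POINTERS = {
    "aligner/weights.safetensors": """version https://git-lfs.github.com/spec/v1
oid sha256:af2e603bd1f76bf33dbaf0ebe1d65f7024641d28e2887c703eadb7a3cda1316e
size 893830649""",
    "decoder/weights.safetensors": """version https://git-lfs.github.com/spec/v1
oid sha256:40310d1e93460f2bea9b77b83dfafe11a5ffbf5dc36224b4ca89db20c1776fcb
size 237407562""",
    "encoder/weights.safetensors": """version https://git-lfs.github.com/spec/v1
oid sha256:5732c5a73f42475f620a6a8dba36f404e58fc3b4bae1f1766503a1448e062970
size 186606332""",
    "model/weights.safetensors": """version https://git-lfs.github.com/spec/v1
oid sha256:45b45dcbc3faa9efa11c6aa6a6a84290f26a1c6944b860521526be4dc30d4e4b
size 3269784687""",
}


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict; size becomes int."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields


sizes = {path: parse_lfs_pointer(body)["size"] for path, body in POINTERS.items()}
total = sum(sizes.values())

for path, n in sizes.items():
    print(f"{path}: {n / 2**20:.0f} MiB")
print(f"total: {total / 2**30:.2f} GiB")
```

The aligner, decoder, and encoder come out at 852, 226, and 178 MiB, the model file at roughly 3.05 GiB, and the total at 4.27 GiB. These agree with the README table, which suggests its "MB" and "GB" figures are binary units (MiB and GiB) rather than decimal ones.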