sharath25 committed on
Commit 14c50fe · verified · 1 Parent(s): 466490d

Upload folder using huggingface_hub

README.md ADDED
---
license: llama3.2
library_name: mlx
language:
- en
tags:
- mlx
- tts
- text-to-speech
- speech-synthesis
- tada
- apple-silicon
pipeline_tag: text-to-speech
base_model: meta-llama/Llama-3.2-1B
---

# MLX-TADA-1B

Pre-converted [MLX](https://github.com/ml-explore/mlx) weights for [TADA](https://github.com/HumeAI/tada) (Text-Acoustic Dual Alignment) speech synthesis on Apple Silicon.

Built on [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B). English only.

| Component | File | Size |
|-----------|------|------|
| LLM + VibeVoice head | `model/weights.safetensors` | 3.0 GB |
| Aligner | `aligner/weights.safetensors` | 852 MB |
| Decoder (DAC) | `decoder/weights.safetensors` | 226 MB |
| Encoder | `encoder/weights.safetensors` | 178 MB |
| **Total** | | **~4.3 GB** |

All weights are stored in bfloat16 safetensors format.

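The safetensors container is simple enough to inspect without loading any ML framework: the file starts with an 8-byte little-endian length, followed by a JSON header that maps each tensor name to its dtype, shape, and byte offsets. A minimal stdlib-only sketch for checking what a downloaded shard contains:

```python
import json
import struct

def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file; no tensor data is loaded."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 giving the JSON header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Drop the optional "__metadata__" entry; the rest maps
    # tensor name -> {"dtype", "shape", "data_offsets"}.
    return {k: v for k, v in header.items() if k != "__metadata__"}
```

For example, `read_safetensors_header("model/weights.safetensors")` lets you confirm every tensor reports `"dtype": "BF16"` without allocating the 3 GB of weights.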
## Quick Start

```bash
git clone https://github.com/HumeAI/tada.git
cd tada/apple
uv venv && uv pip install -e .
```

### Option A: Download pre-converted weights (this repo)

```python
from huggingface_hub import snapshot_download

snapshot_download("HumeAI/mlx-tada-1b", local_dir="./weights/1b")
```

Then run:

```python
from mlx_tada import TadaForCausalLM, save_wav

model = TadaForCausalLM.from_weights("./weights/1b", quantize=4)
ref = model.load_reference("speaker.wav")
out = model.generate("Hello, this is a test of TADA speech synthesis.", ref)
save_wav(out.audio, "output.wav")
```

### Option B: Use from_pretrained (auto-downloads)

```python
from mlx_tada import TadaForCausalLM, save_wav

model = TadaForCausalLM.from_pretrained("HumeAI/mlx-tada-1b", quantize=4)
ref = model.load_reference("speaker.wav")
out = model.generate("Hello, this is a test of TADA speech synthesis.", ref)
save_wav(out.audio, "output.wav")
```

### CLI

```bash
uv run python -m mlx_tada.generate \
  --weights ./weights/1b \
  --audio speaker.wav \
  --text "Hello, this is a test of TADA speech synthesis." \
  --quantize 4 \
  --output output.wav
```

## Hardware Requirements

| Precision | Memory |
|-----------|--------|
| bfloat16 (default) | ~8 GB |
| 4-bit quantized | ~3 GB |

Tested on Apple M1 Pro and above. 4-bit quantization is recommended for most devices: it is roughly 10x faster, uses about 60% less memory, and loses little quality.

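The memory saving comes from group-wise affine quantization: each group of weights shares one scale and one offset, so storage drops from 16 bits per weight to roughly 4.5 bits (4-bit codes plus the per-group parameters). This NumPy sketch illustrates the idea; it is not MLX's actual kernel, and the group size of 64 is an illustrative choice:

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Affine 4-bit quantization with a per-group scale and offset (sketch)."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                  # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo, shape):
    """Reconstruct an approximation of the original weights."""
    return (q * scale + lo).reshape(shape)

# Round-trip a random weight matrix and measure the worst-case error.
w = np.random.default_rng(0).normal(size=(128, 128)).astype(np.float32)
q, scale, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, lo, w.shape)
err = np.abs(w - w_hat).max()
```

The maximum error is bounded by half a quantization step per group, which is why quality loss stays small when groups are kept narrow.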
## Convert Weights Yourself

If you prefer to convert from the original PyTorch weights (requires [gated Llama access](https://huggingface.co/meta-llama/Llama-3.2-1B)):

```bash
cd tada/apple
uv pip install -e ".[convert]"
huggingface-cli login
uv run python -m mlx_tada.convert_1b ./weights/1b
```

## Related

- [TADA GitHub](https://github.com/HumeAI/tada) — source code, PyTorch inference, training
- [TADA Paper](https://arxiv.org/abs/2602.23068) — paper on arXiv
- [HumeAI/tada-1b](https://huggingface.co/HumeAI/tada-1b) — PyTorch weights
- [HumeAI/mlx-tada-3b](https://huggingface.co/HumeAI/mlx-tada-3b) — 3B multilingual MLX weights
- [HumeAI/tada-codec](https://huggingface.co/HumeAI/tada-codec) — shared encoder, decoder, and aligner weights

## License

This model is built with [Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-1B) and is released under the [Llama 3.2 Community License Agreement](https://github.com/HumeAI/tada/blob/main/LICENSE).

> Llama 3.2 is licensed under the Llama 3.2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Built with Llama.
aligner/weights.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:af2e603bd1f76bf33dbaf0ebe1d65f7024641d28e2887c703eadb7a3cda1316e
size 893830649
decoder/weights.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:40310d1e93460f2bea9b77b83dfafe11a5ffbf5dc36224b4ca89db20c1776fcb
size 237407562
encoder/weights.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5732c5a73f42475f620a6a8dba36f404e58fc3b4bae1f1766503a1448e062970
size 186606332
model/config.json ADDED
{
  "acoustic_dim": 512,
  "acoustic_from_nth_hidden_state": -1,
  "acoustic_mean": 0.0,
  "acoustic_std": 1.5,
  "add_semantic_to_condition": 0.0,
  "architectures": [
    "TadaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "bottleneck_dim": null,
  "context_window": 8,
  "diffusion_head_type": "vibevoice",
  "dist_type": "fixed",
  "dtype": "bfloat16",
  "eos_token_id": 128001,
  "head_dim": 64,
  "head_ffn_ratio": 4.0,
  "head_layers": 6,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "latent_dropout": 0.0,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "num_time_classes": 256,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "shift_acoustic": 5,
  "tie_word_embeddings": true,
  "transformers_version": "4.57.3",
  "use_cache": true,
  "vocab_size": 128256
}
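The `rope_scaling` block uses the `llama3` rope type, which stretches the context from 8192 to 131072 positions by slowing down only the low-frequency RoPE components: short-wavelength frequencies are untouched, long-wavelength ones are divided by `factor`, and a band in between is interpolated. A sketch of that rescaling, assuming the standard wavelength-threshold formulation used by common `llama3` rope implementations:

```python
import math

def llama3_scale_freq(freq, factor=32.0, low_freq_factor=1.0,
                      high_freq_factor=4.0,
                      original_max_position_embeddings=8192):
    """Rescale one RoPE frequency per the `llama3` rope_scaling scheme."""
    wavelen = 2 * math.pi / freq
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    if wavelen < high_freq_wavelen:
        return freq                 # high frequencies: unchanged
    if wavelen > low_freq_wavelen:
        return freq / factor        # low frequencies: fully slowed
    # Smooth interpolation between the two regimes.
    smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq

# Base frequencies from this config: rope_theta=500000.0, head_dim=64.
head_dim, theta = 64, 500000.0
freqs = [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]
scaled = [llama3_scale_freq(f) for f in freqs]
```

With these config values the fastest-rotating dimensions keep their original frequencies, while the slowest are divided by 32, which is what makes the 16x context extension possible without retraining the high-frequency positional behavior.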
model/weights.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:45b45dcbc3faa9efa11c6aa6a6a84290f26a1c6944b860521526be4dc30d4e4b
size 3269784687