TADA Encoder & Aligner - GGUF
GGUF conversions of the HumeAI/tada-codec encoder and aligner components for use with CrispASR's --make-ref pipeline.
These models enable voice reference creation directly in C++: they convert a WAV file and its transcript into a voice reference GGUF that the TADA TTS backend can use for voice cloning.
Files
| File | Size | Description |
|---|---|---|
| tada-encoder-f16.gguf | 178 MB | Shared encoder: WavEncoder (DAC-style conv, 480x downsample) + 6-layer LocalAttentionEncoder (RoPE, segment mask) + hidden linear (1024→512) |
| tada-aligner-en.gguf | 1.1 GB | English aligner: wav2vec2-large (24 layers, 1024-d) + 128K-class Llama-3.2 CTC head for text-audio alignment |
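After downloading either file, a quick header check can confirm that it is a well-formed GGUF container before handing it to CrispASR. The sketch below is not part of CrispASR; `read_gguf_header` is a hypothetical helper name, and the code assumes the GGUF v2 or later header layout (4-byte magic, little-endian uint32 version, then two uint64 counts).

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header.

    Assumes GGUF v2+ layout: magic "GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian).
    """
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return version, n_tensors, n_kv
```

For example, `read_gguf_header("tada-encoder-f16.gguf")` returns the format version together with the number of tensors and metadata entries stored in the file.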
Architecture
The TADA encoder pipeline converts audio + transcript into aligned acoustic features (voice fingerprint):
Audio (24kHz) ─┬─► WavEncoder (strided conv, 480x downsample) ─► 50Hz features (1024-d)
               │                                                            │
               │                                                 + pos_emb(token_masks)
               │                                                            │
               │    ┌───────────────────────────────────────────────────────┘
               │    ▼
               │  LocalAttentionEncoder (6 layers, RoPE, v2 segment mask)
               │    │
               │  hidden_linear (1024→512)
               │    │
               │  post-process (zero, noise, gather, normalize)
               │    │
               │    ▼
               │  token_values (N × 512) ──► voice reference GGUF
               │
               └─► Resample 16kHz ─► Aligner (wav2vec2-large CTC) ─► DP alignment
                                                                          │
                                                                 token_positions (N,)
                                                                 token_masks (T,)
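To make the shapes in the diagram concrete, here is a minimal pure-Python sketch of how the aligner outputs could relate to the encoder's "gather" step. It rests on assumptions the diagram does not confirm: that `token_positions` holds 0-based frame indices, that `token_masks` is a 0/1 vector over the T frames marking those indices, and that gathering means picking the feature row at each token position. The function names are hypothetical, and the zero, noise, and normalize steps are left out.

```python
def positions_to_mask(token_positions, num_frames):
    """Build a length-T 0/1 mask with ones at the token frames (assumed semantics)."""
    mask = [0] * num_frames
    for p in token_positions:
        mask[p] = 1
    return mask


def gather_token_values(frame_features, token_positions):
    """Pick one feature row per token: (T x D) frames -> (N x D) token values."""
    return [frame_features[p] for p in token_positions]


# Toy example: T = 6 frames of 2-d features, N = 2 tokens.
frames = [[float(t), float(t)] for t in range(6)]
positions = [1, 4]
mask = positions_to_mask(positions, len(frames))
values = gather_token_values(frames, positions)
```

In the real pipeline the rows would be 512-d outputs of `hidden_linear`, so `values` would have shape N × 512.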
Encoder details
- WavEncoder: Conv1d(1→64, k=7) → 4× EncoderBlock (strides [6,5,4,4], Snake1d + weight-normed convs) → Snake1d → Conv1d(1024→1024, k=3). Total 480× downsample: 24kHz → 50Hz.
- LocalAttentionEncoder: 6 layers, 1024-d, 8 heads (head_dim=128), RoPE (θ=10000), GELU FFN (4096), v2 block-attention segment mask, post-norm.
- hidden_linear: Linear(1024→512) projects to acoustic embedding space.
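The numbers in the list above can be cross-checked with a few lines of arithmetic. The sketch below uses only values stated here (strides, sample rate, model width, head count); the helper names are illustrative, and the frame count is approximate because the padding behaviour of the convolutions is not specified in this document.

```python
from math import prod

SAMPLE_RATE_HZ = 24_000
ENCODER_STRIDES = [6, 5, 4, 4]  # the four EncoderBlock strides


def frame_rate(sample_rate_hz, strides):
    """Return (total downsample factor, resulting frame rate in Hz)."""
    factor = prod(strides)
    return factor, sample_rate_hz / factor


def approx_num_frames(num_samples, strides):
    """Rough frame count for a clip, ignoring padding at the edges."""
    return num_samples // prod(strides)


factor, rate = frame_rate(SAMPLE_RATE_HZ, ENCODER_STRIDES)
print(factor, rate)  # 480 50.0

# Attention width: 1024-d split across 8 heads gives head_dim = 128.
head_dim = 1024 // 8
```

A 10-second clip at 24kHz has 240,000 samples, so `approx_num_frames(240_000, ENCODER_STRIDES)` gives about 500 frames.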
Aligner details
- Base: wav2vec2-large architecture (24 transformer layers, 1024 hidden, 16 attention heads)
- CTC head: 128,256 output classes (Llama-3.2 tokenizer vocabulary)
- CNN: 7-layer feature extractor (group-norm variant, strides [5,2,2,2,2,2,2])
- Positional conv: K=128, groups=16 (weight-norm materialized)
- Alignment: DP algorithm finds optimal monotonic text-to-audio alignment
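The exact DP used by the aligner is not described here, so the following is only one possible formulation of a monotonic text-to-audio alignment. It assumes a score table `logp[t][n]` (for instance, the CTC log-probability of transcript token `n` at frame `t`) and picks one frame per token, strictly increasing in time, that maximizes the total score. `align_monotonic` is a hypothetical name, not a CrispASR or TADA function.

```python
def align_monotonic(logp):
    """Return one frame index per token, strictly increasing, maximizing the score sum.

    logp[t][n] is the score of placing token n at frame t (T frames, N tokens).
    """
    T = len(logp)
    N = len(logp[0]) if T else 0
    if N == 0:
        return []
    if N > T:
        raise ValueError("more tokens than frames")
    NEG = float("-inf")
    best = [[NEG] * T for _ in range(N)]  # best[n][t]: best score with token n at frame t
    back = [[-1] * T for _ in range(N)]   # back[n][t]: frame chosen for token n-1
    for t in range(T):
        best[0][t] = logp[t][0]
    for n in range(1, N):
        run_best, run_arg = NEG, -1  # running max of best[n-1][0..t-1]
        for t in range(1, T):
            if best[n - 1][t - 1] > run_best:
                run_best, run_arg = best[n - 1][t - 1], t - 1
            if run_arg >= 0:
                best[n][t] = run_best + logp[t][n]
                back[n][t] = run_arg
    t = max(range(T), key=lambda i: best[N - 1][i])
    if best[N - 1][t] == NEG:
        raise ValueError("no valid alignment")
    positions = [0] * N
    for n in range(N - 1, -1, -1):  # walk the back-pointers from the last token
        positions[n] = t
        t = back[n][t]
    return positions


# Toy example: 5 frames, 2 tokens.
toy = [[-0.1, -5.0], [-2.0, -4.0], [-3.0, -0.2], [-3.0, -1.0], [-3.0, -3.0]]
toy_positions = align_monotonic(toy)
```

The running maximum keeps the cost at O(T·N) rather than O(T²·N). Under the assumptions above, the returned list plays the role of `token_positions (N,)`.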
Usage with CrispASR
Creating a voice reference (current: Python)
python models/convert-tada-ref-to-gguf.py \
--audio speaker.wav \
--transcript "Exact words spoken in the audio." \
--output tada-ref-custom.gguf
Creating a voice reference (planned: C++)
crispasr --backend tada-3b-ml --make-ref \
--voice speaker.wav \
--ref-text "Exact words spoken in the audio." \
--make-ref-output tada-ref-custom.gguf
Using the voice reference for TTS
crispasr --backend tada-3b-ml -m auto \
--voice tada-ref-custom.gguf \
--tts "Hello, this is my cloned voice." \
--tts-output output.wav \
--i-have-rights
Parity testing
These GGUFs are validated against the Python reference using the crispasr-diff harness:
# Generate Python reference dump
TADA_ENCODER_TEXT="Please call Stella." \
TADA_CODEC_DIR=/path/to/tada-codec \
python tools/dump_reference.py --backend tada-encoder \
--model-dir HumeAI/tada-codec \
--audio samples/jfk.wav \
--output ref.gguf
# Compare C++ output against reference
crispasr-diff tada-encoder tada-encoder-f16.gguf ref.gguf samples/jfk.wav
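For readers who want to spot-check a single tensor by hand, parity comparisons of this kind usually come down to a couple of simple metrics. The sketch below shows maximum absolute difference and cosine similarity over flattened float lists. It does not reproduce what crispasr-diff computes or reports; the function names are illustrative.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero vectors as identical, otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)
```

With F16 weights, small non-zero differences against an F32 Python reference are expected, so a tolerance-based check is more useful than exact equality.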
Supported languages
The aligner has language-specific variants for non-English alignment. Currently only the English aligner is provided here. Language-specific aligners (ar, ch, de, es, fr, it, ja, pl, pt) can be converted with:
python models/convert-tada-aligner-to-gguf.py \
--codec-repo HumeAI/tada-codec \
--language fr \
--output tada-aligner-fr.gguf
License
These weights are derived from HumeAI/tada-codec, which is released under the Llama 3.2 Community License Agreement.
Conversion
Converted with CrispASR's GGUF converters:
# Encoder (shared, all languages)
python models/convert-tada-encoder-to-gguf.py \
--input HumeAI/tada-codec \
--output tada-encoder-f16.gguf
# Aligner (English)
python models/convert-tada-aligner-to-gguf.py \
--codec-repo HumeAI/tada-codec \
--output tada-aligner-en.gguf