TADA Encoder & Aligner - GGUF

GGUF conversions of the HumeAI/tada-codec encoder and aligner components for use with CrispASR's --make-ref pipeline.

These models enable voice reference creation directly in C++: a WAV file and its transcript are converted into a voice reference GGUF that can be used with the TADA TTS backend for voice cloning.

Files

| File | Size | Description |
|---|---|---|
| tada-encoder-f16.gguf | 178 MB | Shared encoder: WavEncoder (DAC-style conv, 480x downsample) + 6-layer LocalAttentionEncoder (RoPE, segment mask) + hidden linear (1024→512) |
| tada-aligner-en.gguf | 1.1 GB | English aligner: wav2vec2-large (24 layers, 1024-d) + 128K-class Llama-3.2 CTC head for text-audio alignment |
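
After downloading, a quick sanity check is to read the fixed-size header at the start of each file. The sketch below assumes the little-endian GGUF v2/v3 header layout (4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata key/value count). `read_gguf_header` is a hypothetical helper written for this illustration, not part of CrispASR.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the little-endian GGUF v2/v3 header layout.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 bytes
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("tada-encoder-f16.gguf")` on an intact download should return without raising; a truncated or mislabeled file fails the magic check.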

Architecture

The TADA encoder pipeline converts audio + transcript into aligned acoustic features (voice fingerprint):

Audio (24kHz) ─┬─► WavEncoder (strided conv, 480x downsample) ─► 50Hz features (1024-d)
               │                                                        │
               │                                              + pos_emb(token_masks)
               │                                                        │
               │   ┌────────────────────────────────────────────────────┘
               │   ▼
               │   LocalAttentionEncoder (6 layers, RoPE, v2 segment mask)
               │                      │
               │              hidden_linear (1024→512)
               │                      │
               │              post-process (zero, noise, gather, normalize)
               │                      │
               │                      ▼
               │              token_values (N × 512)  ──► voice reference GGUF
               │
               └─► Resample 16kHz ─► Aligner (wav2vec2-large CTC) ─► DP alignment
                                                                            │
                                                                    token_positions (N,)
                                                                    token_masks (T,)
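
To make the relationship between the two branches concrete, the sketch below shows how one vector per token could be gathered from the 50Hz frame features once the aligner has assigned each token a frame. It is an illustration under assumptions, not the converter's code: frame indices are assumed to be 0-based, token_masks is assumed to be 1 at token frames and 0 elsewhere, and the zero, noise, and normalize steps are omitted. `gather_token_values` is a hypothetical name.

```python
def gather_token_values(frame_features, token_positions):
    """Pick one feature vector per token and build a frame-level mask.

    frame_features:  list of T vectors (frames after hidden_linear)
    token_positions: list of N frame indices, assumed 0-based
    Returns (token_values, token_masks) with N vectors and T mask entries.
    """
    num_frames = len(frame_features)
    token_masks = [0] * num_frames
    token_values = []
    for pos in token_positions:
        if not 0 <= pos < num_frames:
            raise IndexError(f"token position {pos} is outside 0..{num_frames - 1}")
        token_masks[pos] = 1                             # mark the frame that carries a token
        token_values.append(list(frame_features[pos]))   # copy that frame's vector
    return token_values, token_masks
```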

Encoder details

  • WavEncoder: Conv1d(1→64, k=7) → 4× EncoderBlock (strides [6,5,4,4], Snake1d + weight-normed convs) → Snake1d → Conv1d(1024→1024, k=3). Total 480× downsample: 24kHz → 50Hz.
  • LocalAttentionEncoder: 6 layers, 1024-d, 8 heads (head_dim=128), RoPE (θ=10000), GELU FFN (4096), v2 block-attention segment mask, post-norm.
  • hidden_linear: Linear(1024→512) projects to acoustic embedding space.
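
The figures in this list can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; `approx_num_frames` is a hypothetical helper that ignores padding and edge effects, so real frame counts may differ by a frame or two.

```python
import math

SAMPLE_RATE_HZ = 24_000
ENCODER_STRIDES = [6, 5, 4, 4]   # one stride per EncoderBlock
MODEL_DIM, NUM_HEADS = 1024, 8

downsample = math.prod(ENCODER_STRIDES)      # 6 * 5 * 4 * 4
frame_rate_hz = SAMPLE_RATE_HZ / downsample  # samples per second / samples per frame
head_dim = MODEL_DIM // NUM_HEADS


def approx_num_frames(duration_s):
    """Approximate encoder frame count for a clip of the given duration."""
    return int(duration_s * SAMPLE_RATE_HZ) // downsample


print(downsample, frame_rate_hz, head_dim)  # 480 50.0 128
print(approx_num_frames(11.0))              # 550
```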

Aligner details

  • Base: wav2vec2-large architecture (24 transformer layers, 1024 hidden, 16 attention heads)
  • CTC head: 128,256 output classes (Llama-3.2 tokenizer vocabulary)
  • CNN: 7-layer feature extractor (group-norm variant, strides [5,2,2,2,2,2,2])
  • Positional conv: K=128, groups=16 (weight-norm materialized)
  • Alignment: DP algorithm finds optimal monotonic text-to-audio alignment
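
The alignment step can be illustrated with a small dynamic program. This is a minimal sketch of one way to find a monotonic text-to-audio alignment, not necessarily the algorithm the converter uses. It assumes the CTC log-probabilities have already been reduced to a T × N table holding, for each frame, the score of each transcript token, that all scores are finite, and that each token is placed on exactly one frame. `align_monotonic` is a hypothetical name.

```python
def align_monotonic(log_probs):
    """log_probs[t][n]: score of placing transcript token n at frame t.

    Returns strictly increasing frame indices, one per token, that maximize
    the total score. Needs at least as many frames as tokens.
    """
    num_frames = len(log_probs)
    num_tokens = len(log_probs[0]) if num_frames else 0
    if num_tokens == 0 or num_frames < num_tokens:
        raise ValueError("need at least one token and num_frames >= num_tokens")

    neg_inf = float("-inf")
    # best[n][t]: best total score with token n placed exactly at frame t.
    best = [[neg_inf] * num_frames for _ in range(num_tokens)]
    # back[n][t]: frame chosen for token n-1 on that best path.
    back = [[-1] * num_frames for _ in range(num_tokens)]

    for t in range(num_frames):
        best[0][t] = log_probs[t][0]

    for n in range(1, num_tokens):
        run_best, run_arg = neg_inf, -1  # running max over best[n-1][0..t-1]
        for t in range(1, num_frames):
            if best[n - 1][t - 1] > run_best:
                run_best, run_arg = best[n - 1][t - 1], t - 1
            if run_arg >= 0:
                best[n][t] = run_best + log_probs[t][n]
                back[n][t] = run_arg

    # Backtrack from the best frame for the last token.
    t = max(range(num_frames), key=lambda i: best[num_tokens - 1][i])
    if best[num_tokens - 1][t] == neg_inf:
        raise ValueError("no valid alignment found")
    positions = [t]
    for n in range(num_tokens - 1, 0, -1):
        t = back[n][t]
        positions.append(t)
    return positions[::-1]
```

The running maximum keeps the cost at O(N·T) rather than O(N·T²), which matters when T is several hundred frames.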

Usage with CrispASR

Creating a voice reference (current: Python)

python models/convert-tada-ref-to-gguf.py \
    --audio speaker.wav \
    --transcript "Exact words spoken in the audio." \
    --output tada-ref-custom.gguf
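
Before building a reference, it can help to inspect the input recording. The sketch below uses Python's standard `wave` module (PCM WAV only) to report the sample rate, channel count, and duration, plus a rough estimate of the 50Hz encoder frame count. Whether the converter resamples or downmixes its input is not stated here, so this only reports what the file contains. `describe_wav` is a hypothetical helper.

```python
import wave


def describe_wav(path, frame_rate_hz=50):
    """Summarize a PCM WAV file and estimate its encoder frame count."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        num_samples = wav.getnframes()  # samples per channel
    duration_s = num_samples / sample_rate
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "duration_s": duration_s,
        "approx_encoder_frames": int(duration_s * frame_rate_hz),
    }
```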

Creating a voice reference (planned: C++)

crispasr --backend tada-3b-ml --make-ref \
    --voice speaker.wav \
    --ref-text "Exact words spoken in the audio." \
    --make-ref-output tada-ref-custom.gguf

Using the voice reference for TTS

crispasr --backend tada-3b-ml -m auto \
    --voice tada-ref-custom.gguf \
    --tts "Hello, this is my cloned voice." \
    --tts-output output.wav \
    --i-have-rights

Parity testing

These GGUFs are validated against the Python reference using the crispasr-diff harness:

# Generate Python reference dump
TADA_ENCODER_TEXT="Please call Stella." \
TADA_CODEC_DIR=/path/to/tada-codec \
python tools/dump_reference.py --backend tada-encoder \
    --model-dir HumeAI/tada-codec \
    --audio samples/jfk.wav \
    --output ref.gguf

# Compare C++ output against reference
crispasr-diff tada-encoder tada-encoder-f16.gguf ref.gguf samples/jfk.wav
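
For readers unfamiliar with parity checks, the sketch below shows two metrics commonly used to compare a reference tensor with a reimplementation's output: maximum absolute difference and cosine similarity. Which metrics crispasr-diff actually reports is not stated here; `compare_tensors` is a hypothetical helper operating on flat lists of floats.

```python
import math


def compare_tensors(reference, candidate):
    """Return (max_abs_diff, cosine_similarity) for two flat float sequences."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    max_abs_diff = max(
        (abs(r - c) for r, c in zip(reference, candidate)), default=0.0
    )
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        # Cosine is undefined for zero vectors; treat two zero vectors as equal.
        cosine = 1.0 if norm_ref == norm_cand else 0.0
    else:
        cosine = dot / (norm_ref * norm_cand)
    return max_abs_diff, cosine
```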

Supported languages

The aligner has language-specific variants for non-English alignment. Currently only the English aligner is provided here. Language-specific aligners (ar, ch, de, es, fr, it, ja, pl, pt) can be converted with:

python models/convert-tada-aligner-to-gguf.py \
    --codec-repo HumeAI/tada-codec \
    --language fr \
    --output tada-aligner-fr.gguf
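
To convert several languages in one go, the command above can be generated per language. The sketch below only builds and prints the commands, using the script path, flags, and language codes shown in this section; `aligner_commands` is a hypothetical helper, and the output file naming follows the `tada-aligner-fr.gguf` pattern.

```python
import shlex

LANGUAGES = ["ar", "ch", "de", "es", "fr", "it", "ja", "pl", "pt"]


def aligner_commands(languages=LANGUAGES, codec_repo="HumeAI/tada-codec"):
    """Build one converter command (as an argument list) per language."""
    return [
        [
            "python", "models/convert-tada-aligner-to-gguf.py",
            "--codec-repo", codec_repo,
            "--language", lang,
            "--output", f"tada-aligner-{lang}.gguf",
        ]
        for lang in languages
    ]


for cmd in aligner_commands():
    # Printed for review; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```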

License

These weights are derived from HumeAI/tada-codec, which is released under the Llama 3.2 Community License Agreement.

Conversion

Converted with CrispASR's GGUF converters:

# Encoder (shared, all languages)
python models/convert-tada-encoder-to-gguf.py \
    --input HumeAI/tada-codec \
    --output tada-encoder-f16.gguf

# Aligner (English)
python models/convert-tada-aligner-to-gguf.py \
    --codec-repo HumeAI/tada-codec \
    --output tada-aligner-en.gguf