TADA Encoder & Aligner — GGUF

GGUF conversions of the HumeAI/tada-codec encoder and aligner components for use with CrispASR's --make-ref pipeline.

These models enable voice reference creation directly in C++: they convert a WAV file and its transcript into a voice reference GGUF that the TADA TTS backend can use for voice cloning.

Files

File	Size	Description
`tada-encoder-f16.gguf`	178 MB	Shared encoder: WavEncoder (DAC-style conv, 480x downsample) + 6-layer LocalAttentionEncoder (RoPE, segment mask) + hidden linear (1024→512)
`tada-aligner-en.gguf`	1.1 GB	English aligner: wav2vec2-large (24 layers, 1024-d) + 128K-class Llama-3.2 CTC head for text-audio alignment

Architecture

The TADA encoder pipeline converts audio + transcript into aligned acoustic features (voice fingerprint):

Audio (24kHz) ─┬─► WavEncoder (strided conv, 480x downsample) ─► 50Hz features (1024-d)
               │                                                        │
               │                                              + pos_emb(token_masks)
               │                                                        │
               │   ┌─────────────────────────────────────────────────────┘
               │   ▼
               │   LocalAttentionEncoder (6 layers, RoPE, v2 segment mask)
               │                      │
               │              hidden_linear (1024→512)
               │                      │
               │              post-process (zero, noise, gather, normalize)
               │                      │
               │                      ▼
               │              token_values (N × 512)  ──► voice reference GGUF
               │
               └─► Resample 16kHz ─► Aligner (wav2vec2-large CTC) ─► DP alignment
                                                                           │
                                                                    token_positions (N,)
                                                                    token_masks (T,)
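
The shapes in the diagram can be illustrated with a small sketch. It assumes that token_positions are indices into the 50Hz frame sequence and that the "gather" step simply picks one frame per token; the zero, noise, and normalize steps are not modelled because their details are not given here. The function names are hypothetical.

```python
FRAME_RATE_HZ = 50  # encoder output rate from the diagram


def gather_token_values(frame_features, token_positions):
    """Pick one feature row per token: (T x D) frames -> (N x D) token values.

    Assumption: gather means plain indexing by frame position.
    """
    num_frames = len(frame_features)
    for pos in token_positions:
        if not 0 <= pos < num_frames:
            raise IndexError(f"token position {pos} outside 0..{num_frames - 1}")
    return [list(frame_features[pos]) for pos in token_positions]


def position_to_seconds(position, frame_rate_hz=FRAME_RATE_HZ):
    """Convert a frame index to a timestamp in seconds."""
    return position / frame_rate_hz


# Toy example: 6 frames of 4-d features (the real features are 512-d).
frames = [[float(t)] * 4 for t in range(6)]
token_values = gather_token_values(frames, [1, 4])
```

With N token positions the result has N rows, matching the token_values (N × 512) shape above.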

Encoder details

WavEncoder: Conv1d(1→64, k=7) → 4× EncoderBlock (strides [6,5,4,4], Snake1d + weight-normed convs) → Snake1d → Conv1d(1024→1024, k=3). Total 480× downsample: 24kHz → 50Hz.
LocalAttentionEncoder: 6 layers, 1024-d, 8 heads (head_dim=128), RoPE (θ=10000), GELU FFN (4096), v2 block-attention segment mask, post-norm.
hidden_linear: Linear(1024→512) projects to acoustic embedding space.
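
The numbers above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the helper names are hypothetical, and the frame count ignores any edge padding the real convolutions may apply.

```python
from math import prod

SAMPLE_RATE_HZ = 24_000         # encoder input rate
ENCODER_STRIDES = [6, 5, 4, 4]  # EncoderBlock strides listed above


def frame_rate(sample_rate_hz, strides):
    """Frames per second after a stack of strided convolutions."""
    return sample_rate_hz / prod(strides)


def num_frames(num_samples, strides):
    """Approximate frame count, ignoring edge padding (an assumption)."""
    return num_samples // prod(strides)


downsample = prod(ENCODER_STRIDES)                  # total downsample factor
rate = frame_rate(SAMPLE_RATE_HZ, ENCODER_STRIDES)  # frames per second
head_dim = 1024 // 8                                # model width / attention heads
```

Under these assumptions a 3-second clip (72,000 samples) yields about 150 frames.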

Aligner details

Base: wav2vec2-large architecture (24 transformer layers, 1024 hidden, 16 attention heads)
CTC head: 128,256 output classes (Llama-3.2 tokenizer vocabulary)
CNN: 7-layer feature extractor (group-norm variant, strides [5,2,2,2,2,2,2])
Positional conv: K=128, groups=16 (weight-norm materialized)
Alignment: DP algorithm finds optimal monotonic text-to-audio alignment
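
To make the alignment step concrete, here is a minimal sketch of a monotonic dynamic program. It is not the TADA implementation: it assumes the aligner yields a score logp[t][k] for token k at frame t, and it picks one strictly increasing frame per token that maximizes the total score. The function names are hypothetical.

```python
def align_monotonic(logp):
    """Return one frame index per token, strictly increasing.

    logp[t][k] is the (finite) score of token k at frame t.
    """
    num_frames, num_tokens = len(logp), len(logp[0])
    if num_tokens > num_frames:
        raise ValueError("more tokens than frames")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_frames for _ in range(num_tokens)]
    back = [[-1] * num_frames for _ in range(num_tokens)]
    for t in range(num_frames):
        best[0][t] = logp[t][0]
    for k in range(1, num_tokens):
        run_best, run_arg = neg_inf, -1
        for t in range(k, num_frames):
            # Track the best placement of token k-1 at any frame before t.
            if best[k - 1][t - 1] > run_best:
                run_best, run_arg = best[k - 1][t - 1], t - 1
            best[k][t] = run_best + logp[t][k]
            back[k][t] = run_arg
    # Backtrack from the best frame for the last token.
    t = max(range(num_tokens - 1, num_frames), key=lambda i: best[-1][i])
    positions = [0] * num_tokens
    for k in range(num_tokens - 1, -1, -1):
        positions[k] = t
        t = back[k][t]
    return positions


def positions_to_mask(positions, num_frames):
    """Binary mask of length T with 1 at each token frame."""
    chosen = set(positions)
    return [1 if t in chosen else 0 for t in range(num_frames)]


# Toy scores: 5 frames, 2 tokens.
scores = [[-2.0, -5.0], [-0.1, -4.0], [-3.0, -2.0], [-4.0, -0.2], [-5.0, -3.0]]
token_positions = align_monotonic(scores)
token_masks = positions_to_mask(token_positions, len(scores))
```

The running maximum keeps the search at O(N·T) time instead of O(N·T²).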

Usage with CrispASR

Creating a voice reference (current — Python)

python models/convert-tada-ref-to-gguf.py \
    --audio speaker.wav \
    --transcript "Exact words spoken in the audio." \
    --output tada-ref-custom.gguf

Creating a voice reference (planned — C++)

crispasr --backend tada-3b-ml --make-ref \
    --voice speaker.wav \
    --ref-text "Exact words spoken in the audio." \
    --make-ref-output tada-ref-custom.gguf

Using the voice reference for TTS

crispasr --backend tada-3b-ml -m auto \
    --voice tada-ref-custom.gguf \
    --tts "Hello, this is my cloned voice." \
    --tts-output output.wav \
    --i-have-rights

Parity testing

These GGUFs are validated against the Python reference using the crispasr-diff harness:

# Generate Python reference dump
TADA_ENCODER_TEXT="Please call Stella." \
TADA_CODEC_DIR=/path/to/tada-codec \
python tools/dump_reference.py --backend tada-encoder \
    --model-dir HumeAI/tada-codec \
    --audio samples/jfk.wav \
    --output ref.gguf

# Compare C++ output against reference
crispasr-diff tada-encoder tada-encoder-f16.gguf ref.gguf samples/jfk.wav
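
The metrics and thresholds used by crispasr-diff are not described here. As a rough illustration of what a parity check measures, the sketch below compares two flattened tensors by maximum absolute difference and cosine similarity; both helper names are hypothetical.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


reference = [1.0, 2.0, 3.0]
candidate = [1.0, 2.5, 3.0]
diff = max_abs_diff(reference, candidate)
```

A small maximum difference together with a cosine similarity near 1 suggests the two implementations agree up to rounding.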

Supported languages

Alignment is language-specific: each language needs its own aligner. This repository currently provides only the English aligner. Aligners for the other languages (ar, ch, de, es, fr, it, ja, pl, pt) can be converted with:

python models/convert-tada-aligner-to-gguf.py \
    --codec-repo HumeAI/tada-codec \
    --language fr \
    --output tada-aligner-fr.gguf
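
To convert every listed language in one go, the command above can be generated in a loop. The sketch below only builds and prints the command lines (a dry run). It reuses the flags shown above and assumes the output naming pattern tada-aligner-<language>.gguf; the helper name is hypothetical.

```python
import shlex

LANGUAGES = ["ar", "ch", "de", "es", "fr", "it", "ja", "pl", "pt"]


def aligner_command(language, codec_repo="HumeAI/tada-codec"):
    """Build the converter command line for one language as an argument list."""
    return [
        "python", "models/convert-tada-aligner-to-gguf.py",
        "--codec-repo", codec_repo,
        "--language", language,
        "--output", f"tada-aligner-{language}.gguf",
    ]


commands = [aligner_command(lang) for lang in LANGUAGES]
for cmd in commands:
    # Dry run: print only. To execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell quoting problems if it is later passed to subprocess.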

License

These weights are derived from HumeAI/tada-codec, which is released under the Llama 3.2 Community License Agreement.

Conversion

Converted with CrispASR's GGUF converters:

# Encoder (shared, all languages)
python models/convert-tada-encoder-to-gguf.py \
    --input HumeAI/tada-codec \
    --output tada-encoder-f16.gguf

# Aligner (English)
python models/convert-tada-aligner-to-gguf.py \
    --codec-repo HumeAI/tada-codec \
    --output tada-aligner-en.gguf
