dots.tts-soar - GGUF (CrispASR)

GGUF conversions of dots.tts-soar, a continuous-latent autoregressive text-to-speech model, for use with CrispASR, a portable C/C++ speech engine built on ggml. No Python or PyTorch is needed at inference time; it runs on CPU (Metal/CUDA/Vulkan optional).

dots.tts generates 48 kHz speech patch-by-patch in a continuous latent space (no discrete audio codec). A Qwen2.5-1.5B language model drives a flow-matching DiT head that predicts acoustic latents, which a BigVGAN vocoder renders to waveform.

Files

| File | Size | Contents | Use |
|---|---|---|---|
| dots-tts-soar-f16.gguf | 4.6 GB | Full model, F16 | Reference quality |
| dots-tts-soar-q8_0.gguf | 3.1 GB | Mixed-quant (Q8_0 LLM + PatchEncoder, F16 DiT) | Recommended default |
| dots-tts-soar-q4_k.gguf | 2.0 GB | Mixed-quant (Q4_K LLM + PatchEncoder, F16 DiT) | Smallest footprint |
| dots-tts-soar-vocoder-f16.gguf | 0.36 GB | BigVGAN 48 kHz vocoder | Required companion |
| dots-tts-soar-spk-f16.gguf | 0.01 GB | CAM++ speaker encoder | Optional (voice cloning) |

Each core GGUF needs the vocoder companion. Pick one core + the vocoder.

⚠️ Why the quants are "mixed"

The flow-matching DiT head is kept at F16 in every quant. It runs in a classifier-free-guidance Euler ODE loop (≈16 steps × 18 layers × 2 CFG = hundreds of forwards per utterance), so per-step quantization noise compounds and derails generation. This was validated: a fully-Q8 DiT drops the flow-match cosine to ~0.994 and produces no-EOS runaway or garbled audio.
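
The product quoted above can be worked out directly. This is only a back-of-the-envelope sketch: the helper name `dit_layer_evals` is hypothetical, and it multiplies the approximate figures from the paragraph (steps, layers, and the two classifier-free-guidance passes) rather than measuring anything in CrispASR.

```shell
#!/bin/sh
# Hypothetical helper: DiT layer evaluations in one CFG Euler solve.
# $1 = Euler steps (default 16), $2 = DiT layers (default 18).
# The factor 2 stands for the conditional and unconditional CFG passes.
dit_layer_evals() {
  steps="${1:-16}"
  layers="${2:-18}"
  echo $((steps * layers * 2))
}

dit_layer_evals      # prints 576
dit_layer_evals 8    # prints 288
```

Every one of those evaluations would add its own rounding error if the DiT were quantized, which is the compounding effect the paragraph describes.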

The LLM (cos 0.999 quantized) and PatchEncoder (cos 0.9999 quantized) are robust, so the quants shrink those layers while leaving the DiT, projections, embeddings and denormalization statistics at source precision. Result: q8_0 and q4_k are ASR-roundtrip-verbatim against F16, at a fraction of the size. Do not re-quantize the DiT yourself.

Usage (CrispASR)

# Build CrispASR (see repo README), then:
./build/bin/crispasr --backend dots-tts \
  -m dots-tts-soar-q8_0.gguf \
  --tts "Hello world." \
  --tts-output out.wav

The vocoder companion is auto-discovered as a sibling file next to the core model (dots-tts-soar-vocoder-*.gguf in the same directory), so no extra flag is needed. To point at a vocoder elsewhere, pass --codec-model dots-tts-soar-vocoder-f16.gguf (there is no --tts-vocoder flag). The optional CAM++ speaker encoder (dots-tts-soar-spk-f16.gguf) is discovered the same way and only used when you pass --voice for cloning.
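
Because a missing vocoder would only surface once the model is loaded, a pre-flight check can be useful in scripts. The sketch below imitates the sibling naming rule described above from the outside; `find_vocoder` is a hypothetical helper, not part of CrispASR, and CrispASR's own discovery logic may differ in details.

```shell
#!/bin/sh
# Hypothetical pre-flight check: look for a vocoder companion in the same
# directory as the core model, matching dots-tts-soar-vocoder-*.gguf.
# Prints the first match and returns 0, or returns 1 if none exists.
find_vocoder() {
  core="$1"
  dir=$(dirname "$core")
  for f in "$dir"/dots-tts-soar-vocoder-*.gguf; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no vocoder companion found in $dir" >&2
  return 1
}

# Usage: report the companion if it is present next to the core model.
if vocoder=$(find_vocoder ./dots-tts-soar-q8_0.gguf); then
  echo "using vocoder: $vocoder"
fi
```

The printed path could also be passed explicitly with --codec-model when the files are kept in separate directories.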

Or let the model registry fetch the default (F16 core + vocoder) automatically with -m auto --backend dots-tts.

Tuning knobs (env)

| Variable | Default | Effect |
|---|---|---|
| CRISPASR_DOTS_MAX_PATCHES | 200 | Hard cap on generated audio patches |
| CRISPASR_DOTS_ODE_STEPS | 16 | Flow-matching Euler steps (quality vs speed) |
| CRISPASR_DOTS_EOS_THRESHOLD | 0.5 | End-of-speech probability threshold |
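
These variables are read from the environment, so they can be set per invocation. The wrapper below is a minimal sketch of a quicker "draft" run: the function name `dots_tts_draft` and the `CRISPASR_BIN` override are assumptions of this example (not CrispASR features), and the values 8 and 100 are illustrative rather than recommended settings.

```shell
#!/bin/sh
# Hypothetical wrapper: run the documented command with fewer ODE steps
# and a lower patch cap. CRISPASR_BIN is an assumption of this sketch;
# it only allows the binary path to be overridden.
dots_tts_draft() {
  text="$1"
  out="$2"
  CRISPASR_DOTS_ODE_STEPS=8 \
  CRISPASR_DOTS_MAX_PATCHES=100 \
  "${CRISPASR_BIN:-./build/bin/crispasr}" --backend dots-tts \
    -m dots-tts-soar-q8_0.gguf \
    --tts "$text" \
    --tts-output "$out"
}

# Usage (requires a built CrispASR and the model files):
# dots_tts_draft "Hello world." draft.wav
```

Setting the variables as a prefix to the command keeps them scoped to that single run, so later invocations in the same shell still use the defaults.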

Architecture

  • LLM: Qwen2.5-1.5B (28 layers, 12 Q / 2 KV heads, hidden 1536, head_dim 128, RoPE θ=1e6); autoregressive driver.
  • PatchEncoder: 24-layer VAE semantic encoder (RMSNorm, no RoPE, no QK-norm); re-encodes generated latents back into the LLM stream.
  • DiT flow-matching head: 18 layers (RoPE + AdaLN), CFG Euler ODE; predicts continuous acoustic latents. Kept at F16.
  • Vocoder: BigVGAN, 48 kHz output.
  • Speaker: CAM++ (optional, for voice conditioning).

patch_size=4, latent_dim=128, out_ds_rate=2, dit_dim=1024.

License

Apache-2.0, inherited from the upstream dots/dots.tts-soar model.

Conversion

The models were converted and validated stage by stage (per-component cosine against the PyTorch reference) with the CrispASR diff harness. The mixed-quant files are produced by crispasr-quantize, which automatically preserves the DiT and all sampling-critical tensors at F16.
