OpenEchoTTS: 50K-step checkpoint

Zero-shot, voice-cloning text-to-speech built on a flow-matching DiT over DACVAE audio latents. Given a short reference clip and a text prompt, the model generates speech in the reference speaker's voice, with no speaker embeddings and no per-speaker finetuning.

  • Architecture: BlockDiTT5 (779.6M) + trainable byte-level text encoder (56.8M) = 836M total
  • Audio codec: DACVAE (facebook/dacvae-watermarked), 128-dim latents @ 25 fps, 48 kHz
  • Training data: English podcast transcripts + Emolia emotional-speech corpus
  • Objective: flow matching with classifier-free guidance (10% text drop, 10% speaker drop)
  • Training compute: 8 nodes × 8 GCDs (AMD MI250x on LUMI), global batch 2048, 50K steps
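The classifier-free guidance objective above relies on randomly dropping conditions during training. The sketch below shows one way the 10% text drop and 10% speaker drop could be sampled per example. It assumes the two drops are independent, which the card does not state, and `sample_cfg_drops` is a hypothetical helper rather than a function from the repo. How a dropped condition is represented (zeroed embedding, learned null token, or masked attention) is also left open.

```python
import random

def sample_cfg_drops(batch_size, p_text=0.10, p_speaker=0.10, seed=0):
    """Return one (drop_text, drop_speaker) flag pair per example.

    Assumption: text and speaker drops are sampled independently.
    """
    rng = random.Random(seed)
    flags = []
    for _ in range(batch_size):
        drop_text = rng.random() < p_text
        drop_speaker = rng.random() < p_speaker
        flags.append((drop_text, drop_speaker))
    return flags

flags = sample_cfg_drops(batch_size=2048)
text_rate = sum(t for t, _ in flags) / len(flags)
speaker_rate = sum(s for _, s in flags) / len(flags)
print(f"text dropped: {text_rate:.1%}, speaker dropped: {speaker_rate:.1%}")
```

Under the independence assumption, about 1% of examples would see neither condition, which is what trains the fully unconditional branch used at sampling time.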

Results (LibriSpeech test-clean, cross-pair voice-cloning, 99 pairs)

| Metric | Value |
| --- | --- |
| mean WER | 0.78 |
| median WER | 0.76 |
| mean CER | 0.56 |
| ASR | Whisper-large-v3 |
| CFG scale | 3.0 |
| Sampling steps | 30 (Euler ODE) |

Interpretation: content alignment is still weak at 50K steps. The model produces recognizable speech with the reference speaker's timbre, but it often paraphrases or babbles through parts of long prompts. This checkpoint is a research artifact, not a production TTS; treat it as a starting point.
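For readers unfamiliar with the metrics, the sketch below computes WER and CER as word-level and character-level edit distance divided by the reference length. It is a generic illustration, not the evaluation script used for the numbers above. The only text normalization applied is lowercasing, which is an assumption; real pipelines usually also strip punctuation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference.lower(), hypothesis.lower()) / len(reference)

print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.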

Files

| File | Description |
| --- | --- |
| model.safetensors | DiT decoder weights (bfloat16, 1.45 GB) |
| encoder.safetensors | Byte text encoder weights (fp32, 0.21 GB) |
| config.json | Model + training config + eval metrics |
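As a quick sanity check on a download, the listed file sizes can be reproduced from the parameter counts and dtypes. The sketch below assumes "GB" here means GiB (2^30 bytes), which is inferred from the numbers rather than stated, and it ignores the small safetensors header.

```python
GIB = 1024 ** 3

def checkpoint_size_gib(n_params, bytes_per_param):
    """Approximate tensor payload size in GiB, ignoring file headers."""
    return n_params * bytes_per_param / GIB

dit_gib = checkpoint_size_gib(779.6e6, 2)  # bfloat16: 2 bytes per parameter
enc_gib = checkpoint_size_gib(56.8e6, 4)   # fp32: 4 bytes per parameter
print(f"DiT: {dit_gib:.2f} GiB, encoder: {enc_gib:.2f} GiB")
```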

Usage

# 1. Clone the code repo (model + sampling loop live there)
git clone https://github.com/gijs/openechotts   # adjust URL to match the actual repo
cd openechotts

# 2. Download the weights
pip install huggingface_hub
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("gijs/openechotts-50k", local_dir="ckpt_hf")
PY

Minimal inference (a single script that imports helpers from training_torchtitan/eval/eval_checkpoint.py in the cloned repo):

import numpy as np, soundfile as sf, sys, torch, torchaudio
from safetensors.torch import load_file

REPO = "/path/to/openechotts"
sys.path.insert(0, f"{REPO}/training_torchtitan")
sys.path.insert(0, f"{REPO}/training_torchtitan/eval")
sys.path.insert(0, f"{REPO}/ablation_textenc")

from model import BlockDiTT5, BlockDiTT5Config
from custom_encoder import CustomTextEncoder
from configs import CustomEncoderConfig
from eval_checkpoint import (
    load_dacvae, decode_latent, euler_sample, pad_tokens, pad_latent,
)

device = torch.device("cuda")

# Build + load DiT
import json
with open("ckpt_hf/config.json") as f:
    cfg = json.load(f)
model_cfg = BlockDiTT5Config(**cfg["model_config"])
model = BlockDiTT5(model_cfg).to(device).eval()
model.load_state_dict({k: v.float() for k, v in load_file("ckpt_hf/model.safetensors").items()})

# Build + load byte text encoder
enc_cfg = CustomEncoderConfig(vocab_size=256, dim=768, intermediate_size=2048,
                              n_layers=8, n_heads=6, norm_eps=1e-5)
encoder = CustomTextEncoder(enc_cfg, attn_type="standard", ffn_type="standard").to(device).eval()
encoder.load_state_dict(load_file("ckpt_hf/encoder.safetensors"))

# Encode reference audio via DACVAE (resampled to 48 kHz mono; any length, trimmed to 512 frames / ~20.5 s)
dacvae = load_dacvae(device)
wav, sr = torchaudio.load("reference.wav")
if sr != 48000:
    wav = torchaudio.functional.resample(wav, sr, 48000)
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
with torch.no_grad():
    z = dacvae.encode(wav.unsqueeze(0).to(device))
    if isinstance(z, tuple): z = z[0]
    ref = z.squeeze(0).transpose(0, 1).cpu().numpy()
ref = ref[: (min(ref.shape[0], 512) // 4) * 4]

spk_lat, spk_mask = pad_latent(ref, 512)
spk_lat_t = torch.from_numpy(spk_lat).unsqueeze(0).to(device)
spk_mask_t = torch.from_numpy(spk_mask).unsqueeze(0).to(device)

# Text → bytes → embed
text = "Hello world, this is a zero-shot voice clone demo."
ids, tmask = pad_tokens(list(text.encode("utf-8")), 512)
ids_t = torch.from_numpy(ids).unsqueeze(0).to(device)
tmask_t = torch.from_numpy(tmask).unsqueeze(0).to(device)
# no_grad avoids building an autograd graph during inference
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    text_emb = encoder(ids_t, tmask_t)

# Sample + decode
latent = euler_sample(model, text_emb, tmask_t, spk_lat_t, spk_mask_t,
                      num_steps=30, cfg_scale=3.0, output_length=500,
                      seed=42, device=str(device))
audio = decode_latent(dacvae, latent, device)
sf.write("out.wav", audio, 48000)
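To make the `num_steps` and `cfg_scale` arguments more concrete, here is a toy sketch of fixed-step Euler integration with classifier-free guidance on a two-dimensional "latent". It is not the repo's `euler_sample`: the time direction (0 to 1), the single guidance scale, and the straight-line toy velocity fields are all assumptions made for illustration, and the real sampler may combine text and speaker guidance differently.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Extrapolate from the unconditional toward the conditional velocity."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_integrate(x, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed steps."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-ins for the network: straight-line flows toward fixed points.
TARGET = [1.0, -2.0]  # hypothetical "conditional" endpoint

def toy_velocity(x, t, scale=1.0):
    v_cond = [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]
    v_uncond = [(0.0 - xi) / (1.0 - t) for xi in x]  # unconditional flow heads to the origin
    return cfg_velocity(v_cond, v_uncond, scale)

start = [0.3, 0.7]
sample = euler_integrate(start, toy_velocity, num_steps=30)
guided = euler_integrate(start, lambda x, t: toy_velocity(x, t, scale=3.0), num_steps=30)
print(sample, guided)
```

In this toy setup a scale of 1.0 ends at the conditional target, while a scale of 3.0 extrapolates past it, which gives some intuition for why the guidance scale trades adherence to the conditions against overshoot.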

See demos, emotion demos, and angry voice clone for generated samples at this checkpoint.

Training summary

| Field | Value |
| --- | --- |
| Optimizer | AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01) |
| Schedule | WSD: 5% warmup, 75% stable, cosine decay to 1% |
| Grad clip | 1.0 |
| Mixed precision | bfloat16 autocast, fp32 master weights |
| Global batch | 2048 (32 per GCD × 64 GCDs) |
| Seq lengths | latent 768, text 512, speaker 512 |
| Text CFG drop | 10% |
| Speaker CFG drop | 10% |
| Steps | 50,000 (resumed from a 35k checkpoint of a sibling 50k run) |
| Hardware | 8 nodes × 8 GCDs of AMD MI250x on LUMI |
| Framework | PyTorch 2.7.1 (ROCm 6.2.4), flash-attn 2.7.3 (gfx90a), aws-ofi-rccl 1.4.0 |
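The WSD (warmup-stable-decay) schedule listed above can be sketched as a function of the step. This is a minimal reading of the summary, not the training code: it assumes linear warmup, that the percentages refer to the 50,000 total steps, and that "1%" means 1% of the peak learning rate. `wsd_lr` is a hypothetical name.

```python
import math

def wsd_lr(step, total_steps=50_000, peak_lr=1e-4,
           warmup_frac=0.05, stable_frac=0.75, final_frac=0.01):
    """Learning rate at `step` under a warmup-stable-decay schedule."""
    warmup_steps = round(total_steps * warmup_frac)
    stable_end = warmup_steps + round(total_steps * stable_frac)
    if step < warmup_steps:
        # Linear warmup (assumed shape)
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        return peak_lr
    # Cosine decay from the peak down to final_frac * peak over the remaining steps
    decay_steps = max(total_steps - stable_end, 1)
    progress = min((step - stable_end) / decay_steps, 1.0)
    floor = peak_lr * final_frac
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))

for s in (0, 2_500, 40_000, 45_000, 50_000):
    print(s, wsd_lr(s))
```

With these assumptions, warmup covers the first 2,500 steps, the rate stays at 1e-4 until step 40,000, and the final 10,000 steps decay toward 1e-6.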

WER progression across checkpoints:

| Step | median WER | mean CER |
| --- | --- | --- |
| 30k | 0.97 | - |
| 40k | 0.90 | 0.98 |
| 45k | 0.87 | 0.58 |
| 50k | 0.79 | 0.56 |

Limitations

  • Content alignment is weak. Long prompts tend to be paraphrased or truncated. A mean WER of 0.78 means that roughly only 1 in 5 words of the target prompt is transcribed correctly.
  • No explicit emotion conditioning. Emotion is inferred implicitly from punctuation and word choice in the prompt. The angry voice clone demo works by providing a reference clip that is already angry-sounding.
  • Fixed 20 s output. The sampling loop generates a fixed 500-frame latent (500 frames / 25 fps = 20 s of audio); the trailing silent or low-energy tail is trimmed before ASR.
  • Watermarked DACVAE. Generated audio inherits the inaudible watermark applied by facebook/dacvae-watermarked.
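Because the output length is fixed, short prompts leave a quiet tail that is trimmed before ASR. The card does not describe the trimming rule, so the sketch below is only one plausible approach: drop trailing frames whose RMS energy falls below a threshold relative to the loudest frame. The function name, the 20 ms frame size, and the -40 dB threshold are all hypothetical.

```python
import math

def trim_trailing_silence(samples, sample_rate=48_000, frame_ms=20, threshold_db=-40.0):
    """Drop trailing frames quieter than threshold_db relative to the loudest frame."""
    frame_len = max(int(sample_rate * frame_ms / 1000), 1)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    rms = [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return samples[:0]  # empty or all-silent input
    cutoff = peak * 10 ** (threshold_db / 20)
    last_voiced = max(i for i, r in enumerate(rms) if r >= cutoff)
    return samples[: (last_voiced + 1) * frame_len]

# One second of a 220 Hz tone followed by one second of silence
tone = [math.sin(2 * math.pi * 220 * n / 48_000) for n in range(48_000)]
trimmed = trim_trailing_silence(tone + [0.0] * 48_000)
print(len(trimmed) / 48_000, "seconds kept")
```

A relative threshold keeps the rule independent of overall loudness, at the cost of being sensitive to a single loud click in the clip.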

License

Apache-2.0 for the model weights and code. Training data licenses apply to the derived model; please review the TTS-AGI/podcast-tokenized-bg2.5-enj4.5 and TTS-AGI/emolia-hq-tokenized dataset cards before commercial use.

Citation

If you use this model, please cite it as:

@misc{openechotts_50k_2026,
  author = {Wijngaard, Gijs},
  title  = {OpenEchoTTS: 50K-step BlockDiT flow-matching TTS checkpoint},
  year   = {2026},
  url    = {https://huggingface.co/gijs/openechotts-50k},
}