OpenEchoTTS — 50K-step checkpoint

Zero-shot, voice-cloning text-to-speech built on a flow-matching DiT over DACVAE audio latents. Given a short reference clip and a text prompt, the model generates speech in the reference speaker's voice — no speaker embeddings, no per-speaker finetuning.

Architecture: BlockDiTT5 (779.6M) + trainable byte-level text encoder (56.8M) = 836M total
Audio codec: DACVAE (facebook/dacvae-watermarked) — 128-dim latents @ 25 fps, 48 kHz
Training data: English podcast transcripts + Emolia emotional-speech corpus
Objective: flow matching with classifier-free guidance (10% text drop, 10% speaker drop)
Training compute: 8 nodes × 8 GCDs (AMD MI250x on LUMI), global batch 2048, 50K steps
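
The 10% text and speaker drop rates are what make classifier-free guidance usable at sampling time: during training, each conditioning signal is occasionally replaced by a null input so the model also learns an unconditional prediction. A minimal sketch of that dropout step follows. It assumes the two drops are sampled independently per example (the card does not say whether they are), and `drop_conditions` is a hypothetical helper, not a function from the repo.

```python
import random

def drop_conditions(batch_size, p_text=0.10, p_speaker=0.10, seed=0):
    """Return per-example (keep_text, keep_speaker) flags for CFG training.

    Assumption: text and speaker drops are independent Bernoulli draws.
    """
    rng = random.Random(seed)
    flags = []
    for _ in range(batch_size):
        keep_text = rng.random() >= p_text        # False -> use the null text input
        keep_speaker = rng.random() >= p_speaker  # False -> use the null speaker input
        flags.append((keep_text, keep_speaker))
    return flags

flags = drop_conditions(batch_size=2048)
text_dropped = sum(not t for t, _ in flags) / len(flags)
speaker_dropped = sum(not s for _, s in flags) / len(flags)
print(f"text dropped: {text_dropped:.3f}, speaker dropped: {speaker_dropped:.3f}")
```

Under this independence assumption, about 1% of examples would be trained with both conditions removed.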

Results (LibriSpeech test-clean, cross-pair voice-cloning, 99 pairs)

Metric	Value
mean WER	0.78
median WER	0.76
mean CER	0.56
ASR	Whisper-large-v3
CFG scale	3.0
Sampling steps	30 (Euler ODE)

Interpretation: content alignment is still weak at 50K steps. The model produces recognizable speech with the reference speaker's timbre, but it often paraphrases or babbles parts of long prompts. This checkpoint is a research artifact, not a production TTS system; treat it as a starting point.
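
For readers less familiar with the metrics, WER and CER are edit distances between the ASR transcript and the target prompt, normalized by the length of the prompt. The sketch below is a plain-Python illustration, not the evaluation script; the text normalization (lowercasing and whitespace splitting) is an assumption, and the real pipeline may normalize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference.lower()), list(hypothesis.lower())) / len(reference)

# One deleted word ("brown") and one substitution ("jumps" -> "jumped"): 2 / 5
print(wer("the quick brown fox jumps", "the quick fox jumped"))  # 0.4
```

Because insertions count as errors, WER can exceed 1.0 when the model babbles extra words, so values near 1 do not necessarily mean that no words were correct.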

Files

File	Description
`model.safetensors`	DiT decoder weights (bfloat16, 1.45 GB)
`encoder.safetensors`	Byte text encoder weights (fp32, 0.21 GB)
`config.json`	Model + training config + eval metrics

Usage

# 1. Clone the code repo (model + sampling loop live there)
git clone https://github.com/gijs/openechotts   # adjust URL to match the actual repo
cd openechotts

# 2. Download the weights
pip install huggingface_hub
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("gijs/openechotts-50k", local_dir="ckpt_hf")
PY

Minimal inference (a single script; it relies on helpers from training_torchtitan/eval/eval_checkpoint.py in the code repo):

import json
import sys

import soundfile as sf
import torch
import torchaudio
from safetensors.torch import load_file

REPO = "/path/to/openechotts"
sys.path.insert(0, f"{REPO}/training_torchtitan")
sys.path.insert(0, f"{REPO}/training_torchtitan/eval")
sys.path.insert(0, f"{REPO}/ablation_textenc")

from model import BlockDiTT5, BlockDiTT5Config
from custom_encoder import CustomTextEncoder
from configs import CustomEncoderConfig
from eval_checkpoint import (
    load_dacvae, decode_latent, euler_sample, pad_tokens, pad_latent,
)

device = torch.device("cuda")

# Build + load DiT (weights are stored in bfloat16 and upcast to fp32 here)
with open("ckpt_hf/config.json") as f:
    cfg = json.load(f)
model_cfg = BlockDiTT5Config(**cfg["model_config"])
model = BlockDiTT5(model_cfg).to(device).eval()
model.load_state_dict({k: v.float() for k, v in load_file("ckpt_hf/model.safetensors").items()})

# Build + load byte text encoder
enc_cfg = CustomEncoderConfig(vocab_size=256, dim=768, intermediate_size=2048,
                              n_layers=8, n_heads=6, norm_eps=1e-5)
encoder = CustomTextEncoder(enc_cfg, attn_type="standard", ffn_type="standard").to(device).eval()
encoder.load_state_dict(load_file("ckpt_hf/encoder.safetensors"))

# Encode reference audio via DACVAE (resampled to 48 kHz mono; any length, trimmed to 512 frames / ~20.5 s)
dacvae = load_dacvae(device)
wav, sr = torchaudio.load("reference.wav")
if sr != 48000:
    wav = torchaudio.functional.resample(wav, sr, 48000)
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)
with torch.no_grad():
    z = dacvae.encode(wav.unsqueeze(0).to(device))
    if isinstance(z, tuple): z = z[0]
    ref = z.squeeze(0).transpose(0, 1).cpu().numpy()
ref = ref[: (min(ref.shape[0], 512) // 4) * 4]

spk_lat, spk_mask = pad_latent(ref, 512)
spk_lat_t = torch.from_numpy(spk_lat).unsqueeze(0).to(device)
spk_mask_t = torch.from_numpy(spk_mask).unsqueeze(0).to(device)

# Text → bytes → embed
text = "Hello world, this is a zero-shot voice clone demo."
ids, tmask = pad_tokens(list(text.encode("utf-8")), 512)
ids_t = torch.from_numpy(ids).unsqueeze(0).to(device)
tmask_t = torch.from_numpy(tmask).unsqueeze(0).to(device)
# Inference only: disable autograd while embedding the text
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    text_emb = encoder(ids_t, tmask_t)

# Sample + decode
latent = euler_sample(model, text_emb, tmask_t, spk_lat_t, spk_mask_t,
                      num_steps=30, cfg_scale=3.0, output_length=500,
                      seed=42, device=str(device))
audio = decode_latent(dacvae, latent, device)
sf.write("out.wav", audio, 48000)
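
The `euler_sample` call above hides two ideas: a fixed-step Euler integration of the learned velocity field, and classifier-free guidance applied to that velocity at every step. The toy sketch below shows both on scalar values. It is not the repo's implementation: the time convention (noise at t=0, data at t=1), the single joint guidance scale, and the `make_field` velocity fields are all assumptions chosen so the result is easy to check by hand.

```python
def guided_velocity(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def euler_sample_toy(velocity_cond, velocity_uncond, x0, num_steps=30, cfg_scale=3.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy fields below stay finite
        v = guided_velocity(velocity_cond(x, t), velocity_uncond(x, t), cfg_scale)
        x = x + dt * v
    return x

def make_field(target):
    """Hypothetical velocity field that carries any x to `target` at t=1."""
    return lambda x, t: (target - x) / (1.0 - t)

# Conditional field points at 1.0, unconditional field at 0.0.
x_guided = euler_sample_toy(make_field(1.0), make_field(0.0), x0=-0.5, cfg_scale=3.0)
x_plain = euler_sample_toy(make_field(1.0), make_field(0.0), x0=-0.5, cfg_scale=1.0)
print(x_guided, x_plain)
```

With a scale of 1.0 the sample lands on the conditional target; with 3.0 it overshoots to roughly three times the gap between the two targets, which is the extrapolation that guidance performs in latent space.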

See demos, emotion demos, and angry voice clone for generated samples at this checkpoint.

Training summary

Field	Value
Optimizer	AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
Schedule	WSD — 5% warmup, 75% stable, cosine decay to 1%
Grad clip	1.0
Mixed precision	bfloat16 autocast, fp32 master weights
Global batch	2048 (32 per GCD × 64 GCDs)
Seq lengths	latent 768, text 512, speaker 512
Text CFG drop	10%
Speaker CFG drop	10%
Steps	50,000 (resumed from a 35k checkpoint of a sibling 50k run)
Hardware	8 nodes × 8 GCDs of AMD MI250x on LUMI
Framework	PyTorch 2.7.1 (ROCm 6.2.4), flash-attn 2.7.3 (gfx90a), aws-ofi-rccl 1.4.0
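
The WSD (warmup, stable, decay) schedule in the table can be sketched as a function of the step index. This is one reading of the table entry rather than the training code: it assumes linear warmup from zero, that the remaining 20% of steps form the cosine decay phase, and that "1%" means 1% of the peak learning rate. `wsd_lr` is a hypothetical name.

```python
import math

def wsd_lr(step, total_steps=50_000, peak_lr=1e-4,
           warmup_frac=0.05, stable_frac=0.75, final_ratio=0.01):
    """Learning rate at `step` under a warmup-stable-decay schedule."""
    warmup_steps = round(warmup_frac * total_steps)               # 2,500
    stable_end = warmup_steps + round(stable_frac * total_steps)  # 40,000
    min_lr = final_ratio * peak_lr
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumed)
    if step < stable_end:
        return peak_lr                        # constant plateau
    # Cosine decay from peak_lr to min_lr over the remaining steps
    progress = min(1.0, (step - stable_end) / (total_steps - stable_end))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 1_250, 2_500, 40_000, 45_000, 50_000):
    print(s, f"{wsd_lr(s):.2e}")
```

Under these assumptions the decay phase starts at step 40,000, so the 45k and 50k checkpoints in the table that follows would both have been saved during decay.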

WER progression across checkpoints:

Step	median WER	mean CER
30k	0.97	—
40k	0.90	0.98
45k	0.87	0.58
50k	0.79	0.56

Limitations

Content alignment is weak. Long prompts tend to be paraphrased or truncated. A mean WER of 0.78 means that only roughly 1 in 5 words matches the target prompt.
No explicit emotion conditioning. Emotion is inferred implicitly from punctuation and word choice in the prompt. The angry voice clone demo works by providing a reference clip that is already angry-sounding.
Fixed 20 s output. The sampling loop generates a fixed 500-frame latent (~20 s of audio); the trailing silence/low-energy tail is trimmed before ASR.
Watermarked DACVAE. Generated audio carries the inaudible watermark embedded by facebook/dacvae-watermarked.
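
The frame counts quoted in this card map to durations through the codec's 25 fps latent rate. The arithmetic is small but easy to get wrong, so here is a sketch; the constant and function names are illustrative, and the samples-per-frame figure assumes the codec hop is exactly 48,000 / 25.

```python
SAMPLE_RATE_HZ = 48_000  # DACVAE audio sample rate
LATENT_FPS = 25          # latent frames per second

# Audio samples covered by one latent frame (assumes an exact hop size)
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // LATENT_FPS

def frames_to_seconds(num_frames):
    """Duration in seconds of a latent sequence with `num_frames` frames."""
    return num_frames / LATENT_FPS

def frames_to_samples(num_frames):
    """Number of 48 kHz audio samples decoded from `num_frames` frames."""
    return num_frames * SAMPLES_PER_FRAME

print(SAMPLES_PER_FRAME)       # 1920
print(frames_to_seconds(500))  # 20.0  (sampling output_length)
print(frames_to_seconds(512))  # 20.48 (speaker reference cap)
print(frames_to_seconds(768))  # 30.72 (training latent sequence length)
print(frames_to_samples(500))  # 960000
```

The training latent length of 768 frames corresponds to about 30.7 s, so the 500-frame output is a choice made in the sampling call rather than a hard limit of the sequence length used in training.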

License

Apache-2.0 for the model weights and code. Training data licenses apply to the derived model; please review the TTS-AGI/podcast-tokenized-bg2.5-enj4.5 and TTS-AGI/emolia-hq-tokenized dataset cards before commercial use.

Citation

If you use this model, please cite it as:

@misc{openechotts_50k_2026,
  author = {Wijngaard, Gijs},
  title  = {OpenEchoTTS — 50K-step BlockDiT flow-matching TTS checkpoint},
  year   = {2026},
  url    = {https://huggingface.co/gijs/openechotts-50k},
}
