OpenEchoTTS: 50K-step checkpoint
Zero-shot, voice-cloning text-to-speech built on a flow-matching DiT over DACVAE audio latents. Given a short reference clip and a text prompt, the model generates speech in the reference speaker's voice, with no speaker embeddings and no per-speaker finetuning.
- Architecture: BlockDiTT5 (779.6M) + trainable byte-level text encoder (56.8M) = 836M total
- Audio codec: DACVAE (facebook/dacvae-watermarked), 128-dim latents @ 25 fps, 48 kHz
- Training data: English podcast transcripts + Emolia emotional-speech corpus
- Objective: flow matching with classifier-free guidance (10% text drop, 10% speaker drop)
- Training compute: 8 nodes × 8 GCDs (AMD MI250x on LUMI), global batch 2048, 50K steps
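The objective bullet above can be illustrated with a toy sketch. It assumes a linear (rectified-flow style) interpolation path and independent per-sample dropout of the text and speaker conditions; neither detail is confirmed by this card, and `drop_conditions` and `flow_matching_pair` are hypothetical names, not functions from the repository.

```python
import random

def drop_conditions(rng, p_text=0.10, p_speaker=0.10):
    """Independently decide whether to null each condition for one sample.

    Assumption: the two drops are independent; the real training code may differ.
    """
    drop_text = rng.random() < p_text
    drop_speaker = rng.random() < p_speaker
    return drop_text, drop_speaker

def flow_matching_pair(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 and its velocity x1 - x0.

    x0 is a noise sample, x1 a data latent; the model would regress the velocity.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

# Empirical drop rates over many simulated training samples.
rng = random.Random(0)
n = 100_000
drops = [drop_conditions(rng) for _ in range(n)]
text_rate = sum(d[0] for d in drops) / n
speaker_rate = sum(d[1] for d in drops) / n
both_rate = sum(d[0] and d[1] for d in drops) / n

# One interpolated training point on a 2-dimensional toy latent.
x_t, v = flow_matching_pair(x0=[0.0, 2.0], x1=[1.0, 0.0], t=0.25)
```

Under the independence assumption, about 1% of samples are trained with both conditions dropped, which is what gives the model a fully unconditional branch for guidance at sampling time.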
Results (LibriSpeech test-clean, cross-pair voice-cloning, 99 pairs)
| Metric | Value |
|---|---|
| mean WER | 0.78 |
| median WER | 0.76 |
| mean CER | 0.56 |
| ASR | Whisper-large-v3 |
| CFG scale | 3.0 |
| Sampling steps | 30 (Euler ODE) |
Interpretation: content alignment is still weak at 50K steps; the model produces recognizable speech with the reference speaker's timbre, but often paraphrases or babbles parts of long prompts. This checkpoint is a research artifact, not a production TTS; treat it as a starting point.
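For readers unfamiliar with the metric, word error rate is the word-level edit distance between the ASR transcript and the prompt, divided by the prompt length. The sketch below is a minimal reference implementation, not the evaluation script used for this card; it applies only lowercasing and whitespace splitting, whereas a real pipeline would normally also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

perfect = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
truncated = word_error_rate("the cat sat on the mat", "the cat sat")
babbled = word_error_rate("hi", "hi there you")
```

Because insertions count as errors, WER can exceed 1.0 when the model babbles extra words, so a mean can be pulled up by a few bad pairs while the median stays lower.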
Files
| File | Description |
|---|---|
| model.safetensors | DiT decoder weights (bfloat16, 1.45 GB) |
| encoder.safetensors | Byte text encoder weights (fp32, 0.21 GB) |
| config.json | Model + training config + eval metrics |
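The listed file sizes can be cross-checked against the parameter counts given earlier. The sketch below assumes that "GB" in the table means binary gigabytes (2^30 bytes) and ignores the small safetensors header; both are assumptions, not statements from the card.

```python
GIB = 2 ** 30  # bytes per binary gigabyte (assumed unit of the table above)

def weights_size_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate on-disk size of a dense weight file, ignoring metadata."""
    return n_params * bytes_per_param / GIB

dit_gib = weights_size_gib(779.6e6, 2)      # bfloat16 = 2 bytes per parameter
encoder_gib = weights_size_gib(56.8e6, 4)   # fp32 = 4 bytes per parameter
total_params_m = 779.6 + 56.8               # in millions

print(f"DiT: {dit_gib:.2f} GiB, encoder: {encoder_gib:.2f} GiB, "
      f"total params: {total_params_m:.1f}M")
```

Under that assumption the computed sizes round to the values in the table, which suggests the two files hold exactly the DiT and encoder parameters and nothing else of significant size.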
Usage
# 1. Clone the code repo (model + sampling loop live there)
git clone https://github.com/gijs/openechotts # adjust URL to match the actual repo
cd openechotts
# 2. Download the weights
pip install huggingface_hub
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("gijs/openechotts-50k", local_dir="ckpt_hf")
PY
Minimal inference (self-contained, uses training_torchtitan/eval/eval_checkpoint.py helpers):
import numpy as np, soundfile as sf, sys, torch, torchaudio
from safetensors.torch import load_file
REPO = "/path/to/openechotts"
sys.path.insert(0, f"{REPO}/training_torchtitan")
sys.path.insert(0, f"{REPO}/training_torchtitan/eval")
sys.path.insert(0, f"{REPO}/ablation_textenc")
from model import BlockDiTT5, BlockDiTT5Config
from custom_encoder import CustomTextEncoder
from configs import CustomEncoderConfig
from eval_checkpoint import (
    load_dacvae, decode_latent, euler_sample, pad_tokens, pad_latent,
)
device = torch.device("cuda")
# Build + load DiT
import json
cfg = json.load(open("ckpt_hf/config.json"))
model_cfg = BlockDiTT5Config(**cfg["model_config"])
model = BlockDiTT5(model_cfg).to(device).eval()
model.load_state_dict({k: v.float() for k, v in load_file("ckpt_hf/model.safetensors").items()})
# Build + load byte text encoder
enc_cfg = CustomEncoderConfig(vocab_size=256, dim=768, intermediate_size=2048,
                              n_layers=8, n_heads=6, norm_eps=1e-5)
encoder = CustomTextEncoder(enc_cfg, attn_type="standard", ffn_type="standard").to(device).eval()
encoder.load_state_dict(load_file("ckpt_hf/encoder.safetensors"))
# Encode reference audio via DACVAE (resampled to 48 kHz mono, any length; trimmed to 512 frames / ~20.5 s)
dacvae = load_dacvae(device)
wav, sr = torchaudio.load("reference.wav")
if sr != 48000:
    wav = torchaudio.functional.resample(wav, sr, 48000)
if wav.shape[0] > 1:
    # Downmix multi-channel audio to mono
    wav = wav.mean(0, keepdim=True)
with torch.no_grad():
    z = dacvae.encode(wav.unsqueeze(0).to(device))
if isinstance(z, tuple): z = z[0]
ref = z.squeeze(0).transpose(0, 1).cpu().numpy()
ref = ref[: (min(ref.shape[0], 512) // 4) * 4]
spk_lat, spk_mask = pad_latent(ref, 512)
spk_lat_t = torch.from_numpy(spk_lat).unsqueeze(0).to(device)
spk_mask_t = torch.from_numpy(spk_mask).unsqueeze(0).to(device)
# Text -> bytes -> embed
text = "Hello world, this is a zero-shot voice clone demo."
ids, tmask = pad_tokens(list(text.encode("utf-8")), 512)
ids_t = torch.from_numpy(ids).unsqueeze(0).to(device)
tmask_t = torch.from_numpy(tmask).unsqueeze(0).to(device)
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    text_emb = encoder(ids_t, tmask_t)
# Sample + decode
latent = euler_sample(model, text_emb, tmask_t, spk_lat_t, spk_mask_t,
                      num_steps=30, cfg_scale=3.0, output_length=500,
                      seed=42, device=str(device))
audio = decode_latent(dacvae, latent, device)
sf.write("out.wav", audio, 48000)
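To make the `num_steps=30, cfg_scale=3.0` arguments concrete, here is a toy version of Euler ODE sampling with classifier-free guidance. It is a sketch under stated assumptions, not the repository's `euler_sample`: it uses a single guidance term (the real model has separate text and speaker conditions, and how they are combined is not described in this card), integrates from t=0 to t=1, and replaces the DiT with constant toy velocities. `cfg_velocity` and `euler_sample_toy` are hypothetical names.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. scale=1 returns v_cond, scale=0 returns v_uncond."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample_toy(velocity_fn, x0, num_steps=30):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-ins for the two model evaluations made at every step.
V_COND = [1.0, -2.0]    # prediction with conditioning
V_UNCOND = [0.5, 0.0]   # prediction with conditioning dropped

def guided_field(x, t, scale=3.0):
    return cfg_velocity(V_COND, V_UNCOND, scale)

guided = cfg_velocity(V_COND, V_UNCOND, 3.0)
sample = euler_sample_toy(guided_field, x0=[0.0, 0.0], num_steps=30)
```

Each Euler step costs one guided velocity evaluation, so with a single guidance term 30 steps means 60 forward passes unless the conditional and unconditional inputs are batched together.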
See demos, emotion demos, and angry voice clone for generated samples at this checkpoint.
Training summary
| Field | Value |
|---|---|
| Optimizer | AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01) |
| Schedule | WSD: 5% warmup, 75% stable, cosine decay to 1% |
| Grad clip | 1.0 |
| Mixed precision | bfloat16 autocast, fp32 master weights |
| Global batch | 2048 (32 per GCD × 64 GCDs) |
| Seq lengths | latent 768, text 512, speaker 512 |
| Text CFG drop | 10% |
| Speaker CFG drop | 10% |
| Steps | 50,000 (resumed from a 35k checkpoint of a sibling 50k run) |
| Hardware | 8 nodes × 8 GCDs of AMD MI250x on LUMI |
| Framework | PyTorch 2.7.1 (ROCm 6.2.4), flash-attn 2.7.3 (gfx90a), aws-ofi-rccl 1.4.0 |
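The WSD (warmup-stable-decay) schedule in the table can be sketched as a function of the step. This is one plausible reading, with assumptions labeled: warmup is linear from zero, the cosine decay occupies the remaining 20% of steps, and "to 1%" means 1% of the peak learning rate. `wsd_lr` is a hypothetical helper, not the scheduler from the training code, and it does not model the resume from the sibling run.

```python
import math

def wsd_lr(step, total_steps=50_000, peak_lr=1e-4,
           warmup_frac=0.05, stable_frac=0.75, final_ratio=0.01):
    """Learning rate at `step` under a warmup-stable-decay schedule."""
    warmup_end = round(total_steps * warmup_frac)
    decay_start = round(total_steps * (warmup_frac + stable_frac))
    if step < warmup_end:
        # Assumed linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_end
    if step < decay_start:
        # Stable phase at the peak learning rate
        return peak_lr
    # Cosine decay from peak_lr down to final_ratio * peak_lr
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

for s in (0, 2_500, 40_000, 45_000, 50_000):
    print(s, f"{wsd_lr(s):.3e}")
```

Under these assumptions the decay phase starts at step 40,000, so the 40k, 45k, and 50k checkpoints in the progression table below all fall at or after the point where the learning rate begins to drop.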
WER progression across checkpoints:
| Step | median WER | mean CER |
|---|---|---|
| 30k | 0.97 | n/a |
| 40k | 0.90 | 0.98 |
| 45k | 0.87 | 0.58 |
| 50k | 0.79 | 0.56 |
Limitations
- Content alignment is weak. Long prompts tend to be paraphrased or truncated. A mean WER of 0.78 means that, roughly, only about 1 in 5 words of the target prompt is reproduced correctly.
- No explicit emotion conditioning. Emotion is inferred implicitly from punctuation and word choice in the prompt. The angry voice clone demo works by providing a reference clip that is already angry-sounding.
- Fixed 20 s output. The sampling loop generates a fixed 500-frame latent (~20 s of audio at 25 fps); the trailing silence/low-energy tail is trimmed before ASR.
- Watermarked DACVAE. Generated audio inherits the inaudible watermark of Facebook's dacvae-watermarked codec.
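The fixed-length limitation implies a post-processing step that cuts the low-energy tail. The sketch below shows one simple way to do that with a frame-wise RMS threshold. The 40 ms frame (one latent frame at 25 fps) and the 0.01 threshold are illustrative assumptions, and `trim_low_energy_tail` is a hypothetical helper rather than the trimming used in the evaluation.

```python
import math

def frames_to_seconds(n_frames: int, fps: int = 25) -> float:
    """Duration of a latent sequence at the codec frame rate."""
    return n_frames / fps

def trim_low_energy_tail(samples, sample_rate=48_000, frame_ms=40, rms_threshold=0.01):
    """Drop trailing frames whose RMS is below the threshold.

    Walks backwards one frame at a time and stops at the first frame
    that is loud enough; everything after it is discarded.
    """
    frame = int(sample_rate * frame_ms / 1000)
    end = len(samples)
    while end > 0:
        start = max(0, end - frame)
        chunk = samples[start:end]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= rms_threshold:
            break
        end = start
    return samples[:end]

# 0.1 s of constant "signal" followed by 0.2 s of digital silence.
audio = [0.5] * 4_800 + [0.0] * 9_600
trimmed = trim_low_energy_tail(audio)
```

A fixed threshold like this is fragile on noisy tails; a production pipeline would more likely use a voice-activity detector or a threshold relative to the clip's peak level.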
License
Apache-2.0 for the model weights and code. Training data licenses apply to the derived model; please review the TTS-AGI/podcast-tokenized-bg2.5-enj4.5 and TTS-AGI/emolia-hq-tokenized dataset cards before commercial use.
Citation
If you use this model, please cite it as:
@misc{openechotts_50k_2026,
  author = {Wijngaard, Gijs},
  title  = {OpenEchoTTS: 50K-step BlockDiT flow-matching TTS checkpoint},
  year   = {2026},
  url    = {https://huggingface.co/gijs/openechotts-50k},
}