# Echolancer Stage 3 Zero-Shot TTS
A decoder-only transformer text-to-speech model capable of zero-shot voice cloning from a single reference audio sample.
## Model Description
Echolancer is a neural codec language model for text-to-speech synthesis. This Stage 3 checkpoint enables zero-shot voice cloning: the ability to synthesize speech in any voice from just a few seconds of reference audio.
## Key Features

- **Zero-Shot Voice Cloning**: clone any voice from a single audio sample
- **High-Quality Audio**: 24 kHz output via NeuCodec, optionally enhanced to 48 kHz with Brontes
- **Efficient Inference**: decoder-only architecture with KV caching
- **Controllable Generation**: temperature and nucleus (top-p) sampling parameters
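The `temperature` and `top_p` parameters above control the usual temperature-scaled nucleus sampling over next-token logits. A minimal sketch of that sampling step (a generic implementation, not Echolancer's internal code; the vocabulary size comes from the NeuCodec tokenizer listed below):

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.92) -> int:
    """Temperature + nucleus (top-p) sampling over a 1-D logits vector."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative mass exceeds top_p
    cutoff = int(torch.searchsorted(cumulative, top_p).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_idx[choice].item())

token = sample_top_p(torch.randn(65_536))  # index into the 65,536-entry codec vocabulary
```

Lower `temperature` / `top_p` values make generation more deterministic; higher values add variety at the cost of occasional artifacts.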
## Architecture
| Component | Details |
|---|---|
| Architecture | Decoder-only Transformer |
| Positional Encoding | ALiBi (Attention with Linear Biases) |
| Audio Tokenizer | NeuCodec (65,536 vocab) |
| Speaker Encoder | ECAPA-TDNN (192-dim embeddings) |
| Text Tokenizer | Character-level |
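ALiBi, listed in the table above, replaces learned positional embeddings with a per-head linear penalty added directly to the attention logits, which helps decoder-only models extrapolate to sequences longer than those seen in training. A minimal sketch of the bias construction (head count and function name are illustrative, not taken from Echolancer's code):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build the ALiBi additive attention bias: each head penalizes
    attention to distant positions with a head-specific linear slope."""
    # Geometric slope schedule from the ALiBi paper: 2^(-8h/num_heads)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, negative for past keys
    distance = pos[None, :] - pos[:, None]
    # Result (num_heads, seq_len, seq_len) is added to attention logits,
    # typically together with a causal mask
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=16)
```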
## Intended Use
This model is designed for:
- Voice cloning applications
- Personalized TTS systems
- Creative audio content generation
- Accessibility tools
## Usage

### Installation
```bash
git clone https://github.com/ZDisket/Echolancer
cd Echolancer
pip install -r requirements.txt
```
### Quick Start
```python
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

from echolancerfe import EcholancerFE
from neucodecfe import NeuCodecFE

# Load models
echolancer = EcholancerFE(model_config_path="config/model_stage3_zs.yaml")
echolancer.load_checkpoint("path/to/checkpoint.pt")
neu_codec = NeuCodecFE(is_cuda=True, offset=echolancer.get_vocab_offset())
speaker_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Extract speaker embedding from reference audio
def get_speaker_embedding(audio_path):
    signal, fs = torchaudio.load(audio_path)
    signal = signal.mean(dim=0, keepdim=True)  # mono
    if fs != 16000:
        signal = torchaudio.transforms.Resample(fs, 16000)(signal)
    return speaker_encoder.encode_batch(signal).squeeze(0)

# Generate speech
speaker_emb = get_speaker_embedding("reference.wav")
with torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16):
    codes = echolancer.infer(
        text="Hello, this is a test.",
        speaker_id=speaker_emb,
        temperature=0.8,
        top_p=0.92,
        max_length=1024
    )

# Decode to audio
waveform = neu_codec.decode_codes(codes.unsqueeze(1))
torchaudio.save("output.wav", waveform[0].cpu(), 24000)
```
## Training
This is a Stage 3 zero-shot checkpoint, trained to:
- Accept speaker embeddings from ECAPA-TDNN encoder
- Generate speaker-conditioned audio tokens
- Generalize to unseen speakers
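One common way to condition a decoder-only model on a speaker embedding, as described above, is to project the 192-dim ECAPA-TDNN vector into the model dimension and prepend it to the token sequence. The sketch below illustrates that pattern only; the class, layer names, and `d_model` value are assumptions, not Echolancer's actual implementation:

```python
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Illustrative: project a speaker embedding into the decoder's hidden
    size and prepend it as a conditioning token (dimensions are assumed)."""
    def __init__(self, spk_dim: int = 192, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(spk_dim, d_model)

    def forward(self, token_embs: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq, d_model); spk_emb: (batch, spk_dim)
        cond = self.proj(spk_emb).unsqueeze(1)       # (batch, 1, d_model)
        return torch.cat([cond, token_embs], dim=1)  # (batch, seq + 1, d_model)

conditioner = SpeakerConditioner()
out = conditioner(torch.randn(2, 10, 1024), torch.randn(2, 192))
```

Because the speaker vector is continuous rather than a lookup-table ID, the model can generalize to embeddings of speakers never seen in training.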
## Limitations
- **English only**: currently trained on English data
- **Reference quality**: output quality depends on the clarity of the reference audio
- **Short references**: may struggle with very short (<1 s) reference clips
- **Prosody**: clones voice timbre only, not prosody or emotion
## Ethical Considerations
This technology can generate speech that sounds like real people. Users should:
- Obtain consent before cloning someone's voice
- Not use for deception, fraud, or impersonation
- Clearly label AI-generated content
- Follow applicable laws regarding synthetic media
## Links

- Repository: https://github.com/ZDisket/Echolancer
- License: MIT