Echolancer Stage 3 Zero-Shot TTS

A decoder-only transformer text-to-speech model capable of zero-shot voice cloning from a single reference audio sample.

Model Description

Echolancer is a neural codec language model for text-to-speech synthesis. This Stage 3 checkpoint enables zero-shot voice cloning: synthesizing speech in a target voice from just a few seconds of reference audio.

Key Features

  • 🎯 Zero-Shot Voice Cloning: Clone any voice from a single audio sample
  • 🔊 High-Quality Audio: 24 kHz output via NeuCodec, optionally enhanced to 48 kHz with Brontes
  • ⚡ Efficient Inference: Decoder-only architecture with KV caching
  • 🎛️ Controllable Generation: Temperature and nucleus sampling parameters

Architecture

Component             Details
Architecture          Decoder-only Transformer
Positional Encoding   ALiBi (Attention with Linear Biases)
Audio Tokenizer       NeuCodec (65,536-token vocabulary)
Speaker Encoder       ECAPA-TDNN (192-dim embeddings)
Text Tokenizer        Character-level
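ALiBi replaces learned positional embeddings with a per-head linear penalty added to attention scores, which helps decoder-only models extrapolate to longer sequences. A minimal sketch of the bias matrix, assuming the geometric slope schedule from the ALiBi paper and a power-of-two head count:

```python
import torch

def alibi_bias(seq_len, num_heads):
    """Build the per-head ALiBi attention bias.

    Each head h adds slope_h * (key_pos - query_pos) to its attention
    scores, so keys further in the past are penalized more. Future
    positions get positive bias here; in a causal decoder the causal
    mask removes them anyway. Illustrative sketch only.
    """
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()  # rel[i, j] = j - i (<= 0 for past keys)
    return slopes[:, None, None] * rel[None, :, :]  # [heads, query, key]
```

The resulting tensor is added to the raw attention scores before the softmax.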

Intended Use

This model is designed for:

  • Voice cloning applications
  • Personalized TTS systems
  • Creative audio content generation
  • Accessibility tools

Usage

Installation

git clone https://github.com/ZDisket/Echolancer
cd Echolancer
pip install -r requirements.txt

Quick Start

import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
from echolancerfe import EcholancerFE
from neucodecfe import NeuCodecFE

# Load models
echolancer = EcholancerFE(model_config_path="config/model_stage3_zs.yaml")
echolancer.load_checkpoint("path/to/checkpoint.pt")

neu_codec = NeuCodecFE(is_cuda=True, offset=echolancer.get_vocab_offset())
speaker_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Extract speaker embedding from reference audio
def get_speaker_embedding(audio_path):
    signal, fs = torchaudio.load(audio_path)
    signal = signal.mean(dim=0, keepdim=True)  # mono
    if fs != 16000:
        signal = torchaudio.transforms.Resample(fs, 16000)(signal)
    return speaker_encoder.encode_batch(signal).squeeze(0)

# Generate speech
speaker_emb = get_speaker_embedding("reference.wav")

with torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16):
    codes = echolancer.infer(
        text="Hello, this is a test.",
        speaker_id=speaker_emb,
        temperature=0.8,
        top_p=0.92,
        max_length=1024
    )
    
    # Decode to audio
    waveform = neu_codec.decode_codes(codes.unsqueeze(1))
    torchaudio.save("output.wav", waveform[0].cpu(), 24000)

Training

This is a Stage 3 zero-shot checkpoint, trained to:

  1. Accept speaker embeddings from ECAPA-TDNN encoder
  2. Generate speaker-conditioned audio tokens
  3. Generalize to unseen speakers

Limitations

  • English only: Currently trained on English data
  • Reference quality: Output quality depends on reference audio clarity
  • Short references: May struggle with very short (<1s) reference clips
  • Prosody: Does not clone prosody/emotion, only voice timbre
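Given the short-reference limitation above, it may help to validate the reference clip's duration before extracting a speaker embedding. A small hypothetical guard (not part of the Echolancer API):

```python
import torch

def validate_reference(waveform, sample_rate, min_seconds=1.0):
    """Check a loaded reference clip against the short-reference limitation.

    Hypothetical helper: returns the duration in seconds, or raises if
    the clip is shorter than min_seconds.
    """
    duration = waveform.shape[-1] / sample_rate
    if duration < min_seconds:
        raise ValueError(
            f"Reference is {duration:.2f}s; use at least "
            f"{min_seconds:.1f}s of clean audio for reliable cloning."
        )
    return duration
```

For example, a mono clip of 32,000 samples at 16 kHz passes with a duration of 2.0 s, while a half-second clip raises an error.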

Ethical Considerations

This technology can generate speech that sounds like real people. Users should:

  • Obtain consent before cloning someone's voice
  • Not use for deception, fraud, or impersonation
  • Clearly label AI-generated content
  • Follow applicable laws regarding synthetic media
