Echolancer Stage 3 Zero-Shot TTS

A decoder-only transformer text-to-speech model capable of zero-shot voice cloning from a single reference audio sample.

Model Description

Echolancer is a neural codec language model for text-to-speech synthesis. This Stage 3 checkpoint enables zero-shot voice cloning: synthesizing speech in a target voice from just a few seconds of reference audio.

Key Features

  • 🎯 Zero-Shot Voice Cloning: Clone any voice from a single audio sample
  • 🔊 High-Quality Audio: 24 kHz output via NeuCodec, optionally enhanced to 48 kHz with Brontes
  • ⚡ Efficient Inference: Decoder-only architecture with KV caching
  • 🎛️ Controllable Generation: Temperature and nucleus sampling parameters

Architecture

Component             Details
Architecture          Decoder-only Transformer
Positional Encoding   ALiBi (Attention with Linear Biases)
Audio Tokenizer       NeuCodec (65,536-token vocabulary)
Speaker Encoder       ECAPA-TDNN (192-dim embeddings)
Text Tokenizer        Character-level
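ALiBi replaces learned positional embeddings with a per-head linear penalty added to attention scores, which helps decoder-only models extrapolate to longer sequences. A minimal sketch of the bias matrix, assuming the geometric slope schedule from the ALiBi paper and a power-of-two head count:

```python
import torch

def alibi_bias(seq_len, num_heads):
    """Build the per-head ALiBi attention bias.

    Each head h adds slope_h * (key_pos - query_pos) to its attention
    scores, so keys further in the past are penalized more. Future
    positions get positive bias here; in a causal decoder the causal
    mask removes them anyway. Illustrative sketch only.
    """
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()  # rel[i, j] = j - i (<= 0 for past keys)
    return slopes[:, None, None] * rel[None, :, :]  # [heads, query, key]
```

The resulting tensor is added to the raw attention scores before the softmax.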

Intended Use

This model is designed for:

  • Voice cloning applications
  • Personalized TTS systems
  • Creative audio content generation
  • Accessibility tools

Usage

Installation

git clone https://github.com/ZDisket/Echolancer
cd Echolancer
pip install -r requirements.txt

Quick Start

import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
from echolancerfe import EcholancerFE
from neucodecfe import NeuCodecFE

# Load models
echolancer = EcholancerFE(model_config_path="config/model_stage3_zs.yaml")
echolancer.load_checkpoint("path/to/checkpoint.pt")

neu_codec = NeuCodecFE(is_cuda=True, offset=echolancer.get_vocab_offset())
speaker_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Extract speaker embedding from reference audio
def get_speaker_embedding(audio_path):
    signal, fs = torchaudio.load(audio_path)
    signal = signal.mean(dim=0, keepdim=True)  # mono
    if fs != 16000:
        signal = torchaudio.transforms.Resample(fs, 16000)(signal)
    return speaker_encoder.encode_batch(signal).squeeze(0)

# Generate speech
speaker_emb = get_speaker_embedding("reference.wav")

with torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16):
    codes = echolancer.infer(
        text="Hello, this is a test.",
        speaker_id=speaker_emb,
        temperature=0.8,
        top_p=0.92,
        max_length=1024
    )
    
    # Decode to audio
    waveform = neu_codec.decode_codes(codes.unsqueeze(1))
    torchaudio.save("output.wav", waveform[0].cpu(), 24000)

Training

This is a Stage 3 zero-shot checkpoint, trained to:

  1. Accept speaker embeddings from ECAPA-TDNN encoder
  2. Generate speaker-conditioned audio tokens
  3. Generalize to unseen speakers

Limitations

  • English only: Currently trained on English data
  • Reference quality: Output quality depends on reference audio clarity
  • Short references: May struggle with very short (<1s) reference clips
  • Prosody: Does not clone prosody/emotion, only voice timbre
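Given the short-reference limitation above, it may help to validate the reference clip's duration before extracting a speaker embedding. A small hypothetical guard (not part of the Echolancer API):

```python
import torch

def validate_reference(waveform, sample_rate, min_seconds=1.0):
    """Check a loaded reference clip against the short-reference limitation.

    Hypothetical helper: returns the duration in seconds, or raises if
    the clip is shorter than min_seconds.
    """
    duration = waveform.shape[-1] / sample_rate
    if duration < min_seconds:
        raise ValueError(
            f"Reference is {duration:.2f}s; use at least "
            f"{min_seconds:.1f}s of clean audio for reliable cloning."
        )
    return duration
```

For example, a mono clip of 32,000 samples at 16 kHz passes with a duration of 2.0 s, while a half-second clip raises an error.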

Ethical Considerations

This technology can generate speech that sounds like real people. Users should:

  • Obtain consent before cloning someone's voice
  • Not use for deception, fraud, or impersonation
  • Clearly label AI-generated content
  • Follow applicable laws regarding synthetic media
