# SoviaMate-Codec
Pretrained weights for SoviaMate-Codec, a neural audio codec designed from the ground up for integration with speech-aware large language models.
SoviaMate-Codec is the first released component of SoviaMate, an open research effort building toward end-to-end spoken dialogue systems.
> 🚧 **Status:** alpha research release. APIs are not stable; evaluation numbers are preliminary.
## What's in this repository

```text
samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt    # reconstruction codec
│   └── audio_codec_spk.ckpt     # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin             # CAM++ speaker verifier
    ├── eres2netv2.ckpt          # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth          # WavLM + ECAPA-TDNN speaker verifier
```
| Asset | Purpose |
|---|---|
| `neural_audio_codec/audio_codec_base.ckpt` | Reconstruction codec. Encoder + quantizer + decoder, trained as a standard compress/reconstruct codec without the speaker-adaptation objective. Use it for low-bitrate speech coding and feature extraction. (No ASR head.) |
| `neural_audio_codec/audio_codec_spk.ckpt` | Voice-conversion codec. Adds the integrated ASR head and the post-quantization speaker adapter, trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter; whichever was used at training is also required at inference for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |
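For reference, SECS-style scoring reduces to a cosine similarity between speaker embeddings of the converted audio and the reference. A minimal sketch, where `embed` stands in for whatever callable wraps `wavlm_ecapa.pth` (its loading API is not part of this release):

```python
import torch
import torch.nn.functional as F

def secs(embed, converted: torch.Tensor, reference: torch.Tensor) -> float:
    """SECS-style speaker similarity between converted audio and its reference."""
    e_conv = embed(converted)    # (B, D) embedding of the converted waveform
    e_ref = embed(reference)     # (B, D) embedding of the reference prompt
    return F.cosine_similarity(e_conv, e_ref, dim=-1).mean().item()
```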
Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`s) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
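To sanity-check an export before wiring it up, a quick sketch (assuming the `.ckpt` files are standard `torch` pickles, which the key layout suggests):

```python
import torch

ckpt = torch.load(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    map_location="cpu",  # pass weights_only=False only if needed and you trust the file
)
print(sorted(ckpt))  # expected: ['hyper_parameters', 'model_weights']
for module, state in ckpt["model_weights"].items():
    n_params = sum(t.numel() for t in state.values())
    print(f"{module}: {n_params:,} parameters")
```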
## Architecture at a glance
Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:
- **ASR decoder before quantization** (`spk` checkpoint only). A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
- **Continuous features for LLM input.** Discrete tokens are used only for compression/transmission. The downstream LLM consumes the pre-quantization continuous features, avoiding quantization loss in the LLM input path.
- **Speech enhancement as a training paradigm.** The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
- **Post-quantization speaker adapter** (`spk` checkpoint only). A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference (see the sketch after this list).
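To make the last point concrete, a hypothetical PyTorch sketch of such an adapter; module names, dimensions, and wiring are illustrative, not the released architecture:

```python
import torch
import torch.nn as nn

class SpeakerAdapter(nn.Module):
    """Hypothetical hybrid AdaLN + cross-attention speaker adapter (sketch)."""

    def __init__(self, dim: int, spk_dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(spk_dim, 2 * dim)  # AdaLN: embedding -> (scale, shift)
        self.spk_proj = nn.Linear(spk_dim, dim)            # speaker embedding as attention memory
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) post-quantization features; spk: (B, spk_dim) speaker embedding
        scale, shift = self.to_scale_shift(spk).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # AdaLN
        mem = self.spk_proj(spk).unsqueeze(1)              # (B, 1, dim) speaker memory
        out, _ = self.attn(h, mem, mem)                    # cross-attend to the speaker
        return x + out                                     # residual injection
```

In this reading, AdaLN supplies global voice statistics (one scale/shift per utterance) while cross-attention lets each frame pull finer-grained cues from the prompt.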
Full architecture write-up: see the SoviaMate repository. A technical report is in preparation.
## Load in Python
Download just what you need:
```bash
# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_base.ckpt" \
  --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_spk.ckpt" \
  --include "speaker_verification/campplus.bin" \
  --local-dir checkpoints
```
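The same download from Python, via `huggingface_hub` (the library behind the `hf` CLI):

```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="samson-ailabs/SoviaMate-Codec",
    filename="neural_audio_codec/audio_codec_base.ckpt",
    local_dir="checkpoints",
)
```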
Then, after installing SoviaMate (see Getting started), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are not interchangeable.
### Reconstruction: use the `base` checkpoint

```python
from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
```
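These snippets assume `source_audio` (and, in the next section, `target_speaker_audio`) are already waveform tensors. One way to get them, assuming `torchaudio` and files that match the codec's expected sample rate and channel layout (neither is documented in this card; check the repository):

```python
import torchaudio

source_audio, sample_rate = torchaudio.load("source.wav")      # (channels, samples) float32
target_speaker_audio, _ = torchaudio.load("reference_3s.wav")  # 3-5 s speaker prompt
```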
### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint

```python
voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)
```
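Assuming the converted output is a waveform tensor at the input sample rate (the codec's native rate is not documented in this card), saving it is symmetric:

```python
torchaudio.save("converted.wav", converted.cpu(), sample_rate)
```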
> ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.
## Streaming (low-latency inference)
Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.
```python
# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
```
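The loops above assume `audio_chunks` yields fixed-size waveform slices. A hypothetical helper (the samples-per-chunk implied by `chunk_size=8` is not documented here; 1280 is a placeholder):

```python
def iter_chunks(waveform, samples_per_chunk=1280):
    """Yield fixed-size slices along the time axis for streaming inference."""
    for start in range(0, waveform.shape[-1], samples_per_chunk):
        yield waveform[..., start:start + samples_per_chunk]

audio_chunks = iter_chunks(source_audio)
```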
See `soviamate/bundles/codec.py` for the full API.
## Training data

The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the project repository.
## Intended use
- Research on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
- Educational exploration of ASR-constrained codec training and zero-shot speaker adaptation.
- Engineering experimentation as a building block for downstream speech-to-speech systems.
## Out-of-scope / responsible-use note
The post-quantization speaker adapter supports zero-shot voice cloning from a few seconds of reference audio. These weights must not be used for:
- impersonation, fraud, or any form of non-consensual voice synthesis;
- producing audio attributed to a real person without their explicit, informed consent;
- deceptive, harassing, or otherwise harmful generation.
Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.
## Limitations
- English-only training data; performance on other languages is untested.
- Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
## License

Apache License 2.0; see `LICENSE`.

The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of the individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.
## Citation
A technical report is in preparation. For now, please cite:
```bibtex
@misc{soviamate2026,
  author       = {Son Dang Dinh (Samson)},
  title        = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
  year         = {2026},
  howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
```
## Contact
For research collaboration, dataset partnerships, or compute grants: samson.ailabs@gmail.com (subject line: SoviaMate collaboration). For code-level discussion, open an issue or discussion on the GitHub repository.