# SoviaMate-Codec
Pretrained weights for SoviaMate-Codec, a neural audio codec designed from the ground up for integration with speech-aware large language models.
SoviaMate-Codec is the first released component of SoviaMate, an open research effort building toward end-to-end spoken dialogue systems.
> 🚧 **Status:** alpha research release. APIs are not stable; evaluation numbers are preliminary.
## What's in this repository

```text
samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt    # reconstruction codec
│   └── audio_codec_spk.ckpt     # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin             # CAM++ speaker verifier
    ├── eres2netv2.ckpt          # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth          # WavLM + ECAPA-TDNN speaker verifier
```
| Asset | Purpose |
|---|---|
| `neural_audio_codec/audio_codec_base.ckpt` | Reconstruction codec. Encoder + quantizer + decoder, trained as a standard compress/reconstruct codec without the speaker-adaptation objective. Use it for low-bitrate speech coding and feature extraction. (No ASR head.) |
| `neural_audio_codec/audio_codec_spk.ckpt` | Voice-conversion codec. Adds the integrated ASR head and the post-quantization speaker adapter, trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter; whichever was used at training is also required at inference for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |
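For reference, SECS-style scoring reduces to a cosine similarity between speaker embeddings of the converted audio and the reference. A minimal sketch, where `embed` stands in for whatever callable wraps `wavlm_ecapa.pth` (its loading API is not part of this release):

```python
import torch
import torch.nn.functional as F

def secs(embed, converted: torch.Tensor, reference: torch.Tensor) -> float:
    """SECS-style speaker similarity between converted audio and its reference."""
    e_conv = embed(converted)    # (B, D) embedding of the converted waveform
    e_ref = embed(reference)     # (B, D) embedding of the reference prompt
    return F.cosine_similarity(e_conv, e_ref, dim=-1).mean().item()
```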
Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`s) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
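To sanity-check an export before wiring it up, a quick sketch (assuming the `.ckpt` files are standard `torch` pickles, which the key layout suggests):

```python
import torch

ckpt = torch.load(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    map_location="cpu",  # pass weights_only=False only if needed and you trust the file
)
print(sorted(ckpt))  # expected: ['hyper_parameters', 'model_weights']
for module, state in ckpt["model_weights"].items():
    n_params = sum(t.numel() for t in state.values())
    print(f"{module}: {n_params:,} parameters")
```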
## Architecture at a glance
Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:
- **ASR decoder before quantization** (`spk` checkpoint only). A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
- **Continuous features for LLM input.** Discrete tokens are used only for compression/transmission. The downstream LLM consumes the pre-quantization continuous features, avoiding quantization loss in the LLM input path.
- **Speech enhancement as a training paradigm.** The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
- **Post-quantization speaker adapter** (`spk` checkpoint only). A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference (see the sketch after this list).
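To make the last point concrete, a hypothetical PyTorch sketch of such an adapter; module names, dimensions, and wiring are illustrative, not the released architecture:

```python
import torch
import torch.nn as nn

class SpeakerAdapter(nn.Module):
    """Hypothetical hybrid AdaLN + cross-attention speaker adapter (sketch)."""

    def __init__(self, dim: int, spk_dim: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(spk_dim, 2 * dim)  # AdaLN: embedding -> (scale, shift)
        self.spk_proj = nn.Linear(spk_dim, dim)            # speaker embedding as attention memory
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) post-quantization features; spk: (B, spk_dim) speaker embedding
        scale, shift = self.to_scale_shift(spk).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # AdaLN
        mem = self.spk_proj(spk).unsqueeze(1)              # (B, 1, dim) speaker memory
        out, _ = self.attn(h, mem, mem)                    # cross-attend to the speaker
        return x + out                                     # residual injection
```

In this reading, AdaLN supplies global voice statistics (one scale/shift per utterance) while cross-attention lets each frame pull finer-grained cues from the prompt.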
Full architecture write-up: see the SoviaMate repository. A technical report is in preparation.
## Load in Python
Download just what you need:
```bash
# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_base.ckpt" \
  --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_spk.ckpt" \
  --include "speaker_verification/campplus.bin" \
  --local-dir checkpoints
```
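The same download from Python, via `huggingface_hub` (the library behind the `hf` CLI):

```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="samson-ailabs/SoviaMate-Codec",
    filename="neural_audio_codec/audio_codec_base.ckpt",
    local_dir="checkpoints",
)
```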
Then, after installing SoviaMate (see Getting started), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are not interchangeable.
### Reconstruction: use the `base` checkpoint

```python
from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
```
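These snippets assume `source_audio` (and, in the next section, `target_speaker_audio`) are already waveform tensors. One way to get them, assuming `torchaudio` and files that match the codec's expected sample rate and channel layout (neither is documented in this card; check the repository):

```python
import torchaudio

source_audio, sample_rate = torchaudio.load("source.wav")      # (channels, samples) float32
target_speaker_audio, _ = torchaudio.load("reference_3s.wav")  # 3-5 s speaker prompt
```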
### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint

```python
voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)
```
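Assuming the converted output is a waveform tensor at the input sample rate (the codec's native rate is not documented in this card), saving it is symmetric:

```python
torchaudio.save("converted.wav", converted.cpu(), sample_rate)
```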
> ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.
## Streaming (low-latency inference)
Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.
```python
# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
```
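The loops above assume `audio_chunks` yields fixed-size waveform slices. A hypothetical helper (the samples-per-chunk implied by `chunk_size=8` is not documented here; 1280 is a placeholder):

```python
def iter_chunks(waveform, samples_per_chunk=1280):
    """Yield fixed-size slices along the time axis for streaming inference."""
    for start in range(0, waveform.shape[-1], samples_per_chunk):
        yield waveform[..., start:start + samples_per_chunk]

audio_chunks = iter_chunks(source_audio)
```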
See `soviamate/bundles/codec.py` for the full API.
## Training data

The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the project repository.
## Intended use
- Research on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
- Educational exploration of ASR-constrained codec training and zero-shot speaker adaptation.
- Engineering experimentation as a building block for downstream speech-to-speech systems.
## Out-of-scope / responsible-use note
The post-quantization speaker adapter supports zero-shot voice cloning from a few seconds of reference audio. These weights must not be used for:
- impersonation, fraud, or any form of non-consensual voice synthesis;
- producing audio attributed to a real person without their explicit, informed consent;
- deceptive, harassing, or otherwise harmful generation.
Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.
## Limitations
- English-only training data; performance on other languages is untested.
- Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
## License

Apache License 2.0; see `LICENSE`.

The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of the individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.
## Citation
A technical report is in preparation. For now, please cite:
```bibtex
@misc{soviamate2026,
  author       = {Son Dang Dinh (Samson)},
  title        = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
  year         = {2026},
  howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
```
## Contact
For research collaboration, dataset partnerships, or compute grants: samson.ailabs@gmail.com (subject line: SoviaMate collaboration). For code-level discussion, open an issue or discussion on the GitHub repository.