SoviaMate-Codec

Pretrained weights for SoviaMate-Codec, a neural audio codec designed from the ground up for integration with speech-aware large language models.

SoviaMate-Codec is the first released component of SoviaMate, an open research effort building toward end-to-end spoken dialogue systems.

🚧 Status: alpha research release. APIs are not stable; evaluation numbers are preliminary.

What's in this repository

samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt   # reconstruction codec
│   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin            # CAM++ speaker verifier
    ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
  • neural_audio_codec/audio_codec_base.ckpt: Reconstruction codec. Encoder + quantizer + decoder, trained as a standard compress/reconstruct codec without the speaker-adaptation objective. Use it for low-bitrate speech coding and feature extraction. (No ASR head.)
  • neural_audio_codec/audio_codec_spk.ckpt: Voice-conversion codec. Adds the integrated ASR head and the post-quantization speaker adapter, trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use the base checkpoint for plain reconstruction.
  • speaker_verification/*: Pretrained speaker-embedding extractors. campplus.bin and eres2netv2.ckpt are interchangeable backbones for the speaker adapter; whichever was used at training time is also required at inference for that spk checkpoint (this release uses campplus.bin). wavlm_ecapa.pth is for evaluation only (e.g., SECS-style speaker-similarity scoring).
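
SECS-style scoring reduces to cosine similarity between the speaker embeddings of two utterances. A minimal sketch; the embedding-extraction step is left abstract because a loader for wavlm_ecapa.pth is not shown in this card:

import torch
import torch.nn.functional as F

def secs(embedding_a: torch.Tensor, embedding_b: torch.Tensor) -> float:
    # Speaker-embedding cosine similarity (SECS); higher means more similar.
    return F.cosine_similarity(
        embedding_a.flatten(), embedding_b.flatten(), dim=0
    ).item()

# embedding_a / embedding_b would come from the WavLM + ECAPA-TDNN
# extractor loaded from wavlm_ecapa.pth (loading code not shown here).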

Each codec checkpoint is a portable export containing model_weights (per-module state_dict) and hyper_parameters (architecture config), produced by AudioCodecTask.export_model(). Optimizer state, discriminators, and other training-only components are excluded.
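
Assuming the export is a torch-loadable dictionary with the two top-level keys named above (exact nesting may differ), its contents can be inspected without instantiating the model:

import torch

# weights_only=False because hyper_parameters may hold non-tensor objects.
ckpt = torch.load(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    map_location="cpu",
    weights_only=False,
)
print(ckpt["hyper_parameters"])              # architecture config
print(sorted(ckpt["model_weights"].keys()))  # per-module state_dicts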

Architecture at a glance

Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:

  1. ASR decoder before quantization (spk checkpoint only). A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
  2. Continuous features for LLM input. Discrete tokens are used only for compression and transmission. The downstream LLM consumes the pre-quantization continuous features, avoiding quantization loss in the LLM input path.
  3. Speech enhancement as a training paradigm. The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
  4. Post-quantization speaker adapter (spk checkpoint only). A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference; a minimal illustrative sketch follows this list.
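
To make the adapter design in point 4 concrete, here is a minimal, hypothetical sketch of a hybrid AdaLN + cross-attention block. The class name, dimensions, and wiring are illustrative assumptions, not the released architecture:

import torch
import torch.nn as nn

class SpeakerAdapterSketch(nn.Module):
    # Hypothetical adapter: AdaLN conditions per-channel statistics on the
    # speaker embedding; cross-attention lets content frames query speaker
    # tokens; the residual connection preserves the linguistic content.
    def __init__(self, dim: int, spk_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(spk_dim, 2 * dim)  # AdaLN parameters
        self.spk_proj = nn.Linear(spk_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, content: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # content: (batch, frames, dim); spk: (batch, spk_dim)
        scale, shift = self.to_scale_shift(spk).chunk(2, dim=-1)
        h = self.norm(content) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        spk_tokens = self.spk_proj(spk).unsqueeze(1)  # (batch, 1, dim)
        attended, _ = self.attn(h, spk_tokens, spk_tokens)
        return content + attended  # "what is said" flows through the residual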

Full architecture write-up: see the SoviaMate repository. A technical report is in preparation.

Load in Python

Download just what you need:

# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
    --include "neural_audio_codec/audio_codec_base.ckpt" \
    --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
    --include "neural_audio_codec/audio_codec_spk.ckpt" \
    --include "speaker_verification/campplus.bin" \
    --local-dir checkpoints
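
The same files can also be fetched programmatically with the huggingface_hub library:

from huggingface_hub import hf_hub_download

# Download the base codec checkpoint into ./checkpoints.
ckpt_path = hf_hub_download(
    repo_id="samson-ailabs/SoviaMate-Codec",
    filename="neural_audio_codec/audio_codec_base.ckpt",
    local_dir="checkpoints",
)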

Then, after installing SoviaMate (see Getting started), load a checkpoint into an AudioCodecBundle. Pick the checkpoint that matches the task; they are not interchangeable.

Reconstruction: use the base checkpoint

from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
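
Here and in the voice-conversion example below, source_audio is assumed to be a waveform tensor. A minimal loading sketch with torchaudio; the 16 kHz target rate is an assumption, so match whatever sample rate the codec config specifies:

import torchaudio

TARGET_SR = 16_000  # assumed; use the codec's configured sample rate

waveform, sr = torchaudio.load("input.wav")  # (channels, samples)
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
source_audio = waveform.mean(dim=0, keepdim=True)  # downmix to mono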

Voice conversion (+ optional ASR transcript): use the spk checkpoint

voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)

⚠️ Do not call the spk bundle without prompt_audios: the speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.

Streaming (low-latency inference)

Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.

# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)

See soviamate/bundles/codec.py for the full API.
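
For completeness, a generic way to produce audio_chunks for the loops above. How chunk_size maps to samples is codec-specific and not documented here, so the 1280-sample window below is purely an assumption:

def iter_chunks(waveform, samples_per_chunk=1280):
    # Slice the last (time) axis into fixed-length windows;
    # the final chunk may be shorter.
    for start in range(0, waveform.shape[-1], samples_per_chunk):
        yield waveform[..., start:start + samples_per_chunk]

audio_chunks = iter_chunks(source_audio)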

Training data

The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the project repository.

Intended use

  • Research on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
  • Educational exploration of ASR-constrained codec training and zero-shot speaker adaptation.
  • Engineering experimentation as a building block for downstream speech-to-speech systems.

Out-of-scope / responsible-use note

The post-quantization speaker adapter supports zero-shot voice cloning from a few seconds of reference audio. These weights must not be used for:

  • impersonation, fraud, or any form of non-consensual voice synthesis;
  • producing audio attributed to a real person without their explicit, informed consent;
  • deceptive, harassing, or otherwise harmful generation.

Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.

Limitations

  • English-only training data; performance on other languages is untested.
  • Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
  • Streaming inference is implemented (init_stream / stream_chunk) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.

License

Apache License 2.0; see LICENSE.

The speaker-verification weights under speaker_verification/ are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.

Citation

A technical report is in preparation. For now, please cite:

@misc{soviamate2026,
  author       = {Son Dang Dinh (Samson)},
  title        = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
  year         = {2026},
  howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}

Contact

For research collaboration, dataset partnerships, or compute grants: samson.ailabs@gmail.com (subject line: SoviaMate collaboration). For code-level discussion, open an issue or discussion on the GitHub repository.
