---
license: apache-2.0
language:
- en
tags:
- audio
- speech
- neural-audio-codec
- speech-codec
- speech-llm
- speech-to-speech
- zero-shot-voice-cloning
- speech-enhancement
- asr
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
---
# SoviaMate-Codec
Pretrained weights for **SoviaMate-Codec**, a neural audio codec designed from the ground up for integration with speech-aware large language models.
SoviaMate-Codec is the first released component of [**SoviaMate**](https://github.com/samson-ailabs/SoviaMate), an open research effort building toward end-to-end spoken dialogue systems.
> 🚧 **Status**: alpha research release. APIs are not stable; evaluation numbers are preliminary.
## What's in this repository
```
samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt   # reconstruction codec
│   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin            # CAM++ speaker verifier
    ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
```
| Asset | Purpose |
|---|---|
| `neural_audio_codec/audio_codec_base.ckpt` | **Reconstruction codec.** Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) |
| `neural_audio_codec/audio_codec_spk.ckpt` | **Voice-conversion codec.** Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt; running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter, but whichever was used at training is also required at inference time for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |
Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
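To make the export layout concrete, here is a minimal sketch that summarizes a checkpoint dict with the two top-level keys described above. The function name `summarize_export`, the module names, and the `sample_rate` entry are hypothetical stand-ins; only `model_weights` and `hyper_parameters` come from this README. Reading a real file would presumably go through `torch.load(path, map_location="cpu")`, which is an assumption and not something this README confirms.

```python
from typing import Any, Mapping


def summarize_export(ckpt: Mapping[str, Any]) -> dict:
    """Summarize a dict that has 'model_weights' and 'hyper_parameters' keys."""
    missing = {"model_weights", "hyper_parameters"} - set(ckpt)
    if missing:
        raise KeyError(f"not a portable export, missing keys: {sorted(missing)}")
    # model_weights maps a module name to that module's state_dict.
    modules = {name: len(state) for name, state in ckpt["model_weights"].items()}
    return {
        "modules": modules,
        "hyper_parameter_keys": sorted(ckpt["hyper_parameters"]),
    }


# Stand-in data with made-up names and values (not the real checkpoint contents).
fake_ckpt = {
    "model_weights": {
        "encoder": {"conv.weight": 0, "conv.bias": 0},
        "decoder": {"out.weight": 0},
    },
    "hyper_parameters": {"sample_rate": 16000},
}
summary = summarize_export(fake_ckpt)
print(summary)
```

A check like this is a quick way to confirm that a downloaded file is an export rather than a full training checkpoint before handing it to `AudioCodecBundle`.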
## Architecture at a glance
Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:
1. **ASR decoder *before* quantization** *(spk checkpoint only)*. A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
2. **Continuous features for LLM input.** Discrete tokens are used only for compression/transmission. The downstream LLM consumes the *pre-quantization* continuous features, avoiding quantization loss in the LLM input path.
3. **Speech enhancement as a training paradigm.** The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
4. **Post-quantization speaker adapter** *(spk checkpoint only)*. A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference.
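For readers unfamiliar with AdaLN (item 4), the sketch below shows one common generic formulation: normalize the features, then apply a scale and shift predicted from a conditioning vector, here a speaker embedding. It is a pure-Python illustration on a single feature vector, not SoviaMate's implementation. The function names, the `(1 + scale)` convention, and the weight shapes are all assumptions.

```python
import math
from typing import Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> list:
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight: Sequence[Sequence[float]], vec: Sequence[float]) -> list:
    """Matrix-vector product; each row of `weight` yields one output value."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def ada_ln(x, speaker_emb, w_scale, w_shift) -> list:
    """Adaptive LayerNorm: scale and shift are predicted from the speaker embedding."""
    scale = linear(w_scale, speaker_emb)
    shift = linear(w_shift, speaker_emb)
    normed = layer_norm(x)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]


features = [1.0, 2.0, 3.0]        # one frame of content features (made-up values)
speaker = [1.0]                   # toy 1-dimensional speaker embedding
zero = [[0.0], [0.0], [0.0]]      # zero weights: AdaLN reduces to plain LayerNorm
ones = [[1.0], [1.0], [1.0]]      # shift weights that add 1.0 to every feature
plain = ada_ln(features, speaker, zero, zero)
shifted = ada_ln(features, speaker, zero, ones)
```

The content features pass through unchanged in structure and only their statistics are modulated by the speaker vector, which is why this kind of layer suits separating content from voice identity.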
Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation.
## Load in Python
Download just what you need:
```bash
# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_base.ckpt" \
  --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_spk.ckpt" \
  --include "speaker_verification/campplus.bin" \
  --local-dir checkpoints
```
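The same downloads can be scripted from Python. The sketch below uses `hf_hub_download` from the `huggingface_hub` package; the task names and the helper `download_for_task` are hypothetical, while the file paths are the ones listed above. The import is deferred into the function so the file list can be inspected without the package installed.

```python
REPO_ID = "samson-ailabs/SoviaMate-Codec"

# Files needed per task, taken from the repository layout in this README.
FILES_BY_TASK = {
    "reconstruction": ["neural_audio_codec/audio_codec_base.ckpt"],
    "voice_conversion": [
        "neural_audio_codec/audio_codec_spk.ckpt",
        "speaker_verification/campplus.bin",
    ],
}


def download_for_task(task: str, local_dir: str = "checkpoints") -> list:
    """Download every file for `task` and return the local file paths."""
    # Lazy import: requires `pip install huggingface_hub` and network access.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in FILES_BY_TASK[task]
    ]


# Example (needs network access):
# paths = download_for_task("voice_conversion")
```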
Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are **not** interchangeable.
### Reconstruction: use the `base` checkpoint
```python
from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
```
### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint
```python
from soviamate.bundles import AudioCodecBundle

voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)
```
> ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.
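One way to enforce this rule in application code is a thin guard around the call. The wrapper below is a hypothetical helper, not part of SoviaMate. It relies only on the call signature shown above (`bundle(source_audio, prompt_audios=..., return_text=True)`) and works with any callable that accepts those arguments.

```python
def convert_voice(bundle, source_audio, prompt_audio, return_text=False):
    """Call an `spk` bundle, refusing to run without a speaker prompt."""
    if prompt_audio is None:
        raise ValueError(
            "the spk checkpoint needs a speaker prompt; "
            "use the base checkpoint for plain reconstruction"
        )
    kwargs = {"prompt_audios": prompt_audio}
    if return_text:
        # Only forward return_text when a transcript is requested.
        kwargs["return_text"] = True
    return bundle(source_audio, **kwargs)


# Usage with a real bundle (names as in the example above):
# converted, transcript = convert_voice(
#     voice_converter, source_audio, target_speaker_audio, return_text=True
# )
```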
### Streaming (low-latency inference)
Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.
```python
# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
```
See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API.
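The streaming loops above discard each chunk's output after the iteration. A small driver that collects the results might look like the sketch below. `run_stream` is a hypothetical helper built only on the `init_stream` / `stream_chunk` calls shown in this README. Whether `stream_chunk` returns `None` or an empty string when there is no new text is an assumption, so the helper skips both.

```python
def run_stream(bundle, audio_chunks, **init_kwargs):
    """Stream chunks through a bundle and collect waveform and text outputs."""
    state = bundle.init_stream(**init_kwargs)
    waveforms, texts = [], []
    for chunk in audio_chunks:
        waveform_chunk, text_chunk, state = bundle.stream_chunk(chunk, state)
        waveforms.append(waveform_chunk)
        if text_chunk:  # skip None or empty incremental transcripts
            texts.append(text_chunk)
    return waveforms, texts


# Usage with a real bundle (names as in the example above):
# waveforms, texts = run_stream(
#     voice_converter, audio_chunks,
#     chunk_size=8, prompt_audio=target_speaker_audio, return_text=True,
# )
```

How the collected waveform chunks should be joined (for example, plain concatenation versus overlap handling) depends on the bundle's output format, which this README does not specify.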
## Training data
The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate).
## Intended use
- **Research** on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
- **Educational** exploration of ASR-constrained codec training and zero-shot speaker adaptation.
- **Engineering experimentation** as a building block for downstream speech-to-speech systems.
## Out-of-scope / responsible-use note
The post-quantization speaker adapter supports **zero-shot voice cloning** from a few seconds of reference audio. These weights **must not** be used for:
- impersonation, fraud, or any form of non-consensual voice synthesis;
- producing audio attributed to a real person without their explicit, informed consent;
- deceptive, harassing, or otherwise harmful generation.
Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.
## Limitations
- English-only training data; performance on other languages is untested.
- Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
## License
Apache License 2.0. See [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE).
The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.
## Citation
A technical report is in preparation. For now, please cite:
```bibtex
@misc{soviamate2026,
author = {Son Dang Dinh (Samson)},
title = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
year = {2026},
howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
```
## Contact
For research collaboration, dataset partnerships, or compute grants: **samson.ailabs@gmail.com** (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues).