---
license: apache-2.0
language:
- en
tags:
- audio
- speech
- neural-audio-codec
- speech-codec
- speech-llm
- speech-to-speech
- zero-shot-voice-cloning
- speech-enhancement
- asr
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
---
# SoviaMate-Codec
Pretrained weights for **SoviaMate-Codec**, a neural audio codec designed from the ground up for integration with speech-aware large language models.
SoviaMate-Codec is the first released component of [**SoviaMate**](https://github.com/samson-ailabs/SoviaMate), an open research effort building toward end-to-end spoken dialogue systems.
> 🚧 **Status**: alpha research release. APIs are not stable; evaluation numbers are preliminary.
## What's in this repository
```
samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt   # reconstruction codec
│   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin            # CAM++ speaker verifier
    ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
```
| Asset | Purpose |
|---|---|
| `neural_audio_codec/audio_codec_base.ckpt` | **Reconstruction codec.** Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) |
| `neural_audio_codec/audio_codec_spk.ckpt` | **Voice-conversion codec.** Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter, but whichever one was used at training time is also required at inference time for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |
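
SECS-style scoring is usually reported as the cosine similarity between speaker embeddings of the reference and the generated audio. A minimal sketch, assuming the embeddings have already been extracted (for example with `wavlm_ecapa.pth`) and are available as plain sequences of floats; `cosine_similarity` is an illustrative helper, not part of SoviaMate:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
reference_embedding = [0.2, 0.9, -0.4]
converted_embedding = [0.25, 0.8, -0.35]
print(f"similarity: {cosine_similarity(reference_embedding, converted_embedding):.3f}")
```

Values close to 1 indicate that the two utterances sit near each other in the verifier's embedding space; how that maps to perceived speaker similarity depends on the verifier used.
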
Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
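
To see what an export contains before building a bundle, it can help to summarize its two top-level entries. The sketch below assumes that "per-module `state_dict`" means `model_weights` maps module names to their own state dicts; `summarize_export` and the module names in the stand-in dictionary are hypothetical:

```python
from typing import Any, Dict, Mapping


def summarize_export(ckpt: Mapping[str, Any]) -> Dict[str, int]:
    """Map each module name to the number of entries in its state_dict."""
    missing = {"model_weights", "hyper_parameters"} - set(ckpt)
    if missing:
        raise KeyError(f"missing top-level keys: {sorted(missing)}")
    return {name: len(state) for name, state in ckpt["model_weights"].items()}


# Stand-in for a loaded export; with PyTorch installed you would instead use
#   ckpt = torch.load("checkpoints/neural_audio_codec/audio_codec_base.ckpt",
#                     map_location="cpu")
# The module and parameter names below are invented for illustration.
example = {
    "model_weights": {
        "encoder": {"conv.weight": None, "conv.bias": None},
        "decoder": {"out.weight": None},
    },
    "hyper_parameters": {},
}
print(summarize_export(example))
```

Recent PyTorch versions default `torch.load` to `weights_only=True`, which may reject non-tensor objects stored under `hyper_parameters`; relax that setting only for files you trust.
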
## Architecture at a glance
Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:
1. **ASR decoder *before* quantization** *(spk checkpoint only)*. A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
2. **Continuous features for LLM input.** Discrete tokens are used only for compression/transmission. The downstream LLM consumes the *pre-quantization* continuous features, avoiding quantization loss in the LLM input path.
3. **Speech enhancement as a training paradigm.** The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
4. **Post-quantization speaker adapter** *(spk checkpoint only)*. A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference.
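
To make design choice 4 more concrete, here is a toy sketch of the AdaLN half of such an adapter: a content frame is layer-normalized, then scaled and shifted by values projected from a speaker embedding. This is a generic illustration of adaptive layer normalization in pure Python, not the SoviaMate implementation; the function names, the `1 + scale` parameterization, and the toy dimensions are all assumptions, and the cross-attention path is omitted.

```python
import math
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """y = W x + b for a single vector (one weight row per output value)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def ada_layer_norm(frame: Sequence[float], scale: Sequence[float],
                   shift: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize one frame, then apply a speaker-dependent scale and shift."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * (1.0 + g) + b
            for v, g, b in zip(frame, scale, shift)]


# Toy sizes: a 4-dim content frame and a 2-dim speaker embedding.
frame = [0.5, -1.0, 2.0, 0.0]
speaker_embedding = [1.0, -0.5]
scale_weight = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.0], [0.0, 0.2]]
shift_weight = [[0.0, 0.3], [0.3, 0.0], [0.0, 0.1], [0.1, 0.0]]
zero_bias = [0.0] * 4

scale = linear(speaker_embedding, scale_weight, zero_bias)
shift = linear(speaker_embedding, shift_weight, zero_bias)
print(ada_layer_norm(frame, scale, shift))
```

Because the speaker embedding only enters through `scale` and `shift`, swapping the embedding changes the modulation while the normalized content frame stays the same, which is the intuition behind separating "what is said" from "who says it".
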
Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation.
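
Design choice 1 relies on word error rate (WER) as the measure of semantic fidelity. For reference, a minimal WER sketch based on word-level edit distance; the `word_error_rate` helper and its lowercase-and-split tokenization are illustrative assumptions, not the project's evaluation code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One word ("the") is missing from a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluations usually normalize punctuation and numerals before scoring, so numbers from this sketch are not directly comparable to published WER figures.
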
## Load in Python
Download just what you need:
```bash
# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_base.ckpt" \
  --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
  --include "neural_audio_codec/audio_codec_spk.ckpt" \
  --include "speaker_verification/campplus.bin" \
  --local-dir checkpoints
```
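
The same selective download can be done from Python with `huggingface_hub`. A sketch, assuming `huggingface_hub` is installed; the `FILES_BY_TASK` mapping and the `download_for_task` helper are illustrative names, while the repository ID and file paths are the ones listed above:

```python
from typing import Dict, List

REPO_ID = "samson-ailabs/SoviaMate-Codec"

# Files each task needs, mirroring the two CLI commands.
FILES_BY_TASK: Dict[str, List[str]] = {
    "reconstruction": ["neural_audio_codec/audio_codec_base.ckpt"],
    "voice_conversion": [
        "neural_audio_codec/audio_codec_spk.ckpt",
        "speaker_verification/campplus.bin",
    ],
}


def download_for_task(task: str, local_dir: str = "checkpoints") -> List[str]:
    """Download only the files one task needs and return their local paths."""
    if task not in FILES_BY_TASK:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(FILES_BY_TASK)}")
    # Imported lazily so the mapping above is usable without the dependency.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in FILES_BY_TASK[task]
    ]


# Example call (performs network access):
#   paths = download_for_task("voice_conversion")
```
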
Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are **not** interchangeable.
### Reconstruction: use the `base` checkpoint
```python
from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
```
### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint
```python
from soviamate.bundles import AudioCodecBundle

voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)
```
> ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.
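
One way to catch a missing or oddly sized prompt early is to validate it before calling the bundle. A small sketch; `prompt_duration_ok` is a hypothetical helper, the sample rate must come from your own audio loader, and the 3–5 s window is taken from the description above and treated here as a recommendation, not a documented hard limit:

```python
from typing import Optional, Sequence


def prompt_duration_ok(
    prompt: Optional[Sequence[float]],
    sample_rate: int,
    min_seconds: float = 3.0,
    max_seconds: float = 5.0,
) -> bool:
    """Raise if the prompt is missing; report whether its length is in range."""
    if prompt is None or len(prompt) == 0:
        raise ValueError(
            "the spk checkpoint needs a speaker prompt; "
            "use the base checkpoint for plain reconstruction"
        )
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = len(prompt) / sample_rate
    return min_seconds <= duration <= max_seconds


# Four seconds of silence at a placeholder 16 kHz rate
# (the model's actual sample rate is not stated in this card).
print(prompt_duration_ok([0.0] * 64000, sample_rate=16000))
```
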
### Streaming (low-latency inference)
Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.
```python
# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
```
See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API.
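
The streaming snippets assume an `audio_chunks` iterable. A minimal sketch of producing one from an in-memory waveform; `iter_chunks` and `samples_per_chunk` are hypothetical, and how `chunk_size=8` maps to a number of samples is not stated here, so check `codec.py` before picking a value:

```python
from typing import Iterator, Sequence


def iter_chunks(samples: Sequence[float],
                samples_per_chunk: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of the waveform; the last one may be shorter."""
    if samples_per_chunk <= 0:
        raise ValueError("samples_per_chunk must be positive")
    for start in range(0, len(samples), samples_per_chunk):
        yield samples[start:start + samples_per_chunk]


waveform = [0.0] * 10  # stand-in for real audio samples
audio_chunks = list(iter_chunks(waveform, samples_per_chunk=4))
print([len(chunk) for chunk in audio_chunks])  # [4, 4, 2]
```

Whether `stream_chunk` accepts a shorter final chunk is not documented in this card; padding the tail to full length may be necessary.
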
## Training data
The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate).
## Intended use
- **Research** on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
- **Educational** exploration of ASR-constrained codec training and zero-shot speaker adaptation.
- **Engineering experimentation** as a building block for downstream speech-to-speech systems.
## Out-of-scope / responsible-use note
The post-quantization speaker adapter supports **zero-shot voice cloning** from a few seconds of reference audio. These weights **must not** be used for:
- impersonation, fraud, or any form of non-consensual voice synthesis;
- producing audio attributed to a real person without their explicit, informed consent;
- deceptive, harassing, or otherwise harmful generation.
Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.
## Limitations
- English-only training data; performance on other languages is untested.
- Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
## License
Apache License 2.0; see [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE).
The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.
## Citation
A technical report is in preparation. For now, please cite:
```bibtex
@misc{soviamate2026,
author = {Son Dang Dinh (Samson)},
title = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
year = {2026},
howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
```
## Contact
For research collaboration, dataset partnerships, or compute grants: **samson.ailabs@gmail.com** (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues).