---
license: apache-2.0
language:
  - en
tags:
  - audio
  - speech
  - neural-audio-codec
  - speech-codec
  - speech-llm
  - speech-to-speech
  - zero-shot-voice-cloning
  - speech-enhancement
  - asr
  - pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
---

# SoviaMate-Codec

Pretrained weights for **SoviaMate-Codec**, a neural audio codec designed from the ground up for integration with speech-aware large language models.

SoviaMate-Codec is the first released component of [**SoviaMate**](https://github.com/samson-ailabs/SoviaMate), an open research effort building toward end-to-end spoken dialogue systems.

> 🚧 **Status**: alpha research release. APIs are not stable; evaluation numbers are preliminary.

## What's in this repository

```
samson-ailabs/SoviaMate-Codec
├── neural_audio_codec/
│   ├── audio_codec_base.ckpt   # reconstruction codec
│   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
└── speaker_verification/
    ├── campplus.bin            # CAM++ speaker verifier
    ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
    └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
```

| Asset | Purpose |
|---|---|
| `neural_audio_codec/audio_codec_base.ckpt` | **Reconstruction codec.** Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) |
| `neural_audio_codec/audio_codec_spk.ckpt` | **Voice-conversion codec.** Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter, but an `spk` checkpoint must be paired at inference time with the same backbone it was trained with (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |

Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
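
To see what an export contains before building a bundle, the file can be opened directly. The sketch below assumes the checkpoint is an ordinary `torch.save` file with the two top-level keys described above, and that each entry of `model_weights` maps a module name to its `state_dict`. `load_export` and `summarize_export` are illustrative helper names, not part of SoviaMate.

```python
from typing import Any, Dict


def load_export(path: str) -> Dict[str, Any]:
    """Load an exported checkpoint onto the CPU (assumes a torch.save file)."""
    import torch  # imported lazily so summarize_export works without PyTorch

    # Recent PyTorch releases default to weights_only=True, which may reject
    # non-tensor config objects; only relax that for files you trust.
    return torch.load(path, map_location="cpu")


def summarize_export(ckpt: Dict[str, Any]) -> Dict[str, int]:
    """Return the parameter count per module found under `model_weights`."""
    summary = {}
    for module_name, state_dict in ckpt["model_weights"].items():
        # Anything exposing numel() (e.g. a torch.Tensor) counts as a parameter.
        summary[module_name] = sum(
            t.numel() for t in state_dict.values() if hasattr(t, "numel")
        )
    return summary
```

Calling `summarize_export(load_export("checkpoints/neural_audio_codec/audio_codec_base.ckpt"))` should list each module with its parameter count, while `ckpt["hyper_parameters"]` holds the architecture config.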

## Architecture at a glance

Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:

1. **ASR decoder *before* quantization** *(spk checkpoint only)*: A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
2. **Continuous features for LLM input**: Discrete tokens are used only for compression/transmission. The downstream LLM consumes the *pre-quantization* continuous features, avoiding quantization loss in the LLM input path.
3. **Speech enhancement as a training paradigm**: The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
4. **Post-quantization speaker adapter** *(spk checkpoint only)*: A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference.

Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation.
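
The AdaLN half of the speaker adapter in item 4 can be illustrated generically. The sketch below is textbook adaptive layer norm: normalize the features, then apply a scale and shift predicted from a conditioning vector. It is written in plain Python for readability and is not SoviaMate's implementation; every function name, the shapes, and the `(1 + scale)` parameterization are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def layer_norm(x: Vector, eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: Matrix, bias: Vector, vec: Vector) -> List[float]:
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def adaln(x: Vector, speaker_emb: Vector,
          w_scale: Matrix, b_scale: Vector,
          w_shift: Matrix, b_shift: Vector) -> List[float]:
    """Modulate normalized content features with a speaker embedding."""
    scale = linear(w_scale, b_scale, speaker_emb)  # one scale per feature
    shift = linear(w_shift, b_shift, speaker_emb)  # one shift per feature
    normed = layer_norm(x)
    return [(1.0 + s) * n + t for n, s, t in zip(normed, scale, shift)]
```

With all projection weights at zero the layer reduces to plain layer norm, which is why this parameterization is often initialized that way: the speaker signal is introduced gradually during training instead of perturbing the content path from the first step.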

## Load in Python

Download just what you need:
```bash
# Reconstruction only (base checkpoint)
hf download samson-ailabs/SoviaMate-Codec \
    --include "neural_audio_codec/audio_codec_base.ckpt" \
    --local-dir checkpoints

# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
hf download samson-ailabs/SoviaMate-Codec \
    --include "neural_audio_codec/audio_codec_spk.ckpt" \
    --include "speaker_verification/campplus.bin" \
    --local-dir checkpoints
```
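
The same selective download can be scripted from Python. This is a minimal sketch assuming the `huggingface_hub` package is installed; `snapshot_download` with `allow_patterns` is its standard call for fetching a subset of a repository, while `patterns_for` and `download` are hypothetical helper names introduced here.

```python
from typing import List

REPO_ID = "samson-ailabs/SoviaMate-Codec"


def patterns_for(task: str) -> List[str]:
    """Return the files needed for a task, mirroring the CLI commands above."""
    if task == "reconstruction":
        return ["neural_audio_codec/audio_codec_base.ckpt"]
    if task == "voice_conversion":
        # The spk checkpoint also needs the CAM++ verifier it was trained with.
        return [
            "neural_audio_codec/audio_codec_spk.ckpt",
            "speaker_verification/campplus.bin",
        ]
    raise ValueError(f"unknown task: {task!r}")


def download(task: str, local_dir: str = "checkpoints") -> str:
    """Fetch only the files for `task` and return the local directory path."""
    from huggingface_hub import snapshot_download  # lazy: needs network access

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(task),
        local_dir=local_dir,
    )
```

`download("voice_conversion")` should leave the same layout under `checkpoints/` as the second CLI command.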

Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are **not** interchangeable.

### Reconstruction: use the `base` checkpoint
```python
from soviamate.bundles import AudioCodecBundle

reconstructor = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
    device="cuda",  # or "cpu"
)

# Compress → decode
reconstructed, _ = reconstructor(source_audio)
```

### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint
```python
voice_converter = AudioCodecBundle.from_checkpoint(
    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
    device="cuda",
)

# Convert source speech to a target speaker via a 3–5 s reference
converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

# Voice conversion with an ASR transcript as a by-product
converted, transcript = voice_converter(
    source_audio, prompt_audios=target_speaker_audio, return_text=True
)
```

> ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; without one the decoder is under-conditioned and audio quality drops.
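
One way to make that rule hard to violate is to route calls through a thin wrapper that refuses to run without a prompt. The sketch below relies only on the call shape shown above (`bundle(source_audio, prompt_audios=..., return_text=True)`); `convert_voice` is a hypothetical helper, not part of SoviaMate.

```python
from typing import Any, Optional, Tuple


def convert_voice(bundle: Any, source_audio: Any, prompt_audio: Optional[Any],
                  return_text: bool = False) -> Tuple[Any, Any]:
    """Call an spk bundle, failing fast when the speaker prompt is missing."""
    if prompt_audio is None:
        raise ValueError(
            "the spk checkpoint needs a speaker prompt; "
            "use the base checkpoint for plain reconstruction"
        )
    kwargs = {"prompt_audios": prompt_audio}
    if return_text:
        # Only forward the flag when requested, so the bundle's own default applies otherwise.
        kwargs["return_text"] = True
    return bundle(source_audio, **kwargs)
```

Usage would look like `converted, transcript = convert_voice(voice_converter, source_audio, target_speaker_audio, return_text=True)`.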

### Streaming (low-latency inference)

Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.

```python
# Reconstruction streaming (base checkpoint)
state = reconstructor.init_stream(chunk_size=8)
for chunk in audio_chunks:
    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

# Voice-conversion streaming (spk checkpoint)
state = voice_converter.init_stream(
    chunk_size=8,
    prompt_audio=target_speaker_audio,
    return_text=True,  # optional incremental transcript
)
for chunk in audio_chunks:
    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
```

See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API.
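
Because streaming latency has not been benchmarked yet, it can be useful to time each chunk while driving the loop. The helper below is a sketch built only on the `init_stream` / `stream_chunk` calls shown above. `run_stream` is a hypothetical name, and treating a `None` text chunk as "no transcript" is an assumption about the API.

```python
import time
from typing import Any, Iterable, List, Tuple


def run_stream(bundle: Any, audio_chunks: Iterable[Any],
               **init_kwargs: Any) -> Tuple[List[Any], List[Any], List[float]]:
    """Drive a streaming session and record wall-clock seconds per chunk."""
    state = bundle.init_stream(**init_kwargs)
    waveforms, texts, latencies = [], [], []
    for chunk in audio_chunks:
        start = time.perf_counter()
        waveform_chunk, text_chunk, state = bundle.stream_chunk(chunk, state)
        latencies.append(time.perf_counter() - start)
        waveforms.append(waveform_chunk)
        if text_chunk is not None:  # assumed: None when no transcript is produced
            texts.append(text_chunk)
    return waveforms, texts, latencies
```

For example, `run_stream(voice_converter, audio_chunks, chunk_size=8, prompt_audio=target_speaker_audio, return_text=True)` mirrors the voice-conversion loop above. On a GPU, kernels may run asynchronously, so these wall-clock numbers are only meaningful if the device is synchronized before each timestamp.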

## Training data

The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate).

## Intended use

- **Research** on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
- **Educational** exploration of ASR-constrained codec training and zero-shot speaker adaptation.
- **Engineering experimentation** as a building block for downstream speech-to-speech systems.

## Out-of-scope / responsible-use note

The post-quantization speaker adapter supports **zero-shot voice cloning** from a few seconds of reference audio. These weights **must not** be used for:
- impersonation, fraud, or any form of non-consensual voice synthesis;
- producing audio attributed to a real person without their explicit, informed consent;
- deceptive, harassing, or otherwise harmful generation.

Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.

## Limitations

- English-only training data; performance on other languages is untested.
- Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
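
Until published numbers are available, two of the metrics named above are simple enough to compute locally. The sketch below gives a standard word-level edit-distance WER and the cosine similarity that SECS-style scoring applies to two speaker embeddings. It assumes whitespace tokenization with no text normalization, and that the embeddings have already been extracted by a verifier such as `wavlm_ecapa.pth`; neither function is part of SoviaMate.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

Comparing the transcript returned with `return_text=True` against a ground-truth transcript gives a rough WER, and comparing embeddings of the converted audio and the reference prompt gives a rough speaker-similarity score. Neither replaces a controlled benchmark.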

## License

Apache License 2.0; see [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE).

The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.

## Citation

A technical report is in preparation. For now, please cite:

```bibtex
@misc{soviamate2026,
  author       = {Son Dang Dinh (Samson)},
  title        = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
  year         = {2026},
  howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
}
```

## Contact

For research collaboration, dataset partnerships, or compute grants: **samson.ailabs@gmail.com** (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues).