samson-ailabs committed on
Commit 574cd84 · verified · 1 Parent(s): febb3b8

Initial v0.1 alpha release

README.md ADDED
@@ -0,0 +1,177 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - audio
+ - speech
+ - neural-audio-codec
+ - speech-codec
+ - speech-llm
+ - speech-to-speech
+ - zero-shot-voice-cloning
+ - speech-enhancement
+ - asr
+ - pytorch
+ library_name: pytorch
+ pipeline_tag: audio-to-audio
+ ---
+
+ # SoviaMate-Codec
+
+ Pretrained weights for **SoviaMate-Codec**, a neural audio codec designed from the ground up for integration with speech-aware large language models.
+
+ SoviaMate-Codec is the first released component of [**SoviaMate**](https://github.com/samson-ailabs/SoviaMate), an open research effort building toward end-to-end spoken dialogue systems.
+
+ > 🚧 **Status**: alpha research release. APIs are not stable; evaluation numbers are preliminary.
+
+ ## What's in this repository
+
+ ```
+ samson-ailabs/SoviaMate-Codec
+ ├── neural_audio_codec/
+ │   ├── audio_codec_base.ckpt   # reconstruction codec
+ │   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
+ └── speaker_verification/
+     ├── campplus.bin            # CAM++ speaker verifier
+     ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
+     └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
+ ```
+
+ | Asset | Purpose | Size |
+ |---|---|---|
+ | `neural_audio_codec/audio_codec_base.ckpt` | **Reconstruction codec.** Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) | ~713 MiB |
+ | `neural_audio_codec/audio_codec_spk.ckpt` | **Voice-conversion codec.** Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. | ~939 MiB |
+ | `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are alternative backbones for the speaker adapter; the one used at training time is also required at inference time for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). | ~1.3 GiB total |
+
+ Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
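The export layout described above can be sanity-checked before handing a file to the bundle loader. The following is a minimal sketch, assuming the checkpoint deserializes to a plain dictionary with the two top-level keys named above (for example via `torch.load(path, map_location="cpu")`, which is an assumption and not confirmed by the source). The helper `summarize_export` and the toy dictionary are hypothetical and not part of SoviaMate.

```python
from typing import Any, Mapping

# Top-level keys named in the README; everything else here is illustrative.
REQUIRED_KEYS = ("model_weights", "hyper_parameters")


def summarize_export(ckpt: Mapping[str, Any]) -> dict:
    """Check the assumed top-level layout and count entries per module."""
    missing = [key for key in REQUIRED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"not a portable export, missing keys: {missing}")
    # model_weights is described as per-module: {module_name: state_dict}
    modules = {
        name: len(state_dict) for name, state_dict in ckpt["model_weights"].items()
    }
    return {
        "modules": modules,
        "hyper_parameter_keys": sorted(ckpt["hyper_parameters"]),
    }


# Toy stand-in for a loaded checkpoint. Module names, parameter names, and the
# sample_rate entry are hypothetical; real tensors are replaced by None.
toy_ckpt = {
    "model_weights": {
        "encoder": {"conv.weight": None, "conv.bias": None},
        "decoder": {"proj.weight": None},
    },
    "hyper_parameters": {"sample_rate": 16000},
}
summary = summarize_export(toy_ckpt)
print(summary)
```

A check like this fails fast with a readable error when a full training checkpoint is passed by mistake, instead of failing somewhere inside model construction.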
48
+
49
+ ## Architecture at a glance
50
+
51
+ Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:
52
+
53
+ 1. **ASR decoder *before* quantization** *(spk checkpoint only)* β€” A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
54
+ 2. **Continuous features for LLM input** β€” Discrete tokens are used only for compression/transmission. The downstream LLM consumes the *pre-quantization* continuous features, avoiding quantization loss in the LLM input path.
55
+ 3. **Speech enhancement as a training paradigm** β€” The codec is trained noisy-in β†’ clean-out, so the encoder learns to discard noise rather than encode it.
56
+ 4. **Post-quantization speaker adapter** *(spk checkpoint only)* β€” A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference.
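For readers unfamiliar with the AdaLN term in item 4, the generic form of adaptive layer normalization from the literature is shown below. This is only the textbook formulation; the exact variant used in SoviaMate-Codec (and how it is combined with cross-attention) is not stated here, so the symbols are assumptions: $h$ stands for the post-quantization features, $s$ for a speaker embedding, and $\gamma$, $\beta$ for learned projections of $s$.

```latex
\mathrm{AdaLN}(h, s) = \gamma(s) \odot \mathrm{LayerNorm}(h) + \beta(s)
```

Because the scale and shift depend only on $s$, swapping the speaker embedding changes the conditioning without touching the content carried by $h$.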
+
+ Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation.
+
+ ## Load in Python
+
+ Download just what you need:
+ ```bash
+ # Reconstruction only (base checkpoint)
+ hf download samson-ailabs/SoviaMate-Codec \
+   --include "neural_audio_codec/audio_codec_base.ckpt" \
+   --local-dir checkpoints
+
+ # Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
+ hf download samson-ailabs/SoviaMate-Codec \
+   --include "neural_audio_codec/audio_codec_spk.ckpt" \
+   --include "speaker_verification/campplus.bin" \
+   --local-dir checkpoints
+ ```
+
+ Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are **not** interchangeable.
+
+ ### Reconstruction: use the `base` checkpoint
+ ```python
+ from soviamate.bundles import AudioCodecBundle
+
+ reconstructor = AudioCodecBundle.from_checkpoint(
+     "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
+     device="cuda",  # or "cpu"
+ )
+
+ # Compress → decode
+ # source_audio: the input speech waveform, in the format the bundle API expects
+ reconstructed, _ = reconstructor(source_audio)
+ ```
+
+ ### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint
+ ```python
+ from soviamate.bundles import AudioCodecBundle
+
+ voice_converter = AudioCodecBundle.from_checkpoint(
+     "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
+     device="cuda",
+ )
+
+ # Convert source speech to a target speaker via a 3–5 s reference
+ converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)
+
+ # Voice conversion with an ASR transcript as a by-product
+ converted, transcript = voice_converter(
+     source_audio, prompt_audios=target_speaker_audio, return_text=True
+ )
+ ```
+
+ > ⚠️ Do not call the `spk` bundle without `prompt_audios`: the speaker adapter expects a prompt at inference time, and without one the decoder is under-conditioned and audio quality drops.
+
+ ### Streaming (low-latency inference)
+
+ Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.
+
+ ```python
+ # audio_chunks: an iterable of consecutive chunks of the input audio
+
+ # Reconstruction streaming (base checkpoint)
+ state = reconstructor.init_stream(chunk_size=8)
+ for chunk in audio_chunks:
+     waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)
+
+ # Voice-conversion streaming (spk checkpoint)
+ state = voice_converter.init_stream(
+     chunk_size=8,
+     prompt_audio=target_speaker_audio,
+     return_text=True,  # optional incremental transcript
+ )
+ for chunk in audio_chunks:
+     waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
+ ```
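The streaming loops above iterate over `audio_chunks` without showing where the chunks come from. One way to produce them from a buffered signal is sketched below, using plain Python sequences. The helper `iter_chunks` and the `chunk_len` value are hypothetical: the source does not state how `chunk_size=8` maps to a number of samples, what container type `stream_chunk` expects, or whether a short final chunk needs padding.

```python
from typing import Iterator, Sequence


def iter_chunks(samples: Sequence[float], chunk_len: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length slices; the last one may be shorter."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# Ten dummy samples split into chunks of four (sizes chosen only for illustration).
audio_chunks = list(iter_chunks(list(range(10)), chunk_len=4))
print([len(chunk) for chunk in audio_chunks])  # → [4, 4, 2]
```

In a live setting the same loop shape applies, with the generator replaced by whatever yields microphone or network buffers.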
+
+ See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API.
+
+ ## Training data
+
+ The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate).
+
+ ## Intended use
+
+ - **Research** on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
+ - **Educational** exploration of ASR-constrained codec training and zero-shot speaker adaptation.
+ - **Engineering experimentation** as a building block for downstream speech-to-speech systems.
+
+ ## Out-of-scope / responsible-use note
+
+ The post-quantization speaker adapter supports **zero-shot voice cloning** from a few seconds of reference audio. These weights **must not** be used for:
+ - impersonation, fraud, or any form of non-consensual voice synthesis;
+ - producing audio attributed to a real person without their explicit, informed consent;
+ - deceptive, harassing, or otherwise harmful generation.
+
+ Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.
+
+ ## Limitations
+
+ - English-only training data; performance on other languages is untested.
+ - Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
+ - Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
+
+ ## License
+
+ Apache License 2.0; see [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE).
+
+ The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.
+
+ ## Citation
+
+ A technical report is in preparation. For now, please cite:
+
+ ```bibtex
+ @misc{soviamate2026,
+   author = {Son Dang Dinh (Samson)},
+   title = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
+   year = {2026},
+   howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
+ }
+ ```
+
+ ## Contact
+
+ For research collaboration, dataset partnerships, or compute grants: **samson.ailabs@gmail.com** (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues).
neural_audio_codec/audio_codec_base.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cb3650f96718e620e5f2cf37e676046c7274f07142723a2ba9fdbe04fdea3252
+ size 747544365
neural_audio_codec/audio_codec_spk.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf1c41f429cb58d46aa24e12246cba0c788a3362591c84b48f656f9379ba72fa
+ size 984911111
speaker_verification/campplus.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3388cf5fd3493c9ac9c69851d8e7a8badcfb4f3dc631020c4961371646d5ada8
+ size 28036335
speaker_verification/eres2netv2.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0eb4057106b2573dd7b132cf0c36273ab29afd192c1610f80baa9c556dbb963c
+ size 71768231
speaker_verification/wavlm_ecapa.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51f07e3b94d9e0262a6a675ef5a087be3dd09e8c62e9d886827f44f82fe7f94b
+ size 1301926579
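Each pointer file above records the SHA-256 digest and byte size of the real artifact, so a downloaded file can be checked against it. The sketch below parses a pointer and compares it with a local file using only the standard library. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and the local path in the usage note assumes the `--local-dir checkpoints` layout from the README.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Return True if the file's size and hash both match the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a truncated file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in 1 MiB blocks so large checkpoints are not loaded into memory.
        for block in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(block)
    return hasher.hexdigest() == pointer["digest"]


# Pointer for neural_audio_codec/audio_codec_base.ckpt, copied from this commit.
BASE_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cb3650f96718e620e5f2cf37e676046c7274f07142723a2ba9fdbe04fdea3252
size 747544365
"""
pointer = parse_lfs_pointer(BASE_POINTER)
print(f"{pointer['size'] / 2**20:.1f} MiB")
```

After a download, `matches_pointer(Path("checkpoints/neural_audio_codec/audio_codec_base.ckpt"), pointer)` should return `True` for an intact file.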