| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - audio |
| - speech |
| - neural-audio-codec |
| - speech-codec |
| - speech-llm |
| - speech-to-speech |
| - zero-shot-voice-cloning |
| - speech-enhancement |
| - asr |
| - pytorch |
| library_name: pytorch |
| pipeline_tag: audio-to-audio |
| --- |
| |
| # SoviaMate-Codec |
|
|
| Pretrained weights for **SoviaMate-Codec**, a neural audio codec designed from the ground up for integration with speech-aware large language models. |
|
|
| SoviaMate-Codec is the first released component of [**SoviaMate**](https://github.com/samson-ailabs/SoviaMate), an open research effort building toward end-to-end spoken dialogue systems. |
|
|
| > 🚧 **Status**: alpha research release. APIs are not stable; evaluation numbers are preliminary. |
|
|
| ## What's in this repository |
|
|
| ``` |
| samson-ailabs/SoviaMate-Codec |
| ├── neural_audio_codec/ |
| │   ├── audio_codec_base.ckpt     # reconstruction codec |
| │   └── audio_codec_spk.ckpt      # voice-conversion codec (+ ASR head) |
| └── speaker_verification/ |
|     ├── campplus.bin              # CAM++ speaker verifier |
|     ├── eres2netv2.ckpt           # ERes2Net-v2 speaker verifier |
|     └── wavlm_ecapa.pth           # WavLM + ECAPA-TDNN speaker verifier |
| ``` |
|
|
| | Asset | Purpose | |
| |---|---| |
| | `neural_audio_codec/audio_codec_base.ckpt` | **Reconstruction codec.** Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) | |
| | `neural_audio_codec/audio_codec_spk.ckpt` | **Voice-conversion codec.** Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt: running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. | |
| | `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter, but the backbone used to train a given `spk` checkpoint is also required at inference time (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). | |
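
SECS-style scoring, mentioned above for `wavlm_ecapa.pth`, usually reduces to the cosine similarity between two speaker embeddings. The sketch below shows only that final step in plain Python. It assumes the embeddings have already been extracted as equal-length float vectors; `cosine_similarity` is an illustrative helper, not part of SoviaMate.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A higher score between the converted audio and the reference speaker suggests closer voice identity. Any accept/reject threshold depends on the verifier and is not specified here.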
|
|
| Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded. |
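
To check what an export contains before building a bundle, the two top-level keys can be inspected directly. This is a minimal sketch that assumes the file loads as a plain dictionary and that `hyper_parameters` is dict-like; `summarize_export` is a hypothetical helper, not part of SoviaMate.

```python
from typing import Any, Dict, List


def summarize_export(ckpt: Dict[str, Any]) -> Dict[str, List[str]]:
    """List the module names and config keys found in an exported checkpoint dict."""
    missing = [k for k in ("model_weights", "hyper_parameters") if k not in ckpt]
    if missing:
        raise KeyError(f"not a portable export, missing keys: {missing}")
    return {
        "modules": sorted(ckpt["model_weights"]),         # one entry per module state_dict
        "config_keys": sorted(ckpt["hyper_parameters"]),  # architecture config entries
    }


# Typical use (requires PyTorch; only load checkpoint files you trust):
#   import torch
#   ckpt = torch.load(
#       "checkpoints/neural_audio_codec/audio_codec_base.ckpt", map_location="cpu"
#   )
#   print(summarize_export(ckpt))
```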
|
|
| ## Architecture at a glance |
|
|
| Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC: |
|
|
| 1. **ASR decoder *before* quantization** *(spk checkpoint only)*: A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed. |
| 2. **Continuous features for LLM input**: Discrete tokens are used only for compression/transmission. The downstream LLM consumes the *pre-quantization* continuous features, avoiding quantization loss in the LLM input path. |
| 3. **Speech enhancement as a training paradigm**: The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it. |
| 4. **Post-quantization speaker adapter** *(spk checkpoint only)*: A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference. |
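
To make the AdaLN half of item 4 concrete, here is one common formulation of adaptive layer normalization in plain Python: the features are normalized, then scaled and shifted by values projected from a speaker embedding. It is a generic illustration, not the adapter's actual code. The function names, the `(1 + scale)` form, and the weight shapes are assumptions, and the cross-attention branch is omitted.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Sequence[Sequence[float]], b: Sequence[float], v: Sequence[float]) -> List[float]:
    """Dense projection: one output per row of w."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def adaln(x, speaker, w_scale, b_scale, w_shift, b_shift) -> List[float]:
    """Adaptive LayerNorm: (1 + scale(speaker)) * LN(x) + shift(speaker)."""
    scale = linear(w_scale, b_scale, speaker)
    shift = linear(w_shift, b_shift, speaker)
    return [(1.0 + s) * n + t for n, s, t in zip(layer_norm(x), scale, shift)]
```

With all projection weights at zero the function reduces to plain layer normalization, so the content features pass through unchanged and the speaker embedding only modulates them.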
|
|
| Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation. |
|
|
| ## Load in Python |
|
|
| Download just what you need: |
| ```bash |
| # Reconstruction only (base checkpoint) |
| hf download samson-ailabs/SoviaMate-Codec \ |
|     --include "neural_audio_codec/audio_codec_base.ckpt" \ |
|     --local-dir checkpoints |
| |
| # Voice conversion (spk checkpoint + the campplus speaker verifier it depends on) |
| hf download samson-ailabs/SoviaMate-Codec \ |
|     --include "neural_audio_codec/audio_codec_spk.ckpt" \ |
|     --include "speaker_verification/campplus.bin" \ |
|     --local-dir checkpoints |
| ``` |
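
The same selective download can be scripted from Python with `huggingface_hub.snapshot_download` and its `allow_patterns` argument. This is a sketch, not an official SoviaMate utility: the task labels `"reconstruction"` and `"voice_conversion"` and both helper names are made up for illustration.

```python
from typing import List

REPO_ID = "samson-ailabs/SoviaMate-Codec"


def patterns_for(task: str) -> List[str]:
    """Files needed for a task (the task labels are hypothetical)."""
    if task == "reconstruction":
        return ["neural_audio_codec/audio_codec_base.ckpt"]
    if task == "voice_conversion":
        return [
            "neural_audio_codec/audio_codec_spk.ckpt",
            "speaker_verification/campplus.bin",
        ]
    raise ValueError(f"unknown task: {task!r}")


def download(task: str, local_dir: str = "checkpoints") -> str:
    """Download only the files for `task`; returns the local folder path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(task),
        local_dir=local_dir,
    )
```

Calling `download("voice_conversion")` should leave the same layout under `checkpoints/` as the shell commands above.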
|
|
| Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task; they are **not** interchangeable. |
|
|
| ### Reconstruction: use the `base` checkpoint |
| ```python |
| from soviamate.bundles import AudioCodecBundle |
| |
| reconstructor = AudioCodecBundle.from_checkpoint( |
|     "checkpoints/neural_audio_codec/audio_codec_base.ckpt", |
|     device="cuda",  # or "cpu" |
| ) |
| |
| # source_audio: input speech, loaded beforehand (see codec.py for the expected format) |
| # Compress → decode; the second return value is unused here |
| reconstructed, _ = reconstructor(source_audio) |
| ``` |
|
|
| ### Voice conversion (+ optional ASR transcript): use the `spk` checkpoint |
| ```python |
| from soviamate.bundles import AudioCodecBundle |
| |
| voice_converter = AudioCodecBundle.from_checkpoint( |
|     "checkpoints/neural_audio_codec/audio_codec_spk.ckpt", |
|     device="cuda", |
| ) |
| |
| # Convert source speech to a target speaker via a 3–5 s reference |
| converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio) |
| |
| # Voice conversion with an ASR transcript as a by-product |
| converted, transcript = voice_converter( |
|     source_audio, prompt_audios=target_speaker_audio, return_text=True |
| ) |
| ``` |
|
|
| > ⚠️ Do not call the `spk` bundle without `prompt_audios`. The speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops. |
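
Since the adapter is described as working from a 3–5 s reference, a quick duration check before conversion can catch prompts that are too short. The helper below is hypothetical. The sample rate must come from your own audio loader, because the codec's expected rate is not stated here, and treating 3–5 s as hard bounds is an assumption.

```python
def prompt_duration_ok(
    num_samples: int,
    sample_rate: int,
    min_seconds: float = 3.0,
    max_seconds: float = 5.0,
) -> bool:
    """Return True if a reference clip falls inside the suggested 3-5 s window."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    duration = num_samples / sample_rate  # clip length in seconds
    return min_seconds <= duration <= max_seconds
```

For example, a 4 s clip at a 16 kHz sample rate (64,000 samples) passes, while a 1 s clip does not.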
| |
| ### Streaming (low-latency inference) |
| |
| Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back. |
| |
| ```python |
| # Reconstruction streaming (base checkpoint) |
| state = reconstructor.init_stream(chunk_size=8) |
| for chunk in audio_chunks: |
|     waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state) |
| |
| # Voice-conversion streaming (spk checkpoint) |
| state = voice_converter.init_stream( |
|     chunk_size=8, |
|     prompt_audio=target_speaker_audio, |
|     return_text=True,  # optional incremental transcript |
| ) |
| for chunk in audio_chunks: |
|     waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state) |
| ``` |
| |
| See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API. |
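
The streaming examples iterate over `audio_chunks` without showing how it is built. One way to produce fixed-size chunks from an already-loaded waveform is sketched below. `iter_chunks` and `samples_per_chunk` are illustrative names. How many samples correspond to `chunk_size=8` is not stated here, so take that value from `codec.py` rather than from this sketch.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(samples: Sequence[T], samples_per_chunk: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `samples`; the last slice may be shorter."""
    if samples_per_chunk <= 0:
        raise ValueError("samples_per_chunk must be positive")
    for start in range(0, len(samples), samples_per_chunk):
        yield samples[start:start + samples_per_chunk]
```

The generator works on anything that supports `len()` and slicing, such as a list or a 1-D array. Whether `stream_chunk` accepts a shorter final chunk is not documented here, so padding may be needed.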
|
|
| ## Training data |
|
|
| The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available; contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate). |
|
|
| ## Intended use |
|
|
| - **Research** on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems. |
| - **Educational** exploration of ASR-constrained codec training and zero-shot speaker adaptation. |
| - **Engineering experimentation** as a building block for downstream speech-to-speech systems. |
|
|
| ## Out-of-scope / responsible-use note |
|
|
| The post-quantization speaker adapter supports **zero-shot voice cloning** from a few seconds of reference audio. These weights **must not** be used for: |
| - impersonation, fraud, or any form of non-consensual voice synthesis; |
| - producing audio attributed to a real person without their explicit, informed consent; |
| - deceptive, harassing, or otherwise harmful generation. |
|
|
| Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies. |
|
|
| ## Limitations |
|
|
| - English-only training data; performance on other languages is untested. |
| - Preliminary checkpoint: comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published. |
| - Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput. |
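
Until official benchmarks are published, WER can be measured locally by comparing a transcript against a reference. The sketch below uses the standard definition, word-level edit distance divided by the number of reference words. The lowercase-and-whitespace-split normalization is an assumption, and `word_error_rate` is an illustrative helper rather than a SoviaMate function.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)
```

Published comparisons usually apply stricter text normalization (punctuation, numerals), so scores from this sketch are only indicative.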
|
|
| ## License |
|
|
| Apache License 2.0; see [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE). |
|
|
| The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them. |
|
|
| ## Citation |
|
|
| A technical report is in preparation. For now, please cite: |
|
|
| ```bibtex |
| @misc{soviamate2026, |
| author = {Son Dang Dinh (Samson)}, |
| title = {SoviaMate: Toward End-to-End Spoken Dialogue Systems}, |
| year = {2026}, |
| howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}}, |
| } |
| ``` |
|
|
| ## Contact |
|
|
| For research collaboration, dataset partnerships, or compute grants: **samson.ailabs@gmail.com** (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues). |
|
|