	---
	license: apache-2.0
	language:
	- en
	tags:
	- audio
	- speech
	- neural-audio-codec
	- speech-codec
	- speech-llm
	- speech-to-speech
	- zero-shot-voice-cloning
	- speech-enhancement
	- asr
	- pytorch
	library_name: pytorch
	pipeline_tag: audio-to-audio
	---

	# SoviaMate-Codec

	Pretrained weights for SoviaMate-Codec, a neural audio codec designed from the ground up for integration with speech-aware large language models.

	SoviaMate-Codec is the first released component of [SoviaMate](https://github.com/samson-ailabs/SoviaMate) — an open research effort building toward end-to-end spoken dialogue systems.

	> 🚧 Status: alpha research release. APIs are not stable; evaluation numbers are preliminary.

	## What's in this repository

	```
	samson-ailabs/SoviaMate-Codec
	├── neural_audio_codec/
	│   ├── audio_codec_base.ckpt   # reconstruction codec
	│   └── audio_codec_spk.ckpt    # voice-conversion codec (+ ASR head)
	└── speaker_verification/
	    ├── campplus.bin            # CAM++ speaker verifier
	    ├── eres2netv2.ckpt         # ERes2Net-v2 speaker verifier
	    └── wavlm_ecapa.pth         # WavLM + ECAPA-TDNN speaker verifier
	```

	| Asset | Purpose |
	|---|---|
	| `neural_audio_codec/audio_codec_base.ckpt` | Reconstruction codec. Encoder + quantizer + decoder, trained as a standard compress / reconstruct codec without the speaker-adaptation objective. Use for low-bitrate speech coding and feature extraction. (No ASR head.) |
	| `neural_audio_codec/audio_codec_spk.ckpt` | Voice-conversion codec. Adds the integrated ASR head and the post-quantization speaker adapter trained for zero-shot voice swapping from a 3–5 s reference. Always pass a speaker prompt — running it without one under-conditions the decoder and degrades quality. Use `base` for plain reconstruction. |
	| `speaker_verification/*` | Pretrained speaker-embedding extractors. `campplus.bin` and `eres2netv2.ckpt` are interchangeable backbones for the speaker adapter — whichever was used at training is also required at inference time for that `spk` checkpoint (this release uses `campplus.bin`). `wavlm_ecapa.pth` is for evaluation only (e.g., SECS-style speaker-similarity scoring). |
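
The SECS-style score mentioned for `wavlm_ecapa.pth` is commonly computed as the cosine similarity between two speaker embeddings. A minimal sketch, assuming the embeddings have already been extracted as plain float sequences; the extractor call is not shown, and `secs` is a hypothetical helper name, not part of SoviaMate:

```python
import math
from typing import Sequence


def secs(embedding_a: Sequence[float], embedding_b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (range -1 to 1)."""
    if len(embedding_a) != len(embedding_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(embedding_a, embedding_b))
    norm_a = math.sqrt(sum(a * a for a in embedding_a))
    norm_b = math.sqrt(sum(b * b for b in embedding_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

A higher score between the converted audio and the reference speaker suggests the target voice was matched more closely.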

	Each codec checkpoint is a portable export containing `model_weights` (per-module `state_dict`) and `hyper_parameters` (architecture config), produced by `AudioCodecTask.export_model()`. Optimizer state, discriminators, and other training-only components are excluded.
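
To see what an export contains before loading it into a bundle, something like the following may help. It is a sketch under two assumptions that the text above does not confirm: that the file is a torch-serialized dict, and that each entry of `model_weights` is a mapping. `summarize_export` and `load_export` are hypothetical helper names.

```python
from typing import Any, Dict


def summarize_export(export: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a checkpoint dict holding `model_weights` and `hyper_parameters`."""
    missing = {"model_weights", "hyper_parameters"} - set(export)
    if missing:
        raise KeyError(f"not a portable export, missing keys: {sorted(missing)}")
    weights = export["model_weights"]
    return {
        # one state_dict per module, so report how many entries each one holds
        "tensors_per_module": {name: len(sd) for name, sd in weights.items()},
        "hyper_parameter_keys": sorted(export["hyper_parameters"]),
    }


def load_export(path: str) -> Dict[str, Any]:
    """Assumption: the checkpoint is a torch-serialized dict."""
    import torch  # imported lazily so summarize_export works without torch

    # weights_only=False unpickles arbitrary objects; use only on trusted files
    return torch.load(path, map_location="cpu", weights_only=False)
```

For example, `summarize_export(load_export("checkpoints/neural_audio_codec/audio_codec_base.ckpt"))` would list the exported modules and the architecture keys, if the assumptions hold.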

	## Architecture at a glance

	Four design choices distinguish SoviaMate-Codec from EnCodec / SoundStream / DAC:

	1. ASR decoder *before* quantization (spk checkpoint only) — A lightweight ASR head reads the encoder's continuous features. Its gradient forces linguistic content into the representation, so semantic fidelity is directly measurable (WER), not assumed.
	2. Continuous features for LLM input — Discrete tokens are used only for compression/transmission. The downstream LLM consumes the pre-quantization continuous features, avoiding quantization loss in the LLM input path.
	3. Speech enhancement as a training paradigm — The codec is trained noisy-in → clean-out, so the encoder learns to discard noise rather than encode it.
	4. Post-quantization speaker adapter (spk checkpoint only) — A hybrid AdaLN + cross-attention adapter injects voice identity after quantization. This decouples "what is said" from "who says it" and enables zero-shot voice swapping from a 3–5 s reference.
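
Point 2 can be illustrated with a toy nearest-neighbour quantizer. This is a generic sketch, not the SoviaMate quantizer: the two-entry codebook and the feature values are made up, and the only point is that the token path carries a reconstruction error that the continuous path does not.

```python
from typing import List, Sequence, Tuple


def quantize(
    feature: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return (token index, codebook vector) of the nearest codebook entry."""

    def sq_dist(entry: Sequence[float]) -> float:
        return sum((f - e) ** 2 for f, e in zip(feature, entry))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, list(codebook[index])


codebook = [[0.0, 0.0], [1.0, 1.0]]  # made-up 2-entry codebook
feature = [0.9, 0.6]                 # made-up continuous encoder output

# Compression path: the feature becomes a token, then a codebook vector.
token, dequantized = quantize(feature, codebook)
error = sum((f - d) ** 2 for f, d in zip(feature, dequantized))

# LLM input path: the continuous feature is passed through unchanged.
llm_input = feature
```

Here `error` is non-zero on the token path, while `llm_input` is exactly the encoder output.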

	Full architecture write-up: [SoviaMate repository](https://github.com/samson-ailabs/SoviaMate). A technical report is in preparation.

	## Load in Python

	Download just what you need:
	```bash
	# Reconstruction only (base checkpoint)
	hf download samson-ailabs/SoviaMate-Codec \
	  --include "neural_audio_codec/audio_codec_base.ckpt" \
	  --local-dir checkpoints

	# Voice conversion (spk checkpoint + the campplus speaker verifier it depends on)
	hf download samson-ailabs/SoviaMate-Codec \
	  --include "neural_audio_codec/audio_codec_spk.ckpt" \
	  --include "speaker_verification/campplus.bin" \
	  --local-dir checkpoints
	```
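
The same selective download can be done from Python. A sketch, assuming `huggingface_hub` is installed; `patterns_for` and `download_assets` are hypothetical helpers, and the file lists simply mirror the two CLI commands above:

```python
from typing import List

REPO_ID = "samson-ailabs/SoviaMate-Codec"

# Files each task needs, mirroring the CLI commands above.
TASK_FILES = {
    "reconstruction": ["neural_audio_codec/audio_codec_base.ckpt"],
    "voice_conversion": [
        "neural_audio_codec/audio_codec_spk.ckpt",
        "speaker_verification/campplus.bin",
    ],
}


def patterns_for(task: str) -> List[str]:
    """Return the repository paths required for a task."""
    if task not in TASK_FILES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_FILES)}")
    return list(TASK_FILES[task])


def download_assets(task: str, local_dir: str = "checkpoints") -> str:
    """Download only the files for one task and return the local directory."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(task),
        local_dir=local_dir,
    )
```

Calling `download_assets("voice_conversion")` should leave the `spk` checkpoint and `campplus.bin` under `checkpoints/`, at the same relative paths the loading examples expect.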

	Then, after installing SoviaMate (see [Getting started](https://github.com/samson-ailabs/SoviaMate#getting-started)), load a checkpoint into an `AudioCodecBundle`. Pick the checkpoint that matches the task — they are not interchangeable.

	### Reconstruction — use the `base` checkpoint
	```python
	from soviamate.bundles import AudioCodecBundle

	reconstructor = AudioCodecBundle.from_checkpoint(
	    "checkpoints/neural_audio_codec/audio_codec_base.ckpt",
	    device="cuda",  # or "cpu"
	)

	# Compress → decode
	reconstructed, _ = reconstructor(source_audio)
	```

	### Voice conversion (+ optional ASR transcript) — use the `spk` checkpoint
	```python
	from soviamate.bundles import AudioCodecBundle

	voice_converter = AudioCodecBundle.from_checkpoint(
	    "checkpoints/neural_audio_codec/audio_codec_spk.ckpt",
	    device="cuda",
	)

	# Convert source speech to a target speaker via a 3–5 s reference
	converted, _ = voice_converter(source_audio, prompt_audios=target_speaker_audio)

	# Voice conversion with an ASR transcript as a by-product
	converted, transcript = voice_converter(
	    source_audio, prompt_audios=target_speaker_audio, return_text=True
	)
	```

	> ⚠️ Do not call the `spk` bundle without `prompt_audios` — the speaker adapter expects a prompt at inference time; calling it without one leaves the decoder under-conditioned and audio quality drops.
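
One way to make the missing-prompt mistake fail loudly is a thin wrapper around the bundle call. This is a sketch, not part of SoviaMate: `convert_voice` is a hypothetical helper, and it relies only on the call signature shown in the example above.

```python
from typing import Any, Optional, Tuple


def convert_voice(
    bundle: Any,
    source_audio: Any,
    prompt_audio: Optional[Any],
    return_text: bool = False,
) -> Tuple[Any, Any]:
    """Call an `spk` bundle, refusing to run without a speaker prompt."""
    if prompt_audio is None:
        raise ValueError(
            "the spk checkpoint needs a speaker prompt; "
            "use the base checkpoint for plain reconstruction"
        )
    kwargs = {"prompt_audios": prompt_audio}
    if return_text:
        kwargs["return_text"] = True  # only pass the flag when requested
    return bundle(source_audio, **kwargs)
```

With this in place, `convert_voice(voice_converter, source_audio, None)` raises a `ValueError` instead of producing degraded audio.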

	### Streaming (low-latency inference)

	Both bundles expose the same streaming API; the call signature differs only in whether you pass a speaker prompt and whether a transcript comes back.

	```python
	# Reconstruction streaming (base checkpoint)
	state = reconstructor.init_stream(chunk_size=8)
	for chunk in audio_chunks:
	    waveform_chunk, _, state = reconstructor.stream_chunk(chunk, state)

	# Voice-conversion streaming (spk checkpoint)
	state = voice_converter.init_stream(
	    chunk_size=8,
	    prompt_audio=target_speaker_audio,
	    return_text=True,  # optional incremental transcript
	)
	for chunk in audio_chunks:
	    waveform_chunk, text_chunk, state = voice_converter.stream_chunk(chunk, state)
	```

	See [`soviamate/bundles/codec.py`](https://github.com/samson-ailabs/SoviaMate/blob/main/soviamate/bundles/codec.py) for the full API.
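
The streaming loop can be wrapped in a small helper that collects every output chunk. A sketch using only the `init_stream` / `stream_chunk` calls shown above; `stream_all` is a hypothetical name, and the handling of a `None` text chunk is an assumption about what the bundle returns when no transcript is requested:

```python
from typing import Any, Iterable, List, Tuple


def stream_all(
    bundle: Any, audio_chunks: Iterable[Any], **init_kwargs: Any
) -> Tuple[List[Any], List[Any]]:
    """Run the init_stream / stream_chunk loop and collect every output."""
    state = bundle.init_stream(**init_kwargs)
    waveforms: List[Any] = []
    texts: List[Any] = []
    for chunk in audio_chunks:
        waveform_chunk, text_chunk, state = bundle.stream_chunk(chunk, state)
        waveforms.append(waveform_chunk)
        if text_chunk is not None:  # assumption: None when no transcript is produced
            texts.append(text_chunk)
    return waveforms, texts
```

For voice conversion this would be called as `stream_all(voice_converter, audio_chunks, chunk_size=8, prompt_audio=target_speaker_audio, return_text=True)`. Joining the collected waveform chunks is left to the caller because it depends on the tensor type the bundle returns.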

	## Training data

	The released checkpoints were trained on publicly available English speech corpora (LibriHeavy and derivatives). Multilingual checkpoints are not yet available — contributions of multilingual training pipelines are welcome at the [project repository](https://github.com/samson-ailabs/SoviaMate).

	## Intended use

	- Research on neural audio codecs, speech LLMs, and end-to-end spoken dialogue systems.
	- Educational exploration of ASR-constrained codec training and zero-shot speaker adaptation.
	- Engineering experimentation as a building block for downstream speech-to-speech systems.

	## Out-of-scope / responsible-use note

	The post-quantization speaker adapter supports zero-shot voice cloning from a few seconds of reference audio. These weights must not be used for:
	- impersonation, fraud, or any form of non-consensual voice synthesis;
	- producing audio attributed to a real person without their explicit, informed consent;
	- deceptive, harassing, or otherwise harmful generation.

	Outputs may reflect biases in the training data. Users are responsible for compliance with applicable law and platform policies.

	## Limitations

	- English-only training data; performance on other languages is untested.
	- Preliminary checkpoint — comprehensive objective benchmarks (PESQ / ViSQOL / WER / SECS vs. EnCodec / SoundStream / DAC) have not yet been published.
	- Streaming inference is implemented (`init_stream` / `stream_chunk`) but has not yet been benchmarked end-to-end for production-grade latency or multi-session throughput.
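
Until official latency numbers exist, per-chunk latency can be estimated locally. A rough sketch, assuming the `init_stream` / `stream_chunk` interface shown earlier; `time_stream` is a hypothetical helper, and it measures wall-clock time only:

```python
import statistics
import time
from typing import Any, Dict, Iterable


def time_stream(
    bundle: Any, audio_chunks: Iterable[Any], **init_kwargs: Any
) -> Dict[str, float]:
    """Wall-clock latency of each stream_chunk call, in milliseconds."""
    state = bundle.init_stream(**init_kwargs)
    latencies_ms = []
    for chunk in audio_chunks:
        start = time.perf_counter()
        _, _, state = bundle.stream_chunk(chunk, state)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    if not latencies_ms:
        raise ValueError("no chunks were streamed")
    return {
        "chunks": float(len(latencies_ms)),
        "mean_ms": statistics.fmean(latencies_ms),
        "max_ms": max(latencies_ms),
    }
```

On a GPU, CUDA kernels launch asynchronously, so these timings are only meaningful if the device is synchronized before each timestamp is taken. The first few chunks usually include warm-up cost and are worth discarding.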

	## License

	Apache License 2.0 — see [LICENSE](https://github.com/samson-ailabs/SoviaMate/blob/main/LICENSE).

	The speaker-verification weights under `speaker_verification/` are redistributed for convenience from their original authors; please consult and respect the licenses of those individual upstream projects (CAM++, ERes2Net-v2, WavLM, ECAPA-TDNN) when using or redistributing them.

	## Citation

	A technical report is in preparation. For now, please cite:

	```bibtex
	@misc{soviamate2026,
	author = {Son Dang Dinh (Samson)},
	title = {SoviaMate: Toward End-to-End Spoken Dialogue Systems},
	year = {2026},
	howpublished = {\url{https://github.com/samson-ailabs/SoviaMate}},
	}
	```

	## Contact

	For research collaboration, dataset partnerships, or compute grants: samson.ailabs@gmail.com (subject line: `SoviaMate collaboration`). For code-level discussion, open an issue or discussion on the [GitHub repository](https://github.com/samson-ailabs/SoviaMate/issues).