	---
	license: bsd-3-clause
	library_name: pytorch
	tags:
	- voice-conversion
	- knn-vc
	- voice-protection
	- maestro
	- voiceshield
	---

	# Voice Shield · kNN-VC HiFi-GAN

	Speaker anonymization checkpoint used by the Voice Shield panel in
	[MAESTRO](https://github.com/AEmotionStudio/MAESTRO)'s AI Workstation
	(Tools → Voice Shield → Anonymize tab).

	## What this is

	The prematched HiFi-GAN generator from
	[bshall/knn-vc](https://github.com/bshall/knn-vc), repackaged as a
	single `.safetensors` file. It is trained to vocode 1024-dim WavLM-Large
	layer-6 features back to 16 kHz audio and pairs with
	`microsoft/wavlm-large`, which is loaded directly from the Hugging Face
	Hub, so no separate mirror is needed.

	- Architecture: HiFi-GAN with a `lin_pre` Linear(1024→512) prefix
	- Parameters: ~16.5 M
	- Input: WavLM-Large layer-6 hidden states `[B, T, 1024]`
	- Output: 16 kHz waveform `[B, samples]` (320× upsample)
	- License: BSD-3-Clause (original kNN-VC license preserved)
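
The shape bookkeeping implied by the list above can be sketched in a few lines of plain Python. The names `output_length`, `HOP`, `SAMPLE_RATE`, and `LIN_PRE_PARAMS` are illustrative and not part of MAESTRO or kNN-VC. The only facts used are the 320× upsample factor, the 16 kHz output rate, and the Linear(1024→512) prefix; the bias term in the parameter count is an assumption.

```python
HOP = 320             # waveform samples generated per feature frame (320x upsample)
SAMPLE_RATE = 16_000  # output sample rate in Hz

# Parameters in the lin_pre prefix: weight matrix plus bias.
# Assumption: the layer has a bias term.
LIN_PRE_PARAMS = 1024 * 512 + 512


def output_length(num_frames: int) -> tuple[int, float]:
    """Return (samples, seconds) produced by vocoding `num_frames` feature frames."""
    samples = num_frames * HOP
    return samples, samples / SAMPLE_RATE


# 16000 / 320 = 50 frames per second, so 50 frames give one second of audio.
print(output_length(50))  # (16000, 1.0)
```

Under these assumptions the prefix layer accounts for roughly 0.5 M of the ~16.5 M parameters.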

	## What it's used for in MAESTRO

	The Voice Shield panel offers voice-cloning protection against three
	threat models. The Anonymize tab uses this checkpoint to transform a
	user's voice into a different synthetic speaker via the kNN-VC pipeline:

	1. Extract WavLM-Large layer-6 features from the user's voice.
	2. For each frame, find the k=4 nearest matches in a "target speaker"
	feature pool and average them.
	3. Vocode the matched features back to audio with this HiFi-GAN.
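
Step 2 is the core of kNN-VC and can be sketched with NumPy alone. This is a minimal illustration, not MAESTRO's implementation: it assumes cosine similarity as the matching metric (as described in the kNN-VC paper), treats features as plain `[T, 1024]` arrays, and uses the hypothetical name `knn_match`. Feature extraction (step 1) and vocoding (step 3) are left out.

```python
import numpy as np


def knn_match(query: np.ndarray, pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each query frame with the mean of its k most similar pool frames.

    query: [T, D] source-speaker features
    pool:  [N, D] target-speaker features, with N >= k
    """
    # Normalize rows so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sim = q @ p.T  # [T, N]

    # Indices of the k most similar pool frames per query frame (unordered).
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # [T, k]

    # Average the original (unnormalized) pool frames.
    return pool[idx].mean(axis=1)  # [T, D]


# Random stand-ins for real WavLM features, only to show the shapes.
rng = np.random.default_rng(0)
source = rng.standard_normal((5, 1024))
target_pool = rng.standard_normal((200, 1024))
converted = knn_match(source, target_pool)
print(converted.shape)  # (5, 1024)
```

Every output frame is built only from target-speaker frames, which is why the vocoded result carries the target's identity rather than the user's.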

	The output sounds like the target speaker, so any voice-cloning model
	trained on the output learns the target's identity, not the user's.
	Because the protection does not depend on an added perturbation, there
	is nothing for adversarial-perturbation strippers like LightShed to
	remove; this is the only voice-protection paradigm currently robust
	against them.

	## Loading

	```python
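	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file

	# Download the checkpoint from the Hub (cached locally after the first call).
	path = hf_hub_download(
	    repo_id="AEmotionStudio/voiceshield-targets",
	    filename="hifigan_knnvc.safetensors",
	)
	# Load the tensors into a plain dict mapping parameter names to torch tensors.
	state = load_file(path)
	# State-dict keys match the bshall prematch_g_02500000.pt schema.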
	```

	MAESTRO loads this through the vendor module at
	`backend/ai/voiceshield/knn_vc_vendor/hifigan.py`.
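
A quick sanity check on a loaded state dict might look like the sketch below. `summarize_state` is a hypothetical helper, not part of MAESTRO. It relies only on each value exposing a `.shape`, as torch tensors do. The key name `lin_pre.weight` in the comment is an assumption based on the `lin_pre` layer described above.

```python
from math import prod


def summarize_state(state):
    """Return (total parameter count, {name: shape}) for a state dict."""
    shapes = {name: tuple(tensor.shape) for name, tensor in state.items()}
    total = sum(prod(shape) for shape in shapes.values())
    return total, shapes


# Usage with the `state` dict returned by load_file:
#   total, shapes = summarize_state(state)
#   print(f"{total / 1e6:.1f} M parameters")
#   print(shapes.get("lin_pre.weight"))  # assumed key name
```

If the total is far from the ~16.5 M stated above, the file is probably not the checkpoint described in this card.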

	## Credits

	- Original kNN-VC: Matthew Baas, Benjamin van Niekerk, Herman Kamper —
	["Voice Conversion With Just Nearest Neighbors"](https://arxiv.org/abs/2305.18975), Interspeech 2023.
	Code: <https://github.com/bshall/knn-vc>.
	- WavLM-Large: Microsoft (<https://huggingface.co/microsoft/wavlm-large>).

	## Honest framing

	Anonymization is a one-way transform. The original speaker identity is
	unrecoverable from the output — that's the point, and the trade-off the
	user opts into.