---
license: bsd-3-clause
library_name: pytorch
tags:
- voice-conversion
- knn-vc
- voice-protection
- maestro
- voiceshield
---
# Voice Shield · kNN-VC HiFi-GAN
Speaker anonymization checkpoint used by the **Voice Shield** panel in
[MAESTRO](https://github.com/AEmotionStudio/MAESTRO)'s AI Workstation
(Tools → Voice Shield → Anonymize tab).
## What this is
This repository contains the **prematched HiFi-GAN generator** from
[bshall/knn-vc](https://github.com/bshall/knn-vc), repackaged as a
single `.safetensors` file. It was trained to vocode 1024-dim WavLM-Large
layer-6 features back to 16 kHz audio. It pairs with `microsoft/wavlm-large`,
which runs upstream of the vocoder and is fetched directly from the
Hugging Face Hub, so no separate mirror is needed.
- Architecture: HiFi-GAN with a `lin_pre` Linear(1024→512) prefix
- Parameters: ~16.5 M
- Input: WavLM-Large layer-6 hidden states `[B, T, 1024]`
- Output: 16 kHz waveform `[B, samples]` (320× upsample)
- License: BSD-3-Clause (original kNN-VC license preserved)
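
The 320× upsample factor follows from the frame rate: WavLM emits 50 feature frames per second of 16 kHz audio, and 16000 / 50 = 320. A minimal sketch of that shape arithmetic follows; the constant and function names are illustrative and are not part of MAESTRO or kNN-VC.

```python
SAMPLE_RATE_HZ = 16_000   # output sample rate stated in the list above
UPSAMPLE_FACTOR = 320     # waveform samples generated per feature frame

# Feature frames per second of audio implied by the two constants.
FRAMES_PER_SECOND = SAMPLE_RATE_HZ / UPSAMPLE_FACTOR


def output_samples(num_frames: int) -> int:
    """Number of waveform samples produced for `num_frames` feature frames."""
    return num_frames * UPSAMPLE_FACTOR


def output_seconds(num_frames: int) -> float:
    """Duration in seconds of the audio vocoded from `num_frames` frames."""
    return output_samples(num_frames) / SAMPLE_RATE_HZ
```

For example, a `[1, 50, 1024]` feature tensor would map to a `[1, 16000]` waveform, which is one second of audio.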
## What it's used for in MAESTRO
The Voice Shield panel offers **voice-cloning protection** with three
threat models. The Anonymize tab uses this checkpoint to transform a
user's voice into a different synthetic speaker via the kNN-VC pipeline:
1. Extract WavLM-Large layer-6 features from the user's voice.
2. For each frame, find the k=4 nearest matches in a "target speaker"
feature pool and average them.
3. Vocode the matched features back to audio with this HiFi-GAN.
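
Step 2 can be sketched in a few lines. This is a simplified illustration, not MAESTRO's implementation: it uses NumPy rather than torch, assumes cosine similarity as the distance measure, and `knn_match` is a hypothetical helper name.

```python
import numpy as np


def knn_match(query: np.ndarray, pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each query frame with the mean of its k nearest pool frames.

    query: [T, D] features from the user's voice.
    pool:  [N, D] features from the target speaker (N >= k).
    Returns a [T, D] array of matched features.
    """
    # Normalize rows so the dot product equals cosine similarity (assumed metric).
    q = query / np.maximum(np.linalg.norm(query, axis=1, keepdims=True), 1e-8)
    p = pool / np.maximum(np.linalg.norm(pool, axis=1, keepdims=True), 1e-8)
    sim = q @ p.T  # [T, N] similarity of every query frame to every pool frame

    # Indices of the k most similar pool frames per query frame (unordered).
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # [T, k]

    # Gather the original (unnormalized) pool frames and average them.
    return pool[idx].mean(axis=1)  # [T, D]


# Toy usage with random data standing in for real WavLM features.
rng = np.random.default_rng(0)
user_feats = rng.normal(size=(5, 1024))      # 5 frames of the user's voice
target_pool = rng.normal(size=(200, 1024))   # 200 frames of the target speaker
matched = knn_match(user_feats, target_pool, k=4)
```

Because every output frame is an average of target-speaker frames, the vocoder only ever sees features drawn from the target pool.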
The output sounds like the target speaker — so any voice-cloning model
trained on the output learns the target's identity, not the user's.
This is the **only** voice-protection paradigm currently robust against
adversarial-perturbation strippers like LightShed.
## Loading
```python
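from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the checkpoint from the Hub (cached locally after the first call).
path = hf_hub_download(
    repo_id="AEmotionStudio/voiceshield-targets",
    filename="hifigan_knnvc.safetensors",
)
# Load the weights as a dict mapping parameter names to torch tensors.
state = load_file(path)
# state-dict keys match the bshall prematch_g_02500000.pt schema.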
```
MAESTRO loads this through the vendor module at
`backend/ai/voiceshield/knn_vc_vendor/hifigan.py`.
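
A quick sanity check after loading is to count the elements in the state dict and compare the total with the ~16.5 M figure listed earlier. The helper below is a hypothetical sketch that relies only on each tensor exposing a `.shape`. The total may differ somewhat from the nominal count if the checkpoint stores weight-normalization parameters separately.

```python
import math


def count_parameters(state) -> int:
    """Sum the element counts of every tensor in a state dict."""
    return sum(math.prod(tensor.shape) for tensor in state.values())


# Usage, assuming `state` came from load_file(path) as in the Loading snippet:
# total = count_parameters(state)
# print(f"{total / 1e6:.1f} M parameters")
```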
## Credits
- Original kNN-VC: Matthew Baas, Benjamin van Niekerk, Herman Kamper —
["Voice Conversion With Just Nearest Neighbors"](https://arxiv.org/abs/2305.18975), Interspeech 2023.
Code: <https://github.com/bshall/knn-vc>.
- WavLM-Large: Microsoft (<https://huggingface.co/microsoft/wavlm-large>).
## Honest framing
Anonymization is a one-way transform. The original speaker identity is
unrecoverable from the output — that's the point, and the trade-off the
user opts into.