---
license: bsd-3-clause
library_name: pytorch
tags:
- voice-conversion
- knn-vc
- voice-protection
- maestro
- voiceshield
---

# Voice Shield · kNN-VC HiFi-GAN

Speaker anonymization checkpoint used by the **Voice Shield** panel in
[MAESTRO](https://github.com/AEmotionStudio/MAESTRO)'s AI Workstation
(Tools → Voice Shield → Anonymize tab).

## What this is

The **prematched HiFi-GAN generator** from
[bshall/knn-vc](https://github.com/bshall/knn-vc), repackaged as a
single `.safetensors` file. It is trained to vocode 1024-dim WavLM-Large
layer-6 features back to 16 kHz audio. It pairs with `microsoft/wavlm-large`,
which runs upstream of this vocoder and is pulled directly from the
Hugging Face Hub, so no separate mirror is needed.

- Architecture: HiFi-GAN with a `lin_pre` Linear(1024→512) prefix
- Parameters: ~16.5 M
- Input: WavLM-Large layer-6 hidden states `[B, T, 1024]`
- Output: 16 kHz waveform `[B, samples]` (320× upsample: 320 samples, or 20 ms, per input frame)
- License: BSD-3-Clause (original kNN-VC license preserved)

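The input and output shapes listed above imply a fixed relationship between feature frames and audio length. A minimal sketch of that arithmetic, assuming the 320× factor applies exactly per frame; the function names are hypothetical and not part of MAESTRO or kNN-VC:

```python
SAMPLE_RATE_HZ = 16_000  # output sample rate stated above
UPSAMPLE_FACTOR = 320    # waveform samples per feature frame, stated above


def samples_for_frames(num_frames: int) -> int:
    """Waveform samples produced for `num_frames` feature frames."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return num_frames * UPSAMPLE_FACTOR


def duration_seconds(num_frames: int) -> float:
    """Audio duration covered by `num_frames` feature frames."""
    return samples_for_frames(num_frames) / SAMPLE_RATE_HZ


# Each frame covers 320 / 16000 s = 20 ms, so 50 frames is one second of audio.
print(samples_for_frames(50))  # 16000
print(duration_seconds(50))    # 1.0
```
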
## What it's used for in MAESTRO

The Voice Shield panel offers **voice-cloning protection** with three
threat models. The Anonymize tab uses this checkpoint to transform a
user's voice into a different synthetic speaker via the kNN-VC pipeline:

1. Extract WavLM-Large layer-6 features from the user's voice.
2. For each frame, find the k=4 nearest matches in a "target speaker"
   feature pool and average them.
3. Vocode the matched features back to audio with this HiFi-GAN.

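Step 2 above can be sketched in plain Python. This is an illustration only, assuming cosine similarity as the matching measure (as in the kNN-VC paper) and unweighted averaging; `knn_match` and `cosine_similarity` are hypothetical names, and a real pipeline would do this with batched tensor operations on 1024-dim features:

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_a * norm_b)


def knn_match(query: Sequence[Vector], pool: Sequence[Vector], k: int = 4) -> List[List[float]]:
    """Replace each query frame with the mean of its k most similar pool frames."""
    if not pool:
        raise ValueError("target pool is empty")
    k = min(k, len(pool))  # cannot pick more neighbours than the pool holds
    matched = []
    for frame in query:
        # Rank pool frames by similarity to this query frame, highest first.
        ranked = sorted(pool, key=lambda p: cosine_similarity(frame, p), reverse=True)
        top = ranked[:k]
        matched.append([sum(p[d] for p in top) / k for d in range(len(frame))])
    return matched


# Toy 2-dim features standing in for the real 1024-dim ones.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [[1.0, 0.05]]
print(knn_match(query, pool, k=2))
```

Because every output frame is built only from target-pool frames, the matched sequence carries the target speaker's characteristics while following the timing of the query.
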
The output sounds like the target speaker, so any voice-cloning model
trained on the output learns the target's identity, not the user's.
This is the **only** voice-protection paradigm currently robust against
adversarial-perturbation strippers like LightShed: the protection comes
from replacing the speaker identity, so there is no added perturbation
to strip.

## Loading

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the checkpoint (cached locally after the first call).
path = hf_hub_download(
    repo_id="AEmotionStudio/voiceshield-targets",
    filename="hifigan_knnvc.safetensors",
)
# Load the weights as a dict mapping tensor names to torch.Tensor objects.
state = load_file(path)
# State-dict keys match the bshall prematch_g_02500000.pt schema.
```

MAESTRO loads this through the vendor module at
`backend/ai/voiceshield/knn_vc_vendor/hifigan.py`.

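After loading, a quick sanity check on tensor shapes can catch a wrong or truncated file. The sketch below is hypothetical: the helper names are illustrative, and the key `lin_pre.weight` is an assumption derived from the `lin_pre` layer named earlier. It works on plain shape tuples so it runs without downloading anything:

```python
import math
from typing import Mapping, Sequence


def count_parameters(shapes: Mapping[str, Sequence[int]]) -> int:
    """Total number of scalar values across all tensors, given their shapes."""
    return sum(math.prod(shape) for shape in shapes.values())


def check_lin_pre(shapes: Mapping[str, Sequence[int]], key: str = "lin_pre.weight") -> bool:
    """True if the prefix projection has the Linear(1024->512) weight shape.

    torch.nn.Linear stores its weight as (out_features, in_features).
    The default key name is an assumption, not confirmed by the checkpoint.
    """
    return tuple(shapes.get(key, ())) == (512, 1024)


# With a real checkpoint: shapes = {name: tuple(t.shape) for name, t in state.items()}
# Stand-in shapes so the sketch runs offline:
shapes = {"lin_pre.weight": (512, 1024), "lin_pre.bias": (512,)}
print(count_parameters(shapes))  # 524800
print(check_lin_pre(shapes))     # True
```

On the full state dict, the total may not match the ~16.5 M figure exactly, depending on how the checkpoint stores its tensors.
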
## Credits

- Original kNN-VC: Matthew Baas, Benjamin van Niekerk, Herman Kamper,
  ["Voice Conversion With Just Nearest Neighbors"](https://arxiv.org/abs/2305.18975), Interspeech 2023.
  Code: <https://github.com/bshall/knn-vc>.
- WavLM-Large: Microsoft (<https://huggingface.co/microsoft/wavlm-large>).

## Honest framing

Anonymization is a one-way transform. The original speaker identity is
unrecoverable from the output. That is the point, and it is the trade-off
the user opts into.