---
license: mit
tags:
- audio
- voice-conversion
- rvc
- safetensors
- maestraea
pipeline_tag: audio-to-audio
---

# RVC Inference Models (Safetensors)

**Retrieval-Based Voice Conversion: V2 Pretrained Models**

[Original Source](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) · MIT License

> V2 pretrained models converted from `.pth` to safetensors format (except HuBERT, which stays as `.pt` because it requires fairseq for deserialization). For use with [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea).

## What's in This Repo

### Core Models

| File | Size | Description |
|------|------|-------------|
| `hubert_base.pt` | 190 MB | HuBERT feature extractor (kept as `.pt`; requires fairseq) |
| `rmvpe.safetensors` | 181 MB | RMVPE pitch detection model |

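To check what a converted file contains without loading any weights, the safetensors header can be read with the standard library alone. The format starts with an 8-byte little-endian length, followed by a JSON header of that size. The function names `read_safetensors_header` and `summarize` are illustrative and not part of this repo.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict.

    Only the header is read; tensor data is never loaded into memory.
    """
    with Path(path).open("rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Map each tensor name to a (dtype, shape) tuple."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
    }
```

For example, `summarize(read_safetensors_header("rmvpe.safetensors"))` lists every tensor with its dtype and shape. Loading the actual weights would normally go through the `safetensors` library instead.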
### Pretrained V2: Generator Models (Inference)

| File | Size | Sample Rate |
|------|------|-------------|
| `pretrained_v2/G32k.safetensors` | 74 MB | 32 kHz |
| `pretrained_v2/G40k.safetensors` | 73 MB | 40 kHz |
| `pretrained_v2/G48k.safetensors` | 75 MB | 48 kHz |
| `pretrained_v2/f0G32k.safetensors` | 74 MB | 32 kHz (with F0) |
| `pretrained_v2/f0G40k.safetensors` | 73 MB | 40 kHz (with F0) |
| `pretrained_v2/f0G48k.safetensors` | 75 MB | 48 kHz (with F0) |

### Pretrained V2: Discriminator Models (Training)

| File | Size | Sample Rate |
|------|------|-------------|
| `pretrained_v2/D32k.safetensors` | 143 MB | 32 kHz |
| `pretrained_v2/D40k.safetensors` | 143 MB | 40 kHz |
| `pretrained_v2/D48k.safetensors` | 143 MB | 48 kHz |
| `pretrained_v2/f0D32k.safetensors` | 143 MB | 32 kHz (with F0) |
| `pretrained_v2/f0D40k.safetensors` | 143 MB | 40 kHz (with F0) |
| `pretrained_v2/f0D48k.safetensors` | 143 MB | 48 kHz (with F0) |

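The file names in the generator and discriminator tables follow one pattern: an optional `f0` prefix, then `G` or `D`, then the sample rate in kHz. A small helper can build the repo-relative path from those three choices. `pretrained_path` is a hypothetical name used here for illustration only.

```python
VALID_RATES_KHZ = (32, 40, 48)


def pretrained_path(sample_rate_khz, with_f0=True, kind="G"):
    """Build the repo-relative path of a V2 pretrained file.

    kind is "G" (generator) or "D" (discriminator).
    """
    if sample_rate_khz not in VALID_RATES_KHZ:
        raise ValueError(f"sample rate must be one of {VALID_RATES_KHZ} kHz")
    if kind not in ("G", "D"):
        raise ValueError('kind must be "G" or "D"')
    prefix = "f0" if with_f0 else ""
    return f"pretrained_v2/{prefix}{kind}{sample_rate_khz}k.safetensors"
```

Calling `pretrained_path(40)` yields `pretrained_v2/f0G40k.safetensors`, and `pretrained_path(48, with_f0=False, kind="D")` yields `pretrained_v2/D48k.safetensors`.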
**Total: ~1.7 GB** (a subset of the full 80 GB RVC repo: the inference files plus the V2 discriminators used for training)

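The total follows directly from the sizes listed in the tables. The short calculation below adds them up, assuming decimal units (1 GB = 1000 MB) for the rounding.

```python
# Sizes in MB, copied from the tables in this README.
core = {"hubert_base.pt": 190, "rmvpe.safetensors": 181}
generators = {
    "G32k": 74, "G40k": 73, "G48k": 75,
    "f0G32k": 74, "f0G40k": 73, "f0G48k": 75,
}
discriminators = {
    name: 143
    for name in ("D32k", "D40k", "D48k", "f0D32k", "f0D40k", "f0D48k")
}

# Everything in the repo versus the files that exclude the discriminators.
total_mb = sum(core.values()) + sum(generators.values()) + sum(discriminators.values())
without_d_mb = sum(core.values()) + sum(generators.values())

print(f"all files: {total_mb} MB (~{total_mb / 1000:.1f} GB)")
print(f"without discriminators: {without_d_mb} MB")
```

The sum is 1673 MB, which rounds to the stated ~1.7 GB. The core models and generators alone come to 815 MB.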
## What RVC Does

RVC (Retrieval-based Voice Conversion) converts vocals from one voice to another:

- **Batch mode**: upload audio → convert → download the result
- **Real-time mode**: low-latency WebSocket streaming (planned)

Voice models are small (~50–100 MB each) and user-provided.

### Key Parameters

| Parameter | Range | Default | Description |
|-----------|-------|---------|-------------|
| `pitch_shift` | -12 to 12 | 0 | Pitch shift in semitones |
| `f0_method` | `rmvpe`, `crepe`, `harvest` | `rmvpe` | Pitch detection method |
| `index_rate` | 0–1 | 0.75 | Retrieval index strength |
| `protect` | 0–0.5 | 0.33 | Protect voiceless consonants |

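The ranges and defaults in the table can be captured in a small validated container. This is a sketch only: the class `ConversionParams` and its `pitch_ratio` property are hypothetical and are not the Mæstræa API. The ratio uses the standard equal-temperament relation, where a shift of n semitones scales frequency by 2^(n/12).

```python
from dataclasses import dataclass

F0_METHODS = ("rmvpe", "crepe", "harvest")


@dataclass(frozen=True)
class ConversionParams:
    """Hypothetical container mirroring the parameter table."""

    pitch_shift: int = 0        # semitones, -12 to 12
    f0_method: str = "rmvpe"    # pitch detection method
    index_rate: float = 0.75    # retrieval index strength, 0 to 1
    protect: float = 0.33       # voiceless consonant protection, 0 to 0.5

    def __post_init__(self):
        # Reject values outside the documented ranges.
        if not -12 <= self.pitch_shift <= 12:
            raise ValueError("pitch_shift must be between -12 and 12")
        if self.f0_method not in F0_METHODS:
            raise ValueError(f"f0_method must be one of {F0_METHODS}")
        if not 0.0 <= self.index_rate <= 1.0:
            raise ValueError("index_rate must be between 0 and 1")
        if not 0.0 <= self.protect <= 0.5:
            raise ValueError("protect must be between 0 and 0.5")

    @property
    def pitch_ratio(self):
        """Frequency multiplier for the semitone shift: 2 ** (n / 12)."""
        return 2.0 ** (self.pitch_shift / 12)
```

With the defaults, `ConversionParams().pitch_ratio` is 1.0 (no shift). A `pitch_shift` of 12 doubles the frequency (one octave up), and -12 halves it.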
### VRAM Requirements

- **Minimum**: ~2 GB
- **Recommended**: ~6 GB

## Usage with Mæstræa

Place the files from this repo in `~/.maestraea/models/rvc/`. Voice model files (`.pth` + `.index`) go in `~/.maestraea/models/rvc/voices/`.

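A quick way to confirm the layout is to check for the expected files on disk. The sketch below checks only the two core models and the `voices/` folder named in this README. `missing_files` and `list_voices` are hypothetical helpers, not part of Mæstræa.

```python
from pathlib import Path

# Directory named in this README; adjust if your install differs.
RVC_ROOT = Path.home() / ".maestraea" / "models" / "rvc"

# Core model files listed in the "Core Models" table.
CORE_FILES = ("hubert_base.pt", "rmvpe.safetensors")


def missing_files(root=RVC_ROOT, expected=CORE_FILES):
    """Return the expected file names that are not present under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]


def list_voices(root=RVC_ROOT):
    """Return sorted names of voice models (.pth files) found in voices/."""
    voices_dir = Path(root) / "voices"
    if not voices_dir.is_dir():
        return []
    return sorted(p.stem for p in voices_dir.glob("*.pth"))
```

An empty list from `missing_files()` means both core models are in place, and `list_voices()` shows which user-provided voices were found.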
## License

MIT, the same license as the original RVC release.

## Credits

- **Model**: [RVC-Project](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI)
- **Original weights**: [lj1995/VoiceConversionWebUI](https://huggingface.co/lj1995/VoiceConversionWebUI)
- **Conversion & Mirror by**: [AEmotionStudio](https://huggingface.co/AEmotionStudio)