---
license: mit
tags:
- audio
- voice-conversion
- singing-voice
- speech-synthesis
- text-to-speech
- vevo2
- amphion
- safetensors
- maestraea
pipeline_tag: audio-to-audio
base_model: RMSnow/Vevo2
---
# Vevo2 Models (Mæstræa Mirror)
**Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control**
[Original Weights](https://huggingface.co/RMSnow/Vevo2) by [RMSnow](https://huggingface.co/RMSnow) · [Source Code](https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevo2) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License
> Mirror of the inference-only files from `RMSnow/Vevo2`, packaged for use with the [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`, the `_text` FM variant, and the `pretrained` AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.
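
As a rough illustration of the filtering described above, the sketch below separates inference files from training artifacts by file name. The pattern list is copied from this card; the helper names (`is_training_artifact`, `inference_files`, `download_inference_only`) are hypothetical and not part of Mæstræa or Amphion. The `_text` FM variant and the `pretrained` AR baseline are not filtered here because their exact paths are not listed in this card. The optional download helper assumes `huggingface_hub` is installed and pulls from upstream `RMSnow/Vevo2`, not from this mirror.

```python
from fnmatch import fnmatch

# File-name patterns for the training artifacts this card lists as dropped.
TRAINING_ARTIFACT_PATTERNS = [
    "optimizer.pt",
    "scheduler.pt",
    "rng_state_*.pth",
    "trainer_state.json",
    "training_args.bin",
]


def is_training_artifact(path: str) -> bool:
    """Return True if the final path component matches a dropped pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in TRAINING_ARTIFACT_PATTERNS)


def inference_files(paths):
    """Keep only the paths that are not training artifacts, in input order."""
    return [p for p in paths if not is_training_artifact(p)]


def download_inference_only(local_dir: str) -> str:
    """Optional helper (assumption: huggingface_hub is installed, network is available).

    Downloads upstream RMSnow/Vevo2 while skipping the artifact patterns above.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="RMSnow/Vevo2",
        local_dir=local_dir,
        # Prefix with "*" so the pattern matches at any directory depth.
        ignore_patterns=[f"*{p}" for p in TRAINING_ARTIFACT_PATTERNS],
    )
```

Matching on the file name alone keeps the filter independent of where a checkpoint directory sits in the repo tree.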
## What's in This Repo
| Path | Description | Size |
|------|-------------|------|
| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std used to normalize Whisper features | ~12 KB |
| `vocoder/model*.safetensors` | Vocos vocoder (~250M, sharded) | ~1.2 GB |
| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |

**Total: ~4 GB**
> Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is **not** mirrored here — `openai-whisper` will pull it to `~/.cache/whisper` on first run.
## What Vevo2 Does
Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:
- **Convert** — voice/timbre conversion, FM-only (fastest)
- **TTS** — zero-shot text-to-speech / text-to-singing from a short reference clip
- **Edit** — rewrite words while preserving voice, melody, prosody, and style
- **Style** — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
- **Melody** — sing target lyrics over a humming, whistled, or instrumental melody
- **SVC** — full singing voice conversion via the AR + FM pipeline (slower than Convert, which skips the AR stage)
### Architecture
- **AR Model** (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
- **Flow-Matching Transformer** (~350M) — acoustic generation
- **Vocos Vocoder** (~250M) — high-quality 24 kHz waveform synthesis
- **Content-style + Prosody Tokenizers** — FVQ codecs over Whisper / chromagram features
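
To give a feel for the tokenizer rates, here is a small back-of-the-envelope sketch. It assumes that `fvq16384` and `fvq512` in the checkpoint names denote codebook sizes, that each tokenizer emits one token per frame, and that partial frames are dropped; the real framing in Amphion may differ. The function names are hypothetical.

```python
import math


def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Approximate token count for a clip, dropping any partial frame."""
    return math.floor(duration_s * frame_rate_hz)


def bits_per_token(codebook_size: int) -> float:
    """Bits needed to index one entry of a codebook of the given size."""
    return math.log2(codebook_size)


def bitrate_bps(frame_rate_hz: float, codebook_size: int) -> float:
    """Nominal bitrate of a single-codebook token stream."""
    return frame_rate_hz * bits_per_token(codebook_size)


# A 30 s reference clip under the assumptions stated above:
content_tokens = tokens_for(30, 12.5)   # 375
prosody_tokens = tokens_for(30, 6.25)   # 187
```

Under these assumptions the content-style stream carries 14 bits per token at 12.5 Hz and the prosody stream 9 bits per token at 6.25 Hz.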
### VRAM Requirements
| Reference Length | VRAM (GPU, FP16) |
|------------------|------------------|
| 15 s | ~6 GB |
| 30 s | ~10 GB |
| 45 s | ~12 GB |

Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity, but VRAM use grows with reference length, roughly doubling from 15 s to 45 s in the table above.
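
For reference lengths that fall between the table rows, a rough estimate can be sketched with linear interpolation. This is an assumption: the card only gives three measured points, and actual usage also depends on the task and the target length. The helper name `estimate_vram_gb` is hypothetical.

```python
# (reference length in seconds, approximate VRAM in GB) copied from the table above.
VRAM_TABLE = [(15, 6.0), (30, 10.0), (45, 12.0)]


def estimate_vram_gb(ref_seconds: float) -> float:
    """Linearly interpolate VRAM between the table rows (rough estimate only)."""
    lo, hi = VRAM_TABLE[0][0], VRAM_TABLE[-1][0]
    if not lo <= ref_seconds <= hi:
        # Refuse to extrapolate beyond the documented range.
        raise ValueError(f"reference length must be within {lo}-{hi} s")
    for (s0, g0), (s1, g1) in zip(VRAM_TABLE, VRAM_TABLE[1:]):
        if s0 <= ref_seconds <= s1:
            return g0 + (g1 - g0) * (ref_seconds - s0) / (s1 - s0)
    raise AssertionError("unreachable: table rows cover the whole range")
```

The function raises instead of extrapolating, since nothing in the table supports estimates below 15 s or above 45 s.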
## Usage in Mæstræa
The Mæstræa runner clones [open-mmlab/Amphion](https://github.com/open-mmlab/Amphion) to `~/.maestraea/libs/amphion/` on first model load and imports the pipeline from `models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline`. The download manager pulls these weights to `~/.maestraea/models/vevo2/` and the runner resolves checkpoints from that directory at the paths shown in the table above.
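
If a model load fails, it can help to confirm that the expected files are actually on disk. The sketch below checks a local directory against the checkpoint paths from the table above. It is not part of the Mæstræa runner; the function name `missing_checkpoints` is hypothetical, and configs and text-tokenizer files are left out for brevity.

```python
from pathlib import Path

# Relative checkpoint paths (glob patterns) copied from the table in this card.
REQUIRED_CHECKPOINTS = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "vocoder/model*.safetensors",  # sharded, so match any shard name
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]


def missing_checkpoints(model_dir: str = "~/.maestraea/models/vevo2") -> list:
    """Return the required patterns that match no file under model_dir."""
    root = Path(model_dir).expanduser()
    return [pattern for pattern in REQUIRED_CHECKPOINTS if not any(root.glob(pattern))]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints present" if not missing else f"missing: {missing}")
```

An empty list means every pattern matched at least one file; it does not verify file sizes or hashes.
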
For the standalone Amphion path, see the [upstream Vevo2 README](https://github.com/open-mmlab/Amphion/blob/main/models/svc/vevo2/README.md).
## License
MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.