| --- |
| license: mit |
| tags: |
| - audio |
| - voice-conversion |
| - singing-voice |
| - speech-synthesis |
| - text-to-speech |
| - vevo2 |
| - amphion |
| - safetensors |
| - maestraea |
| pipeline_tag: audio-to-audio |
| base_model: RMSnow/Vevo2 |
| --- |
| |
| # Vevo2 Models (Mæstræa Mirror) |
|
|
| **Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control** |
|
|
| [Original Weights](https://huggingface.co/RMSnow/Vevo2) by [RMSnow](https://huggingface.co/RMSnow) · [Source Code](https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevo2) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License |
|
|
| > Mirror of the inference-only files from `RMSnow/Vevo2`, packaged for use with the [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`) are dropped to keep the download lean, along with two extra checkpoints (the `_text` FM variant and the `pretrained` AR baseline). All credit for the model itself goes to the upstream authors. |
| |
| ## What's in This Repo |
| |
| | Path | Description | Size | |
| |------|-------------|------| |
| | `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB | |
| | `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB | |
| | `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std used to normalize Whisper features | ~12 KB | |
| | `vocoder/model*.safetensors` | Vocos vocoder (~250M, sharded) | ~1.2 GB | |
| | `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB | |
| | `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB | |
| | `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB | |
| | `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small | |
|
|
| **Total: ~4 GB** |
|
|
| > Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is **not** mirrored here — `openai-whisper` will pull it to `~/.cache/whisper` on first run. |
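
As a rough check on disk usage, the approximate sizes listed in the table can be summed, with Whisper-medium added on top. This is only a sketch: the figures are the rounded values from the table, 1 GB is taken as 1000 MB, and the small per-component configs are ignored.

```python
# Approximate sizes from the table above, in megabytes (1 GB taken as 1000 MB).
MIRROR_SIZES_MB = {
    "AR transformer": 970,
    "Flow-Matching transformer": 1400,
    "whisper_stats.pt": 0.012,
    "Vocos vocoder (all shards)": 1200,
    "Content-style tokenizer": 234,
    "Prosody tokenizer": 261,
    "AR text tokenizer + configs": 22,
}

# Not part of this repo: fetched separately into ~/.cache/whisper on first run.
WHISPER_MEDIUM_MB = 1500

mirror_gb = sum(MIRROR_SIZES_MB.values()) / 1000
total_with_whisper_gb = mirror_gb + WHISPER_MEDIUM_MB / 1000

print(f"Mirror only:          ~{mirror_gb:.1f} GB")
print(f"Mirror + Whisper:     ~{total_with_whisper_gb:.1f} GB")
```

The second figure is the one to budget for on a fresh machine, since the Whisper download lands outside the model directory.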
|
|
| ## What Vevo2 Does |
|
|
| Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline: |
|
|
| - **Convert** — voice/timbre conversion, FM-only (fastest) |
| - **TTS** — zero-shot text-to-speech / text-to-singing from a short reference clip |
| - **Edit** — rewrite words while preserving voice, melody, prosody, and style |
| - **Style** — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody |
| - **Melody** — sing target lyrics over a humming, whistled, or instrumental melody |
| - **SVC** — full singing voice conversion via the complete AR + FM pipeline (Convert uses FM only) |
|
|
| ### Architecture |
|
|
| - **AR Model** (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling |
| - **Flow-Matching Transformer** (~350M) — acoustic generation |
| - **Vocos Vocoder** (~250M) — high-quality 24 kHz waveform synthesis |
| - **Content-style + Prosody Tokenizers** — FVQ codecs over Whisper / chromagram features |
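
To give a feel for the sequence lengths involved, the sketch below converts a clip duration into approximate token and sample counts. It assumes the `12.5hz` and `6.25hz` suffixes in the tokenizer names are token frame rates and that the vocoder runs at the 24 kHz stated above. The function name `approx_lengths` is hypothetical, and the real pipeline may pad or round frames differently.

```python
import math

# Assumed frame rates, read from the tokenizer directory names.
CONTENT_STYLE_HZ = 12.5
PROSODY_HZ = 6.25
# Vocoder output sample rate from the architecture list.
VOCODER_SAMPLE_RATE = 24_000


def approx_lengths(seconds: float) -> dict[str, int]:
    """Return rough token and sample counts for a clip of the given duration."""
    return {
        "content_style_tokens": math.floor(seconds * CONTENT_STYLE_HZ),
        "prosody_tokens": math.floor(seconds * PROSODY_HZ),
        "waveform_samples": math.floor(seconds * VOCODER_SAMPLE_RATE),
    }


print(approx_lengths(16))
```

Under these assumptions the prosody stream is half as long as the content-style stream for any clip.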
|
|
| ### VRAM Requirements |
|
|
| | Reference Length | VRAM (GPU, FP16) | |
| |------------------|------------------| |
| | 15 s | ~6 GB | |
| | 30 s | ~10 GB | |
| | 45 s | ~12 GB | |
|
|
| Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity, but VRAM use roughly doubles across that range (~6 GB at 15 s, ~12 GB at 45 s). |
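
For reference lengths between the rows of the table, a rough estimate can be obtained by linear interpolation. This is a sketch under a strong assumption: the table gives only three approximate points, and actual usage may not scale linearly. The helper `estimate_vram_gb` is hypothetical and refuses to extrapolate outside the 15–45 s range.

```python
from bisect import bisect_right

# (reference length in seconds, approximate VRAM in GB) from the table above.
VRAM_TABLE = [(15, 6.0), (30, 10.0), (45, 12.0)]


def estimate_vram_gb(ref_seconds: float) -> float:
    """Linearly interpolate VRAM between the table rows; no extrapolation."""
    lengths = [s for s, _ in VRAM_TABLE]
    if not lengths[0] <= ref_seconds <= lengths[-1]:
        raise ValueError("reference length outside the 15-45 s range of the table")
    if ref_seconds == lengths[-1]:
        return VRAM_TABLE[-1][1]
    # Index of the first table row strictly longer than ref_seconds.
    i = bisect_right(lengths, ref_seconds)
    (x0, y0), (x1, y1) = VRAM_TABLE[i - 1], VRAM_TABLE[i]
    return y0 + (y1 - y0) * (ref_seconds - x0) / (x1 - x0)


print(f"{estimate_vram_gb(22.5):.1f} GB")
```

Treat the result as a planning figure only; leave headroom for other processes on the GPU.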
|
|
| ## Usage in Mæstræa |
|
|
| The Mæstræa runner clones [open-mmlab/Amphion](https://github.com/open-mmlab/Amphion) to `~/.maestraea/libs/amphion/` on first model load and imports the pipeline from `models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline`. The download manager pulls these weights to `~/.maestraea/models/vevo2/` and the runner resolves checkpoints from that directory at the paths shown in the table above. |
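
If a model load fails, it can help to confirm that the expected files are present under the download directory. The sketch below checks the weight paths from the table against `~/.maestraea/models/vevo2/`. It is not the Mæstræa runner's own logic; `missing_checkpoints` is a hypothetical helper, and config files are not checked.

```python
from pathlib import Path

# Weight files listed in the table above, relative to the model directory.
EXPECTED_FILES = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]
# The vocoder is sharded, so accept any file matching the table's pattern.
VOCODER_GLOB = "vocoder/model*.safetensors"


def missing_checkpoints(root: Path) -> list[str]:
    """Return the expected paths (or patterns) that are absent under root."""
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    if not any(root.glob(VOCODER_GLOB)):
        missing.append(VOCODER_GLOB)
    return missing


if __name__ == "__main__":
    model_dir = Path.home() / ".maestraea" / "models" / "vevo2"
    missing = missing_checkpoints(model_dir)
    if missing:
        print(f"Missing under {model_dir}:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print(f"All expected weight files found under {model_dir}")
```

An empty result only shows that the files exist, not that they downloaded completely.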
|
|
| For the standalone Amphion path, see the [upstream Vevo2 README](https://github.com/open-mmlab/Amphion/blob/main/models/svc/vevo2/README.md). |
|
|
| ## License |
|
|
| MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent. |
|
|