---
license: mit
tags:
- audio
- voice-conversion
- singing-voice
- speech-synthesis
- text-to-speech
- vevo2
- amphion
- safetensors
- maestraea
pipeline_tag: audio-to-audio
base_model: RMSnow/Vevo2
---

# Vevo2 Models (Mæstræa Mirror)

**Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control**

[Original Weights](https://huggingface.co/RMSnow/Vevo2) by [RMSnow](https://huggingface.co/RMSnow) · [Source Code](https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevo2) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License

> Mirror of the inference-only files from `RMSnow/Vevo2`, packaged for use with the [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`, the `_text` FM variant, and the `pretrained` AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.

## What's in This Repo

| Path | Description | Size |
|------|-------------|------|
| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std for normed Whisper features | ~12 KB |
| `vocoder/model*.safetensors` | Vocos vocoder (~250M, sharded) | ~1.2 GB |
| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |

**Total: ~4 GB**

> Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is **not** mirrored here; `openai-whisper` will pull it to `~/.cache/whisper` on first run.

## What Vevo2 Does

Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:

- **Convert**: voice/timbre conversion, FM-only (fastest)
- **TTS**: zero-shot text-to-speech / text-to-singing from a short reference clip
- **Edit**: rewrite words while preserving voice, melody, prosody, and style
- **Style**: singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
- **Melody**: sing target lyrics over a humming, whistled, or instrumental melody
- **SVC**: full singing voice conversion via the AR + FM pipeline (deeper than Convert)

### Architecture

- **AR Model** (Qwen2.5-0.5B post-trained): autoregressive content-style modeling
- **Flow-Matching Transformer** (~350M): acoustic generation
- **Vocos Vocoder** (~250M): high-quality 24 kHz waveform synthesis
- **Content-style + Prosody Tokenizers**: FVQ codecs over Whisper / chromagram features

### VRAM Requirements

| Reference Length | VRAM (GPU, FP16) |
|------------------|------------------|
| 15 s | ~6 GB |
| 30 s | ~10 GB |
| 45 s | ~12 GB |

Recommendation: keep the timbre reference between 15–45 s. Longer references buy more identity fidelity at a real VRAM cost.

## Usage in Mæstræa

The Mæstræa runner clones [open-mmlab/Amphion](https://github.com/open-mmlab/Amphion) to `~/.maestraea/libs/amphion/` on first model load and imports the pipeline from `models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline`. The download manager pulls these weights to `~/.maestraea/models/vevo2/` and the runner resolves checkpoints from that directory at the paths shown in the table above.

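
To confirm that a download is complete before loading the model, a small standalone check can help. The following is a minimal sketch, not part of Mæstræa or Amphion: the function `missing_checkpoints` and the constants `EXPECTED_PATTERNS` and `DEFAULT_ROOT` are hypothetical names chosen for illustration, while the relative paths and the default directory are copied from this README.

```python
from pathlib import Path

# Glob patterns relative to the model root, copied from the table in this README.
EXPECTED_PATTERNS = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "vocoder/model*.safetensors",  # sharded, so accept any shard name
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]

# Download location named in this README (hypothetical constant name).
DEFAULT_ROOT = "~/.maestraea/models/vevo2"


def missing_checkpoints(root=DEFAULT_ROOT):
    """Return the expected patterns that match no file under `root`."""
    root = Path(root).expanduser()
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]
```

Calling `missing_checkpoints()` with no arguments inspects the default download directory; an empty list means every file listed in the table was found. The check only looks at file presence, not at file size or integrity.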

For the standalone Amphion path, see the [upstream Vevo2 README](https://github.com/open-mmlab/Amphion/blob/main/models/svc/vevo2/README.md).

## License

MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.