Vevo2 Models (Mæstræa Mirror)
Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control
Original Weights by RMSnow · Source Code by OpenMMLab / Amphion · MIT License
Mirror of the inference-only files from RMSnow/Vevo2, packaged for use with the Mæstræa AI Workstation. Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`, the `_text` FM variant, and the `pretrained` AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.
What's in This Repo
| Path | Description | Size |
|---|---|---|
| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std for normed Whisper features | ~12 KB |
| `vocoder/model*.safetensors` | Vocos vocoder (~250M params, sharded) | ~1.2 GB |
| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |
Total: ~4 GB
Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is not mirrored here; `openai-whisper` will pull it to `~/.cache/whisper` on first run.
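To know in advance whether the first run will trigger the Whisper download, the cache can be inspected directly. The following is a minimal sketch, not part of Mæstræa or Amphion: the helper names are hypothetical, and it assumes `openai-whisper`'s default cache layout (the `whisper` folder under `XDG_CACHE_HOME`, falling back to `~/.cache`) and the checkpoint file name `medium.pt`.

```python
import os
from pathlib import Path
from typing import Optional


def whisper_cache_dir() -> Path:
    """Directory openai-whisper is assumed to download checkpoints into."""
    default_cache = Path.home() / ".cache"
    return Path(os.getenv("XDG_CACHE_HOME", str(default_cache))) / "whisper"


def whisper_medium_cached(cache_dir: Optional[Path] = None) -> bool:
    """Return True if a Whisper-medium checkpoint file is already on disk.

    Assumption: the checkpoint is stored as `medium.pt` in the cache directory.
    """
    cache_dir = cache_dir if cache_dir is not None else whisper_cache_dir()
    return (cache_dir / "medium.pt").is_file()


if __name__ == "__main__":
    if whisper_medium_cached():
        print("Whisper-medium already cached")
    else:
        print("Whisper-medium will be downloaded on first run")
```

The check only looks for the file's presence; it does not verify the checksum, so a truncated download would still report as cached.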
What Vevo2 Does
Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:
- Convert — voice/timbre conversion, FM-only (fastest)
- TTS — zero-shot text-to-speech / text-to-singing from a short reference clip
- Edit — rewrite words while preserving voice, melody, prosody, and style
- Style — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
- Melody — sing target lyrics over a humming, whistled, or instrumental melody
- SVC — full singing voice conversion via the AR + FM pipeline (deeper than Convert)
Architecture
- AR Model (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
- Flow-Matching Transformer (~350M) — acoustic generation
- Vocos Vocoder (~250M) — high-quality 24 kHz waveform synthesis
- Content-style + Prosody Tokenizers — FVQ codecs over Whisper / chromagram features
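The tokenizer frame rates translate directly into sequence lengths for the AR and FM stages. The sketch below works through that arithmetic under two assumptions that the source does not state explicitly: that the number in `FVQ16384` / `FVQ512` is the codebook size, and that a clip is padded up to a whole number of frames. The function names are illustrative only.

```python
import math

# Frame rates and (assumed) codebook sizes, read off the tokenizer names.
CONTENT_STYLE_RATE_HZ = 12.5
CONTENT_STYLE_CODEBOOK = 16384
PROSODY_RATE_HZ = 6.25
PROSODY_CODEBOOK = 512


def token_count(duration_s: float, rate_hz: float) -> int:
    """Number of tokens for a clip, assuming the last partial frame is padded."""
    return math.ceil(duration_s * rate_hz)


def bitrate_bps(codebook_size: int, rate_hz: float) -> float:
    """Information rate of a single-codebook token stream, in bits per second."""
    return math.log2(codebook_size) * rate_hz


if __name__ == "__main__":
    clip_s = 30.0
    print("content-style tokens:", token_count(clip_s, CONTENT_STYLE_RATE_HZ))
    print("prosody tokens:", token_count(clip_s, PROSODY_RATE_HZ))
    print("content-style bps:", bitrate_bps(CONTENT_STYLE_CODEBOOK, CONTENT_STYLE_RATE_HZ))
    print("prosody bps:", bitrate_bps(PROSODY_CODEBOOK, PROSODY_RATE_HZ))
```

Under these assumptions a 30 s clip yields 375 content-style tokens and 188 prosody tokens, which is why the AR stage stays cheap relative to the acoustic model.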
VRAM Requirements
| Reference Length | VRAM (GPU, FP16) |
|---|---|
| 15 s | ~6 GB |
| 30 s | ~10 GB |
| 45 s | ~12 GB |
Recommendation: keep the timbre reference between 15–45 s. Longer references buy more identity fidelity at a real VRAM cost.
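For reference lengths between the table rows, a rough estimate can be obtained by interpolating. This is a sketch only: linear interpolation between the three measured points is an assumption, not a documented scaling law, and `estimate_vram_gb` is a hypothetical helper rather than anything shipped with Mæstræa.

```python
# (reference length in seconds, approximate FP16 VRAM in GB) from the table above.
VRAM_TABLE_GB = [(15.0, 6.0), (30.0, 10.0), (45.0, 12.0)]


def estimate_vram_gb(reference_s: float) -> float:
    """Linearly interpolate VRAM use inside the recommended 15-45 s range."""
    lo, hi = VRAM_TABLE_GB[0][0], VRAM_TABLE_GB[-1][0]
    if not lo <= reference_s <= hi:
        raise ValueError(f"reference length must be within {lo}-{hi} s")
    for (x0, y0), (x1, y1) in zip(VRAM_TABLE_GB, VRAM_TABLE_GB[1:]):
        if x0 <= reference_s <= x1:
            return y0 + (y1 - y0) * (reference_s - x0) / (x1 - x0)
    raise AssertionError("unreachable: range check above covers all segments")


if __name__ == "__main__":
    for seconds in (15, 20, 37.5, 45):
        print(f"{seconds} s -> ~{estimate_vram_gb(seconds):.1f} GB")
```

The function refuses to extrapolate outside 15 to 45 s, since the table gives no evidence for how memory behaves beyond those points.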
Usage in Mæstræa
The Mæstræa runner clones open-mmlab/Amphion to ~/.maestraea/libs/amphion/ on first model load and imports the pipeline from models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline. The download manager pulls these weights to ~/.maestraea/models/vevo2/ and the runner resolves checkpoints from that directory at the paths shown in the table above.
For the standalone Amphion path, see the upstream Vevo2 README.
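Before the first model load it can be useful to confirm that the download completed. The sketch below checks the Mæstræa model directory for the weight files listed in the repo table; it is an illustration, not the runner's actual resolution logic, and `missing_checkpoints` is a hypothetical helper. Config files and the text-tokenizer files are left out because the source only lists them partially.

```python
from pathlib import Path
from typing import List

# Download location described above.
MODEL_ROOT = Path.home() / ".maestraea" / "models" / "vevo2"

# Weight files with fully specified paths in the repo table.
REQUIRED_FILES = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]

# The vocoder is sharded, so at least one matching shard is required.
VOCODER_GLOB = "vocoder/model*.safetensors"


def missing_checkpoints(root: Path = MODEL_ROOT) -> List[str]:
    """Return the relative paths (or glob) that are absent under `root`."""
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    if not any(root.glob(VOCODER_GLOB)):
        missing.append(VOCODER_GLOB)
    return missing


if __name__ == "__main__":
    gaps = missing_checkpoints()
    if gaps:
        print("Missing files:")
        for rel in gaps:
            print("  ", rel)
    else:
        print("All listed Vevo2 weights are present")
```

An empty result means every listed weight file is present; it says nothing about file integrity, so a partial download can still pass.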
License
MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.
Model tree for AEmotionStudio/vevo2-models: base model RMSnow/Vevo2