Vevo2 Models (Mæstræa Mirror)

Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control

Original Weights by RMSnow · Source Code by OpenMMLab / Amphion · MIT License

Mirror of the inference-only files from RMSnow/Vevo2, packaged for use with the Mæstræa AI Workstation. Training artifacts (optimizer.pt, scheduler.pt, rng_state_*.pth, trainer_state.json, training_args.bin, the _text FM variant, and the pretrained AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.

What's in This Repo

Path	Description	Size
`contentstyle_modeling/posttrained/model.safetensors`	AR transformer (Qwen2.5-0.5B post-trained)	~970 MB
`acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors`	Flow-Matching transformer (~350M params)	~1.4 GB
`acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt`	Per-channel mean/std for normed Whisper features	~12 KB
`vocoder/model*.safetensors`	Vocos vocoder (~250M, sharded)	~1.2 GB
`tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors`	Content-style tokenizer (FVQ16384 @ 12.5 Hz)	~234 MB
`tokenizer/prosody_fvq512_6.25hz/model.safetensors`	Prosody tokenizer (FVQ512 @ 6.25 Hz)	~261 MB
`contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}`	AR text tokenizer + configs	~22 MB
`*/config.json`, `amphion_config.json`, etc.	Per-component configs	small

Total: ~4 GB

Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is not mirrored here; openai-whisper downloads it to `~/.cache/whisper` on first run.
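As a rough disk-space preflight, the sizes in the table and the Whisper note can be added up before downloading. The sketch below is illustrative only: the function names are hypothetical, the sizes are the approximate figures from the table, and the cache location and the `medium.pt` file name are assumptions about openai-whisper's defaults rather than anything this repo guarantees.

```python
import os
import shutil
from pathlib import Path

# Approximate sizes copied from the table above, in decimal MB.
REPO_SIZES_MB = {
    "AR transformer": 970,
    "Flow-Matching transformer": 1400,
    "whisper_stats.pt": 0.012,
    "Vocos vocoder": 1200,
    "Content-style tokenizer": 234,
    "Prosody tokenizer": 261,
    "AR text tokenizer + configs": 22,
}
WHISPER_MEDIUM_MB = 1500  # "~1.5 GB" per the note above


def repo_total_gb() -> float:
    """Sum of the mirrored files, in GB."""
    return sum(REPO_SIZES_MB.values()) / 1000


def whisper_cache_dir() -> Path:
    """Assumed openai-whisper default: $XDG_CACHE_HOME/whisper, else ~/.cache/whisper."""
    default = os.path.join(os.path.expanduser("~"), ".cache")
    return Path(os.getenv("XDG_CACHE_HOME", default)) / "whisper"


def needed_gb() -> float:
    """Repo weights plus Whisper-medium, unless the latter already looks cached."""
    cached = (whisper_cache_dir() / "medium.pt").exists()  # file name is an assumption
    extra_mb = 0 if cached else WHISPER_MEDIUM_MB
    return repo_total_gb() + extra_mb / 1000


def has_room(target_dir: str) -> bool:
    """True if the filesystem holding target_dir (must exist) has enough free space."""
    free_gb = shutil.disk_usage(target_dir).free / 1e9
    return free_gb >= needed_gb()


if __name__ == "__main__":
    print(f"repo ~{repo_total_gb():.1f} GB, need up to ~{needed_gb():.1f} GB")
```

The check treats the repo and the Whisper cache as living on the same filesystem, which may not hold on every machine.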

What Vevo2 Does

Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:

Convert — voice/timbre conversion, FM-only (fastest)
TTS — zero-shot text-to-speech / text-to-singing from a short reference clip
Edit — rewrite words while preserving voice, melody, prosody, and style
Style — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
Melody — sing target lyrics over a humming, whistled, or instrumental melody
SVC — full singing voice conversion via the AR + FM pipeline (deeper than Convert)

Architecture

AR Model (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
Flow-Matching Transformer (~350M) — acoustic generation
Vocos Vocoder (~250M) — high-quality 24 kHz waveform synthesis
Content-style + Prosody Tokenizers — FVQ codecs over Whisper / chromagram features
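To give a feel for how compact the two token streams are, the frame rates and codebook sizes named in the file table (FVQ16384 @ 12.5 Hz, FVQ512 @ 6.25 Hz) can be turned into token counts and nominal bitrates. This is a back-of-the-envelope sketch: it assumes a single codebook per tokenizer and simple flooring of partial frames, neither of which is confirmed here, and the helper names are hypothetical.

```python
import math

# Codebook sizes and frame rates as named in the file table
# (contentstyle_fvq16384_12.5hz, prosody_fvq512_6.25hz).
TOKENIZERS = {
    "content_style": {"codebook": 16384, "rate_hz": 12.5},
    "prosody": {"codebook": 512, "rate_hz": 6.25},
}


def token_count(seconds: float, rate_hz: float) -> int:
    """Whole tokens produced for a clip, assuming partial frames are dropped."""
    return math.floor(seconds * rate_hz)


def bitrate_bps(codebook: int, rate_hz: float) -> float:
    """Nominal bits per second, assuming one codebook index per frame."""
    return math.log2(codebook) * rate_hz


for name, t in TOKENIZERS.items():
    n = token_count(30, t["rate_hz"])
    bps = bitrate_bps(t["codebook"], t["rate_hz"])
    print(f"{name}: {n} tokens per 30 s, {bps:g} bit/s")
```

Under these assumptions a 30 s clip maps to 375 content-style tokens and 187 prosody tokens.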

VRAM Requirements

Reference Length	VRAM (GPU, FP16)
15 s	~6 GB
30 s	~10 GB
45 s	~12 GB

Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity, but VRAM use roughly doubles, from ~6 GB at 15 s to ~12 GB at 45 s.
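For reference lengths that fall between the table rows, a simple estimate can be interpolated from the three published points. The sketch below assumes VRAM grows piecewise-linearly between rows, which the table does not actually establish; the function names are hypothetical, and the result should be read as a rough guide rather than a measurement.

```python
# (reference length in seconds, approximate VRAM in GB) from the table above.
VRAM_POINTS = [(15, 6.0), (30, 10.0), (45, 12.0)]


def estimate_vram_gb(ref_seconds: float) -> float:
    """Piecewise-linear estimate between table rows.

    Below 15 s the 15 s figure is returned as a conservative bound.
    Above 45 s there is no data, so the function refuses to guess.
    """
    first_s, first_gb = VRAM_POINTS[0]
    last_s, _ = VRAM_POINTS[-1]
    if ref_seconds <= first_s:
        return first_gb
    if ref_seconds > last_s:
        raise ValueError("no VRAM figure published beyond 45 s")
    for (x0, y0), (x1, y1) in zip(VRAM_POINTS, VRAM_POINTS[1:]):
        if x0 <= ref_seconds <= x1:
            return y0 + (y1 - y0) * (ref_seconds - x0) / (x1 - x0)
    raise AssertionError("unreachable: table rows cover the whole range")


def max_reference_seconds(budget_gb: float):
    """Longest tabulated reference length that fits the budget, or None."""
    fitting = [s for s, gb in VRAM_POINTS if gb <= budget_gb]
    return max(fitting) if fitting else None
```

For example, `max_reference_seconds(8)` picks the 15 s row, since the 30 s row already needs about 10 GB.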

Usage in Mæstræa

The Mæstræa runner clones open-mmlab/Amphion to ~/.maestraea/libs/amphion/ on first model load and imports the pipeline from models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline. The download manager pulls these weights to ~/.maestraea/models/vevo2/ and the runner resolves checkpoints from that directory at the paths shown in the table above.
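To confirm a download is complete before loading, the relative paths from the file table can be checked against the local model directory. This is a minimal standalone sketch, not the Mæstræa runner's actual resolution logic: `resolve_checkpoints` and the key names are hypothetical, and the vocoder entry is a glob because the table lists it as sharded.

```python
from pathlib import Path

# Relative paths copied from the file table; keys are illustrative labels.
EXPECTED = {
    "ar": "contentstyle_modeling/posttrained/model.safetensors",
    "fm": "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "whisper_stats": "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "vocoder": "vocoder/model*.safetensors",  # sharded, so match every shard
    "contentstyle_tokenizer": "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "prosody_tokenizer": "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
}


def resolve_checkpoints(root):
    """Return (found, missing): found maps key -> sorted list of matching files."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return {}, list(EXPECTED.values())
    found, missing = {}, []
    for key, pattern in EXPECTED.items():
        matches = sorted(root.glob(pattern))
        if matches:
            found[key] = matches
        else:
            missing.append(pattern)
    return found, missing


if __name__ == "__main__":
    found, missing = resolve_checkpoints("~/.maestraea/models/vevo2")
    for key, paths in found.items():
        print(f"{key}: {len(paths)} file(s)")
    for pattern in missing:
        print(f"missing: {pattern}")
```

An empty `missing` list suggests every weight file named in the table is present; it does not verify file integrity.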

For the standalone Amphion path, see the upstream Vevo2 README.

License

MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.

Model tree for AEmotionStudio/vevo2-models

Base model: RMSnow/Vevo2 (this model is listed as a finetune)