Vevo2 Models (Mæstræa Mirror)

Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control

Original Weights by RMSnow · Source Code by OpenMMLab / Amphion · MIT License

Mirror of the inference-only files from RMSnow/Vevo2, packaged for use with the Mæstræa AI Workstation. Training artifacts (optimizer.pt, scheduler.pt, rng_state_*.pth, trainer_state.json, training_args.bin, the _text FM variant, and the pretrained AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.

What's in This Repo

| Path | Description | Size |
| --- | --- | --- |
| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std for normed Whisper features | ~12 KB |
| `vocoder/model*.safetensors` | Vocos vocoder (~250M params, sharded) | ~1.2 GB |
| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |

Total: ~4 GB

Whisper-medium (1.5 GB, used by the content-style tokenizer at inference time) is not mirrored here; openai-whisper downloads it to `~/.cache/whisper` on first run.
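To check ahead of time whether that first-run download will happen, a small sketch like the following may help. It assumes openai-whisper's default cache location (`$XDG_CACHE_HOME/whisper`, falling back to `~/.cache/whisper`) and assumes the medium checkpoint is stored as `medium.pt`; both function names are hypothetical helpers, not part of any library.

```python
import os
from pathlib import Path


def whisper_cache_dir() -> Path:
    """Return the directory openai-whisper is assumed to download into by default."""
    default_cache = os.path.join(os.path.expanduser("~"), ".cache")
    # Assumption: XDG_CACHE_HOME, when set, takes precedence over ~/.cache.
    return Path(os.getenv("XDG_CACHE_HOME", default_cache)) / "whisper"


def whisper_medium_cached() -> bool:
    """True if a file named medium.pt (assumed checkpoint name) is already cached."""
    return (whisper_cache_dir() / "medium.pt").is_file()


if __name__ == "__main__":
    state = "already cached" if whisper_medium_cached() else "will be downloaded on first run"
    print(f"Whisper-medium: {state} ({whisper_cache_dir()})")
```

If the checkpoint is missing, budget roughly 1.5 GB of extra disk space on top of the ~4 GB for this repo.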

What Vevo2 Does

Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:

  • Convert — voice/timbre conversion, FM-only (fastest)
  • TTS — zero-shot text-to-speech / text-to-singing from a short reference clip
  • Edit — rewrite words while preserving voice, melody, prosody, and style
  • Style — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
  • Melody — sing target lyrics over a humming, whistled, or instrumental melody
  • SVC — full singing voice conversion via the AR + FM pipeline (deeper than Convert)

Architecture

  • AR Model (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
  • Flow-Matching Transformer (~350M) — acoustic generation
  • Vocos Vocoder (~250M) — high-quality 24 kHz waveform synthesis
  • Content-style + Prosody Tokenizers — FVQ codecs over Whisper / chromagram features
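The tokenizer frame rates listed in the file table (12.5 Hz for content-style, 6.25 Hz for prosody) give a rough sense of sequence length for a clip. The sketch below is only that arithmetic; `token_counts` is a hypothetical helper, and rounding down is an assumption, since the real tokenizers may pad or frame audio differently.

```python
CONTENT_STYLE_HZ = 12.5  # content-style tokenizer rate, from the file table
PROSODY_HZ = 6.25        # prosody tokenizer rate, from the file table


def token_counts(duration_s: float) -> dict[str, int]:
    """Approximate number of tokens each tokenizer emits for a clip of this length."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return {
        # int() truncates toward zero; exact framing is an assumption here.
        "content_style": int(duration_s * CONTENT_STYLE_HZ),
        "prosody": int(duration_s * PROSODY_HZ),
    }


if __name__ == "__main__":
    for seconds in (15, 30, 45):
        print(seconds, token_counts(seconds))
```

At these rates a 16 s clip corresponds to about 200 content-style tokens and about 100 prosody tokens.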

VRAM Requirements

| Reference Length | VRAM (GPU, FP16) |
| --- | --- |
| 15 s | ~6 GB |
| 30 s | ~10 GB |
| 45 s | ~12 GB |

Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity, but VRAM use grows with reference length.
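For reference lengths between the listed values, one rough way to budget memory is to interpolate linearly between the three figures above. This is a planning sketch only: `estimate_vram_gb` is a hypothetical helper, linear growth between the listed points is an assumption, and actual usage will vary with task, GPU, and driver.

```python
from bisect import bisect_right

# (reference length in seconds, approximate VRAM in GB) from the figures above
VRAM_TABLE = [(15.0, 6.0), (30.0, 10.0), (45.0, 12.0)]


def estimate_vram_gb(ref_seconds: float) -> float:
    """Linearly interpolate approximate FP16 VRAM for a reference clip length."""
    lengths = [length for length, _ in VRAM_TABLE]
    if not lengths[0] <= ref_seconds <= lengths[-1]:
        # No data outside the listed range, so refuse to extrapolate.
        raise ValueError("only reference lengths from 15 s to 45 s are covered")
    # Index of the upper point of the segment containing ref_seconds.
    hi = min(bisect_right(lengths, ref_seconds), len(lengths) - 1)
    (x0, y0), (x1, y1) = VRAM_TABLE[hi - 1], VRAM_TABLE[hi]
    return y0 + (y1 - y0) * (ref_seconds - x0) / (x1 - x0)


if __name__ == "__main__":
    print(f"{estimate_vram_gb(22.5):.1f} GB")
```

Under that assumption, a 22.5 s reference lands at about 8 GB.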

Usage in Mæstræa

The Mæstræa runner clones open-mmlab/Amphion to ~/.maestraea/libs/amphion/ on first model load and imports the pipeline from models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline. The download manager pulls these weights to ~/.maestraea/models/vevo2/ and the runner resolves checkpoints from that directory at the paths shown in the table above.
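If a model load fails, it can help to confirm that the weights directory matches the layout in the file table. The following is a minimal sketch of such a check, not the runner's actual resolution logic; `EXPECTED` and `missing_files` are hypothetical names, and only files named explicitly in the table are listed.

```python
from pathlib import Path

# Relative paths (glob patterns allowed) taken from the file table above.
EXPECTED = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "contentstyle_modeling/posttrained/tokenizer.json",
    "contentstyle_modeling/posttrained/vocab.json",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "vocoder/model*.safetensors",  # sharded, so match any shard name
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]


def missing_files(root: str) -> list[str]:
    """Return the expected patterns that match nothing under root."""
    base = Path(root).expanduser()
    return [pattern for pattern in EXPECTED if not any(base.glob(pattern))]


if __name__ == "__main__":
    missing = missing_files("~/.maestraea/models/vevo2")
    print("all expected files present" if not missing else f"missing: {missing}")
```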

For the standalone Amphion path, see the upstream Vevo2 README.

License

MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.
