	---
	license: mit
	tags:
	- audio
	- voice-conversion
	- singing-voice
	- speech-synthesis
	- text-to-speech
	- vevo2
	- amphion
	- safetensors
	- maestraea
	pipeline_tag: audio-to-audio
	base_model: RMSnow/Vevo2
	---

	# Vevo2 Models (Mæstræa Mirror)

	Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control

	[Original Weights](https://huggingface.co/RMSnow/Vevo2) by [RMSnow](https://huggingface.co/RMSnow) · [Source Code](https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevo2) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License

	> Mirror of the inference-only files from `RMSnow/Vevo2`, packaged for use with the [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`, the `_text` FM variant, and the `pretrained` AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.
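
The filtering described above can be sketched as a filename filter. This is a minimal illustration, not the script used to build this mirror. It covers only the file names listed in the note. The `_text` FM variant and the `pretrained` AR baseline are left out because their upstream directory names are not given here. The helper names `is_training_artifact` and `inference_subset` are hypothetical.

```python
from fnmatch import fnmatch

# File names listed in the note above as training-only artifacts.
TRAINING_ARTIFACTS = [
    "optimizer.pt",
    "scheduler.pt",
    "rng_state_*.pth",
    "trainer_state.json",
    "training_args.bin",
]


def is_training_artifact(repo_path: str) -> bool:
    """Return True if the last path component matches a training-only pattern."""
    filename = repo_path.rsplit("/", 1)[-1]
    return any(fnmatch(filename, pattern) for pattern in TRAINING_ARTIFACTS)


def inference_subset(repo_paths):
    """Keep only the paths that are not training artifacts (hypothetical helper)."""
    return [path for path in repo_paths if not is_training_artifact(path)]


if __name__ == "__main__":
    # Example listing: the first path appears in this repo; the other two
    # locations are assumed purely for illustration.
    listing = [
        "contentstyle_modeling/posttrained/model.safetensors",
        "contentstyle_modeling/posttrained/optimizer.pt",
        "contentstyle_modeling/posttrained/rng_state_0.pth",
    ]
    for path in inference_subset(listing):
        print(path)
```

A similar list could likely be passed as `ignore_patterns` to `huggingface_hub.snapshot_download` when pulling from upstream, though the patterns may need a leading `*/` to match full repo paths.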

	## What's in This Repo

	| Path | Description | Size |
	|------|-------------|------|
	| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
	| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
	| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std for normed Whisper features | ~12 KB |
	| `vocoder/model*.safetensors` | Vocos vocoder (~250M params, sharded) | ~1.2 GB |
	| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
	| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
	| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
	| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |

	Total: ~4 GB

	> Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is not mirrored here — `openai-whisper` will pull it to `~/.cache/whisper` on first run.
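
To check ahead of time whether that first-run download has already happened, something like the following may help. It is a sketch under two assumptions: `openai-whisper` stores the checkpoint as `medium.pt`, and it honors `XDG_CACHE_HOME` before falling back to `~/.cache`. Both function names are hypothetical.

```python
import os
from pathlib import Path


def whisper_cache_file(model_name: str = "medium") -> Path:
    """Best-guess location of a cached openai-whisper checkpoint.

    Assumption: the cache root is $XDG_CACHE_HOME if set, else ~/.cache,
    and the checkpoint file is named <model_name>.pt.
    """
    cache_root = os.environ.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache"
    )
    return Path(cache_root) / "whisper" / f"{model_name}.pt"


def whisper_is_cached(model_name: str = "medium") -> bool:
    """Return True if the assumed checkpoint file already exists on disk."""
    return whisper_cache_file(model_name).is_file()


if __name__ == "__main__":
    print(whisper_cache_file(), "cached:", whisper_is_cached())
```

If the file is missing, expect roughly 1.5 GB of extra download on the first inference run.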

	## What Vevo2 Does

	Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:

	- Convert — voice/timbre conversion, FM-only (fastest)
	- TTS — zero-shot text-to-speech / text-to-singing from a short reference clip
	- Edit — rewrite words while preserving voice, melody, prosody, and style
	- Style — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
	- Melody — sing target lyrics over a humming, whistled, or instrumental melody
	- SVC — full singing voice conversion via the AR + FM pipeline (deeper than Convert)

	### Architecture

	- AR Model (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
	- Flow-Matching Transformer (~350M) — acoustic generation
	- Vocos Vocoder (~250M) — high-quality 24 kHz waveform synthesis
	- Content-style + Prosody Tokenizers — FVQ codecs over Whisper / chromagram features
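
For a sense of scale, the tokenizer frame rates (12.5 Hz and 6.25 Hz, taken from the file names in this repo) translate into token counts as sketched below. Rounding down is an assumption. The real tokenizers may pad or frame the audio differently, so treat the numbers as approximate.

```python
CONTENT_STYLE_HZ = 12.5  # from tokenizer/contentstyle_fvq16384_12.5hz
PROSODY_HZ = 6.25        # from tokenizer/prosody_fvq512_6.25hz


def approx_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Approximate number of tokens for a clip, rounding down (assumption)."""
    return int(duration_s * frame_rate_hz)


if __name__ == "__main__":
    for seconds in (15, 30, 45):
        content = approx_token_count(seconds, CONTENT_STYLE_HZ)
        prosody = approx_token_count(seconds, PROSODY_HZ)
        print(f"{seconds} s -> ~{content} content-style tokens, ~{prosody} prosody tokens")
```

At these rates the AR model sees only a few hundred tokens even for a 45 s clip, and the prosody stream is half the length of the content-style stream.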

	### VRAM Requirements

	| Reference Length | VRAM (GPU, FP16) |
	|------------------|------------------|
	| 15 s | ~6 GB |
	| 30 s | ~10 GB |
	| 45 s | ~12 GB |

	Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity, but VRAM use roughly doubles across that range (about 6 GB at 15 s versus about 12 GB at 45 s).
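
For reference lengths between the three measured points, a rough estimate can be read off by linear interpolation. This is only a sketch: the interpolation is an assumption, not a measurement, and `estimate_vram_gb` is a hypothetical helper that refuses to extrapolate outside the 15 to 45 s range.

```python
# (reference length in seconds, approximate VRAM in GB) from the table above.
VRAM_TABLE_GB = [(15.0, 6.0), (30.0, 10.0), (45.0, 12.0)]


def estimate_vram_gb(reference_s: float) -> float:
    """Linearly interpolate VRAM between the measured points (assumption)."""
    low, high = VRAM_TABLE_GB[0][0], VRAM_TABLE_GB[-1][0]
    if not low <= reference_s <= high:
        raise ValueError(f"reference length must be within {low}-{high} s")
    for (x0, y0), (x1, y1) in zip(VRAM_TABLE_GB, VRAM_TABLE_GB[1:]):
        if x0 <= reference_s <= x1:
            return y0 + (y1 - y0) * (reference_s - x0) / (x1 - x0)
    raise AssertionError("unreachable: table covers the validated range")


if __name__ == "__main__":
    for seconds in (15, 20, 30, 40, 45):
        print(f"{seconds} s -> ~{estimate_vram_gb(seconds):.1f} GB")
```

Actual usage will also depend on the task, the length of the source audio, and the GPU, so leave some headroom.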

	## Usage in Mæstræa

	The Mæstræa runner clones [open-mmlab/Amphion](https://github.com/open-mmlab/Amphion) to `~/.maestraea/libs/amphion/` on first model load and imports the pipeline from `models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline`. The download manager pulls these weights to `~/.maestraea/models/vevo2/` and the runner resolves checkpoints from that directory at the paths shown in the table above.
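
If a model load fails, a quick way to rule out an incomplete download is to check that the weight files from the table exist under the model directory. The following is a minimal sketch, not the runner's actual resolution logic, and `missing_weights` is a hypothetical helper.

```python
from pathlib import Path

# Relative paths copied from the "What's in This Repo" table.
EXPECTED_FILES = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]
# The vocoder is sharded, so match any shard rather than a fixed name.
VOCODER_GLOB = "vocoder/model*.safetensors"


def missing_weights(model_root: Path) -> list[str]:
    """Return the expected weight paths (or glob) not found under model_root."""
    missing = [rel for rel in EXPECTED_FILES if not (model_root / rel).is_file()]
    if not any(model_root.glob(VOCODER_GLOB)):
        missing.append(VOCODER_GLOB)
    return missing


if __name__ == "__main__":
    root = Path.home() / ".maestraea" / "models" / "vevo2"
    problems = missing_weights(root)
    print("all weights present" if not problems else f"missing: {problems}")
```

The check covers only the large weight files. Config and text-tokenizer files are not verified here.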

	For the standalone Amphion path, see the [upstream Vevo2 README](https://github.com/open-mmlab/Amphion/blob/main/models/svc/vevo2/README.md).

	## License

	MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.