---
license: mit
tags:
  - audio
  - voice-conversion
  - singing-voice
  - speech-synthesis
  - text-to-speech
  - vevo2
  - amphion
  - safetensors
  - maestraea
pipeline_tag: audio-to-audio
base_model: RMSnow/Vevo2
---

# Vevo2 Models (Mæstræa Mirror)

**Speech & Singing Voice Synthesis · Conversion · Editing · Style Transfer · Melody Control**

[Original Weights](https://huggingface.co/RMSnow/Vevo2) by [RMSnow](https://huggingface.co/RMSnow) · [Source Code](https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevo2) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License

> Mirror of the inference-only files from `RMSnow/Vevo2`, packaged for use with the [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). Training artifacts (`optimizer.pt`, `scheduler.pt`, `rng_state_*.pth`, `trainer_state.json`, `training_args.bin`, the `_text` FM variant, and the `pretrained` AR baseline) are dropped to keep the download lean. All credit for the model itself goes to the upstream authors.

## What's in This Repo

| Path | Description | Size |
|------|-------------|------|
| `contentstyle_modeling/posttrained/model.safetensors` | AR transformer (Qwen2.5-0.5B post-trained) | ~970 MB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors` | Flow-Matching transformer (~350M params) | ~1.4 GB |
| `acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt` | Per-channel mean/std for normed Whisper features | ~12 KB |
| `vocoder/model*.safetensors` | Vocos vocoder (~250M params, sharded) | ~1.2 GB |
| `tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors` | Content-style tokenizer (FVQ16384 @ 12.5 Hz) | ~234 MB |
| `tokenizer/prosody_fvq512_6.25hz/model.safetensors` | Prosody tokenizer (FVQ512 @ 6.25 Hz) | ~261 MB |
| `contentstyle_modeling/posttrained/{tokenizer.json, vocab.json, …}` | AR text tokenizer + configs | ~22 MB |
| `*/config.json`, `amphion_config.json`, etc. | Per-component configs | small |

**Total: ~4 GB**
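
As a quick sanity check after downloading, the layout in the table can be verified with a few lines of standard-library Python. This is a minimal sketch, not part of Mæstræa or Amphion: the helper name `missing_files` is hypothetical, and only the weight files named explicitly in the table are checked (the elided config entries are left out).

```python
from pathlib import Path

# Glob patterns relative to the model root, copied from the table above.
EXPECTED_PATTERNS = [
    "contentstyle_modeling/posttrained/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/model.safetensors",
    "acoustic_modeling/fm_emilia101k_singnet7k_repa/whisper_stats.pt",
    "vocoder/model*.safetensors",  # sharded, so match any shard name
    "tokenizer/contentstyle_fvq16384_12.5hz/model.safetensors",
    "tokenizer/prosody_fvq512_6.25hz/model.safetensors",
]


def missing_files(model_root):
    """Return the patterns that match no file under model_root (hypothetical helper)."""
    root = Path(model_root)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]
```

Calling `missing_files(Path.home() / ".maestraea" / "models" / "vevo2")` returns an empty list when every listed weight file is present.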

> Whisper-medium (~1.5 GB, used by the content-style tokenizer at inference time) is **not** mirrored here — `openai-whisper` will pull it to `~/.cache/whisper` on first run.
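
To see whether that first-run download has already happened, the cache can be inspected directly. The sketch below assumes that `openai-whisper` stores the checkpoint as `medium.pt` and honours `XDG_CACHE_HOME` before falling back to `~/.cache`; both details are assumptions about the library's defaults, and the function names are hypothetical.

```python
import os
from pathlib import Path


def whisper_cache_dir():
    """Default cache directory (assumed: $XDG_CACHE_HOME/whisper, else ~/.cache/whisper)."""
    base = os.getenv("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache"))
    return Path(base) / "whisper"


def whisper_medium_cached(cache_dir=None):
    """True if a file named medium.pt (assumed checkpoint name) exists in the cache."""
    cache_dir = Path(cache_dir) if cache_dir is not None else whisper_cache_dir()
    return (cache_dir / "medium.pt").is_file()
```

If `whisper_medium_cached()` returns `False`, expect the ~1.5 GB download on the first run.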

## What Vevo2 Does

Vevo2 is a unified, controllable speech-and-singing voice generation system from the Amphion toolkit. The Mæstræa panel exposes six task tabs that all route to the same backend pipeline:

- **Convert** — voice/timbre conversion, FM-only (fastest)
- **TTS** — zero-shot text-to-speech / text-to-singing from a short reference clip
- **Edit** — rewrite words while preserving voice, melody, prosody, and style
- **Style** — singing style transfer (e.g. breathy → vibrato, pop → opera) preserving voice + melody
- **Melody** — sing target lyrics over a humming, whistled, or instrumental melody
- **SVC** — full singing voice conversion via the AR + FM pipeline (deeper than Convert)

### Architecture

- **AR Model** (Qwen2.5-0.5B post-trained) — autoregressive content-style modeling
- **Flow-Matching Transformer** (~350M) — acoustic generation
- **Vocos Vocoder** (~250M) — high-quality 24 kHz waveform synthesis
- **Content-style + Prosody Tokenizers** — FVQ codecs over Whisper / chromagram features
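
The tokenizer frame rates in the file table (12.5 Hz and 6.25 Hz) give a rough feel for sequence lengths. The arithmetic below is only an estimate: it assumes one token per frame and ignores any padding or framing details, which this card does not specify. `approx_token_count` is a hypothetical helper.

```python
import math

CONTENT_STYLE_HZ = 12.5  # content-style tokenizer rate, from the file table
PROSODY_HZ = 6.25        # prosody tokenizer rate, from the file table


def approx_token_count(duration_s, rate_hz):
    """Estimate tokens for a clip, assuming one token per frame and no padding."""
    return math.floor(duration_s * rate_hz)


# A 30 s clip: about 375 content-style tokens and 187 prosody tokens.
content_tokens = approx_token_count(30, CONTENT_STYLE_HZ)
prosody_tokens = approx_token_count(30, PROSODY_HZ)
```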

### VRAM Requirements

| Reference Length | VRAM (GPU, FP16) |
|------------------|------------------|
| 15 s             | ~6 GB            |
| 30 s             | ~10 GB           |
| 45 s             | ~12 GB           |

Recommendation: keep the timbre reference between 15 and 45 s. Longer references improve identity fidelity but roughly double VRAM use, from ~6 GB at 15 s to ~12 GB at 45 s.
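
A reference clip's length can be checked before loading it. This is a minimal sketch using Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the function names and the constants are hypothetical and simply encode the 15–45 s recommendation above.

```python
import wave

MIN_REF_S = 15.0  # lower bound of the recommended reference length
MAX_REF_S = 45.0  # upper bound of the recommended reference length


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames divided by sample rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_length_ok(path):
    """True if the clip falls inside the recommended 15-45 s window."""
    return MIN_REF_S <= wav_duration_seconds(path) <= MAX_REF_S
```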

## Usage in Mæstræa

The Mæstræa runner clones [open-mmlab/Amphion](https://github.com/open-mmlab/Amphion) to `~/.maestraea/libs/amphion/` on first model load and imports the pipeline from `models.svc.vevo2.vevo2_utils.Vevo2InferencePipeline`. The download manager pulls these weights to `~/.maestraea/models/vevo2/` and the runner resolves checkpoints from that directory at the paths shown in the table above.
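
The import step described above might look roughly like the following. This is a sketch of one way to do it, not the Mæstræa runner's actual code: it assumes the Amphion checkout is made importable by putting its root on `sys.path`, and `load_pipeline_class` is a hypothetical helper. Constructing the pipeline is left out because its arguments are not documented in this card.

```python
import sys
from pathlib import Path

# Locations stated in this README.
AMPHION_DIR = Path.home() / ".maestraea" / "libs" / "amphion"
MODEL_DIR = Path.home() / ".maestraea" / "models" / "vevo2"


def load_pipeline_class(amphion_dir=AMPHION_DIR):
    """Return the Vevo2InferencePipeline class, or None if Amphion is not cloned yet."""
    amphion_dir = Path(amphion_dir)
    if not (amphion_dir / "models" / "svc" / "vevo2").is_dir():
        return None
    # Assumption: Amphion modules resolve relative to the repository root.
    if str(amphion_dir) not in sys.path:
        sys.path.insert(0, str(amphion_dir))
    from models.svc.vevo2.vevo2_utils import Vevo2InferencePipeline
    return Vevo2InferencePipeline
```

A `None` result means the clone has not happened yet, for example because no model has been loaded in Mæstræa so far.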

For the standalone Amphion path, see the [upstream Vevo2 README](https://github.com/open-mmlab/Amphion/blob/main/models/svc/vevo2/README.md).

## License

MIT, inherited from upstream. Commercial use of generated audio is permitted. Don't clone someone's voice without their consent.