AEmotionStudio/vevo2-models



AEmotionStudio committed on Apr 12

Commit 41d886e · verified · 1 Parent(s): e7980dd

Add README for Mæstræa mirror

Files changed (1)

README.md +79 -0

README.md ADDED

	@@ -0,0 +1,79 @@

+---
+license: mit
+tags:
+  - audio
+  - voice-conversion
+  - singing-voice
+  - speech-synthesis
+  - vevo2
+  - amphion
+  - safetensors
+  - maestraea
+pipeline_tag: audio-to-audio
+base_model: amphion/Vevo
+---
+# Vevo2 Models (Mæstræa Mirror)
+**Singing Voice Synthesis, Conversion & Editing**
+[Original Model](https://huggingface.co/amphion/Vevo) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License
+> This is a mirror of the Vevo2 model weights for use with [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). All credits go to the original authors.
+## What's in This Repo
+| Path | Description | Size |
+|------|-------------|------|
+| `contentstyle_modeling/PhoneToVq8192/model.safetensors` | AR model (Qwen2.5-0.5B, ~500M params) | ~2.5 GB |
+| `contentstyle_modeling/Vq32ToVq8192/model.safetensors` | Style transfer model | ~1.5 GB |
+| `acoustic_modeling/Vq8192ToMels/model.safetensors` | Flow matching model (~350M params) | ~1.4 GB |
+| `acoustic_modeling/Vocoder/model*.safetensors` | Vocos vocoder (~250M params) | ~1 GB |
+| `tokenizer/vq32/` | HuBERT tokenizer (pickle + config) | ~1.3 GB |
+| `tokenizer/vq8192/model.safetensors` | VQ8192 tokenizer | ~200 MB |
+**Total: ~8 GB**
+## What Vevo2 Does
+Vevo2 is a state-of-the-art voice conversion and singing voice synthesis system from the Amphion toolkit. It supports:
+- **Voice Conversion** — Transform vocals to a target voice/timbre
+- **Singing Voice Synthesis** — Generate singing from text + melody
+- **Speech Editing** — Modify speech content while preserving speaker identity
+- **Zero-Shot TTS** — Generate speech in any voice from a short reference
+### Architecture
+- **AR Model** (Qwen2.5-0.5B) — Autoregressive content-style modeling
+- **FM Model** (~350M) — Flow matching for acoustic generation
+- **Vocos Vocoder** (~250M) — High-quality waveform synthesis
+- **Total: ~1.1B parameters**
+### VRAM Requirements
+| Reference Length | VRAM |
+|-----------------|------|
+| 15s | ~8 GB |
+| 30s | ~10 GB |
+| 45s | ~12 GB |
+Recommended: Keep reference audio to 15–45 seconds.
+## Usage with Mæstræa
+The Mæstræa AI Workstation backend downloads these models automatically. If you install them manually, place them in:
+```
+~/.maestraea/models/vevo2/
+```
+## License
+MIT — same as the original Amphion/Vevo2 release.
+## Credits
+- **Model**: [Amphion Vevo2](https://github.com/open-mmlab/Amphion/tree/main/models/vc/vevo2)
+- **Paper**: See [Amphion repository](https://github.com/open-mmlab/Amphion) for citation
+- **Mirror by**: [AEmotionStudio](https://huggingface.co/AEmotionStudio)
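
A sketch of the manual route for the "Usage with Mæstræa" section of the README above: the mirror could be fetched with `snapshot_download` from the `huggingface_hub` package into the directory the README names. The repo id `AEmotionStudio/vevo2-models` is taken from this page. The helper names `vevo2_model_dir` and `download_vevo2` are hypothetical, and it is an assumption that Mæstræa reads the files in the same layout the repo uses.

```python
from pathlib import Path

# Repo id as shown on this page; target directory as given in the README.
REPO_ID = "AEmotionStudio/vevo2-models"
MODEL_SUBDIR = Path(".maestraea") / "models" / "vevo2"


def vevo2_model_dir(home=None):
    """Return the README's model directory under `home` (hypothetical helper)."""
    base = Path(home) if home is not None else Path.home()
    return base / MODEL_SUBDIR


def download_vevo2(home=None):
    """Fetch the whole mirror (~8 GB per the README) into the model directory.

    Assumption: Mæstræa expects the repo's own directory layout on disk.
    """
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = vevo2_model_dir(home)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_vevo2()` with no arguments targets `~/.maestraea/models/vevo2/`. Passing `home` makes it easy to point the download at a scratch directory first.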
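
The VRAM table in the README above lists three points (15 s, 30 s, 45 s) that lie on a straight line, roughly 2 GB per additional 15 s of reference audio. The sketch below interpolates between them. It assumes usage grows linearly between the listed points, which the README does not state, and it refuses to extrapolate outside the recommended 15 to 45 second range. `estimate_vram_gb` is a hypothetical helper, not part of Vevo2 or Mæstræa.

```python
# (reference length in seconds, approximate VRAM in GB), copied from the README table.
VRAM_TABLE = [(15, 8.0), (30, 10.0), (45, 12.0)]


def estimate_vram_gb(reference_seconds):
    """Linearly interpolate the README's VRAM table (an assumed model, not documented)."""
    lo_s, hi_s = VRAM_TABLE[0][0], VRAM_TABLE[-1][0]
    if not lo_s <= reference_seconds <= hi_s:
        raise ValueError(f"reference length must be within {lo_s}-{hi_s} s")
    # Walk adjacent table rows and interpolate inside the matching segment.
    for (s0, g0), (s1, g1) in zip(VRAM_TABLE, VRAM_TABLE[1:]):
        if s0 <= reference_seconds <= s1:
            return g0 + (g1 - g0) * (reference_seconds - s0) / (s1 - s0)
    raise AssertionError("unreachable: range was checked above")


print(estimate_vram_gb(30))    # 10.0
print(estimate_vram_gb(22.5))  # 9.0
```

The figures are only as accurate as the table's approximate values, so treat the result as a rough planning number rather than a measurement.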