AEmotionStudio committed
Commit 41d886e · verified · 1 Parent(s): e7980dd

Add README for Mæstræa mirror

  1. README.md +79 -0
README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ license: mit
+ tags:
+ - audio
+ - voice-conversion
+ - singing-voice
+ - speech-synthesis
+ - vevo2
+ - amphion
+ - safetensors
+ - maestraea
+ pipeline_tag: audio-to-audio
+ base_model: amphion/Vevo
+ ---
+
+ # Vevo2 Models (Mæstræa Mirror)
+
+ **Singing Voice Synthesis, Conversion & Editing**
+
+ [Original Model](https://huggingface.co/amphion/Vevo) by [OpenMMLab / Amphion](https://github.com/open-mmlab/Amphion) · MIT License
+
+ > This is a mirror of the Vevo2 model weights for use with [Mæstræa AI Workstation](https://github.com/AEmotionStudio/Maestraea). All credits go to the original authors.
+
+ ## What's in This Repo
+
+ | Path | Description | Size |
+ |------|-------------|------|
+ | `contentstyle_modeling/PhoneToVq8192/model.safetensors` | AR model (Qwen2.5-0.5B, ~500M params) | ~2.5 GB |
+ | `contentstyle_modeling/Vq32ToVq8192/model.safetensors` | Style transfer model | ~1.5 GB |
+ | `acoustic_modeling/Vq8192ToMels/model.safetensors` | Flow matching model (~350M params) | ~1.4 GB |
+ | `acoustic_modeling/Vocoder/model*.safetensors` | Vocos vocoder (~250M params) | ~1 GB |
+ | `tokenizer/vq32/` | HuBERT tokenizer (pickle + config) | ~1.3 GB |
+ | `tokenizer/vq8192/model.safetensors` | VQ8192 tokenizer | ~200 MB |
+
+ **Total: ~8 GB**
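The total can be cross-checked against the per-file sizes in the table. The snippet below is a minimal sketch rather than part of the original README: the sizes are the approximate figures from the table, the helper `has_room` is a hypothetical name, and treating 1 GB as 10^9 bytes is an assumption.

```python
import shutil

# Approximate sizes in GB, copied from the table above (200 MB written as 0.2 GB).
SIZES_GB = {
    "contentstyle_modeling/PhoneToVq8192": 2.5,
    "contentstyle_modeling/Vq32ToVq8192": 1.5,
    "acoustic_modeling/Vq8192ToMels": 1.4,
    "acoustic_modeling/Vocoder": 1.0,
    "tokenizer/vq32": 1.3,
    "tokenizer/vq8192": 0.2,
}

total_gb = sum(SIZES_GB.values())
print(f"{total_gb:.1f} GB")  # prints "7.9 GB", which the README rounds to ~8 GB


def has_room(path, required_gb=total_gb):
    """Return True if the filesystem holding `path` has enough free space.

    Hypothetical helper; assumes 1 GB = 10**9 bytes.
    """
    return shutil.disk_usage(path).free >= required_gb * 10**9
```

Calling `has_room(".")` before a download gives a rough indication of whether the full mirror fits on the current disk.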
+
+ ## What Vevo2 Does
+
+ Vevo2 is a voice conversion and singing voice synthesis system from the Amphion toolkit. It supports:
+
+ - **Voice Conversion**: Transform vocals to a target voice/timbre
+ - **Singing Voice Synthesis**: Generate singing from text + melody
+ - **Speech Editing**: Modify speech content while preserving speaker identity
+ - **Zero-Shot TTS**: Generate speech in any voice from a short reference
+
+ ### Architecture
+
+ - **AR Model** (Qwen2.5-0.5B): Autoregressive content-style modeling
+ - **FM Model** (~350M): Flow matching for acoustic generation
+ - **Vocos Vocoder** (~250M): High-quality waveform synthesis
+ - **Total: ~1.1B parameters**
+
+ ### VRAM Requirements
+
+ | Reference Length | VRAM |
+ |-----------------|------|
+ | 15s | ~8 GB |
+ | 30s | ~10 GB |
+ | 45s | ~12 GB |
+
+ Recommended: Keep reference audio to 15–45 seconds.
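The three rows in the table rise by about 2 GB for every additional 15 seconds of reference audio. A minimal sketch of that relationship follows, assuming the trend is linear between the listed points. The function name `estimate_vram_gb` is hypothetical, and lengths outside 15 to 45 seconds are rejected rather than extrapolated because the README gives no figures there.

```python
# (reference length in seconds, approximate VRAM in GB), copied from the table above.
VRAM_TABLE = [(15, 8.0), (30, 10.0), (45, 12.0)]


def estimate_vram_gb(reference_seconds):
    """Linearly interpolate VRAM use between the table rows (an assumption)."""
    shortest, longest = VRAM_TABLE[0][0], VRAM_TABLE[-1][0]
    if not shortest <= reference_seconds <= longest:
        raise ValueError(
            f"no data outside {shortest}-{longest} s; got {reference_seconds} s"
        )
    # Walk adjacent pairs of rows and interpolate inside the matching segment.
    for (t0, v0), (t1, v1) in zip(VRAM_TABLE, VRAM_TABLE[1:]):
        if t0 <= reference_seconds <= t1:
            return v0 + (v1 - v0) * (reference_seconds - t0) / (t1 - t0)
```

For example, `estimate_vram_gb(30)` returns `10.0`, matching the middle row of the table. The actual figure depends on hardware and settings that the table does not cover.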
+
+ ## Usage with Mæstræa
+
+ The Mæstræa AI Workstation backend downloads these models automatically. If you download them manually, place them in:
+
+ ```
+ ~/.maestraea/models/vevo2/
+ ```
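After a manual download, the directory can be checked against the paths listed under "What's in This Repo". This is a minimal sketch, not the backend's own logic: the helper name `missing_entries` is hypothetical, and it assumes the files sit directly under the model directory with the same relative paths as in the table.

```python
from pathlib import Path

# Relative paths copied from the README table; the vocoder entry is a glob pattern.
EXPECTED = [
    "contentstyle_modeling/PhoneToVq8192/model.safetensors",
    "contentstyle_modeling/Vq32ToVq8192/model.safetensors",
    "acoustic_modeling/Vq8192ToMels/model.safetensors",
    "acoustic_modeling/Vocoder/model*.safetensors",
    "tokenizer/vq32",
    "tokenizer/vq8192/model.safetensors",
]


def missing_entries(root="~/.maestraea/models/vevo2"):
    """Return the expected entries that match nothing under `root`."""
    root = Path(root).expanduser()
    # Path.glob handles both the wildcard entry and the literal paths.
    return [pattern for pattern in EXPECTED if not any(root.glob(pattern))]
```

An empty list from `missing_entries()` means every listed path was found. The check looks only at presence, not at file size or integrity.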
+
+ ## License
+
+ MIT, same as the original Amphion/Vevo2 release.
+
+ ## Credits
+
+ - **Model**: [Amphion Vevo2](https://github.com/open-mmlab/Amphion/tree/main/models/vc/vevo2)
+ - **Paper**: See [Amphion repository](https://github.com/open-mmlab/Amphion) for citation
+ - **Mirror by**: [AEmotionStudio](https://huggingface.co/AEmotionStudio)