AEmotionStudio committed on
Commit 92848d0 · verified · 1 Parent(s): 389579e

docs: add README — layout, RAI, attribution

  1. README.md +108 -0
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ tags:
+ - text-to-speech
+ - speech-recognition
+ - tts
+ - asr
+ - voice-cloning
+ - long-form
+ - multi-speaker
+ - streaming
+ - mirror
+ pipeline_tag: text-to-speech
+ library_name: transformers
+ ---
+
+ # VibeVoice — AEmotion Studio Mirror
+
+ This repository is a **MAESTRO-curated mirror** of Microsoft's [VibeVoice](https://github.com/microsoft/VibeVoice) family, with the long-form TTS inference code restored from the [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) fork. All weights, code, and assets remain under the upstream **MIT License**.
+
+ It exists so that MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo using `allow_patterns` filtering, instead of pulling from three separate Microsoft Hugging Face repos plus GitHub.
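
As a rough illustration of per-variant fetching, the sketch below builds an `allow_patterns` list for one subfolder and passes it to `huggingface_hub.snapshot_download`. It is not MAESTRO's actual downloader. The repo id is a placeholder, because this README does not state the mirror's Hub id, and the helper names are hypothetical. The local `matches` check uses the standard-library `fnmatch`, on the assumption that Hub pattern filtering behaves the same way for these simple patterns.

```python
from fnmatch import fnmatch

# Placeholder repo id (assumption): the README does not state the mirror's Hub id.
REPO_ID = "your-namespace/vibevoice-models"

# Subfolder names taken from the layout described in this README.
VARIANTS = ("tts-1.5b", "asr-7b", "realtime-0.5b")


def patterns_for(variant: str) -> list[str]:
    """Return allow_patterns that select a single variant's subfolder."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [f"{variant}/*"]


def matches(path: str, patterns: list[str]) -> bool:
    """Local dry run: would this repo path be kept by the pattern list?"""
    return any(fnmatch(path, pattern) for pattern in patterns)


def fetch_variant(variant: str, local_dir: str) -> str:
    """Download one variant only; requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(variant),
        local_dir=local_dir,
    )
```

With `fnmatch`, `*` also matches `/`, so the pattern `realtime-0.5b/*` covers the nested `voices/` files as well as the top-level config and weights.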
+
+ ## Layout
+
+ ```
+ vibevoice-models/
+ ├── tts-1.5b/ ← microsoft/VibeVoice-1.5B (5.4 GB)
+ │   ├── config.json
+ │   ├── preprocessor_config.json
+ │   ├── model-0000{1..3}-of-00003.safetensors
+ │   └── …
+ ├── asr-7b/ ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
+ │   ├── config.json
+ │   ├── model-0000{1..8}-of-00008.safetensors
+ │   └── …
+ └── realtime-0.5b/ ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
+     ├── config.json
+     ├── preprocessor_config.json
+     ├── model.safetensors
+     └── voices/ ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
+         ├── en-Carter_man.pt
+         ├── en-Frank_man.pt
+         └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
+ ```
+
+ ### About `realtime-0.5b/voices/*.pt`
+
+ These are pickled KV-cache dictionaries: the output of running each voice prompt through the full model once, saved with `torch.save`. They are **not flat tensor maps**, so they cannot be converted to safetensors and must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
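
A minimal sketch of working with these presets follows. The helper names are hypothetical, and the grouping assumes every file follows the `<lang>-<name>.pt` pattern of the two file names shown in the layout. Because `weights_only=False` runs the full unpickler, which can execute code embedded in the file, it is only reasonable for files from a source you trust.

```python
from collections import defaultdict
from pathlib import Path


def group_presets_by_language(voices_dir: str) -> dict[str, list[str]]:
    """Group preset files by the prefix before the first '-' in the file name.

    Assumption: every preset is named `<lang>-<name>.pt`.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(voices_dir).glob("*.pt")):
        lang, _, _ = path.stem.partition("-")
        groups[lang].append(path.name)
    return dict(groups)


def load_preset(path: str):
    """Load one pickled KV-cache preset onto the CPU; requires PyTorch."""
    import torch  # imported lazily so the helper above works without it

    # weights_only=False uses the full unpickler: load trusted files only.
    return torch.load(path, map_location="cpu", weights_only=False)
```

The structure of the returned object is not documented here, so inspect it (for example with `type(...)` and, if it is a dict, its keys) before relying on any particular layout.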
+
+ ### About `asr-7b/`
+
+ This is the legacy research variant of VibeVoice ASR, the one that runs cleanly on `transformers>=4.51.3,<5.0.0`. Microsoft also publishes a `microsoft/VibeVoice-ASR-HF` repo with the cleaner `apply_transcription_request` API, but that variant requires `transformers>=5.3.0`, which is not yet compatible with the rest of MAESTRO's model stack.
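
One way to guard against an incompatible install is to check the version range before loading `asr-7b`. The sketch below uses only the standard library, and the function names are hypothetical. It compares the leading `X.Y.Z` release numbers and ignores pre-release or dev suffixes, which is a simplification of full PEP 440 ordering.

```python
import re
from importlib import metadata

MIN_VERSION = (4, 51, 3)  # inclusive lower bound from the range above
MAX_VERSION = (5, 0, 0)   # exclusive upper bound from the range above


def release_tuple(version: str) -> tuple[int, ...]:
    """Parse the leading 'X.Y' or 'X.Y.Z' of a version string, ignoring suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def supports_asr_7b(version: str) -> bool:
    """True when the version falls inside >=4.51.3,<5.0.0."""
    return MIN_VERSION <= release_tuple(version) < MAX_VERSION


def installed_transformers_supports_asr_7b() -> bool:
    """Check the installed transformers package; False when it is missing."""
    try:
        return supports_asr_7b(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False
```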
+
+ ## Variant capabilities
+
+ | Variant | Task | Languages | Max length | Notes |
+ |---|---|---|---|---|
+ | `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
+ | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
+ | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
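
To illustrate the `Speaker N:` script tags mentioned for `tts-1.5b`, the sketch below renders a list of turns as one tagged line per turn and enforces the four-speaker limit from the table. The function name is hypothetical, and numbering speakers from 1 with one turn per line is an assumption: the table gives only the shape of the tag.

```python
MAX_SPEAKERS = 4  # limit stated in the table above


def build_script(turns: list[tuple[int, str]]) -> str:
    """Render (speaker_number, text) turns as 'Speaker N: text' lines.

    Assumption: speakers are numbered from 1 and each turn is one line.
    """
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers exceeds the limit of {MAX_SPEAKERS}")
    if any(speaker < 1 for speaker in speakers):
        raise ValueError("speaker numbers must be positive")
    return "\n".join(f"Speaker {speaker}: {text.strip()}" for speaker, text in turns)
```

For example, `build_script([(1, "Hello."), (2, "Hi there.")])` produces two lines, one tagged `Speaker 1:` and one tagged `Speaker 2:`.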
+
+ ## Responsible use (verbatim from upstream model cards)
+
+ > **VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.**
+ >
+ > The following are explicitly out of scope:
+ > - Voice impersonation without explicit, recorded consent
+ > - Disinformation or impersonation
+ > - Real-time or low-latency voice conversion for live deep-fakes
+ > - Generation in unsupported languages (non-English, non-Chinese)
+ > - Generation of background ambience, Foley, or music
+ > - Circumventing the watermark or audible disclaimer
+ >
+ > **We do not recommend using VibeVoice in commercial or real-world applications without further testing and development.** This model is intended for research and development purposes only.
+
+ To mitigate misuse, Microsoft has:
+ - Embedded an **audible disclaimer** ("This segment was generated by AI") in TTS outputs.
+ - Added an **imperceptible perceptual watermark** to all generated audio.
+ - Logged inference requests (hashed) for abuse-pattern detection.
+
+ These mitigations are baked into the released weights and are preserved in this mirror.
+
+ ## Attribution
+
+ | Component | Source | License |
+ |---|---|---|
+ | Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+ | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
+ | Inference code (TTS-1.5B) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
+
+ ## Citation
+
+ ```bibtex
+ @misc{peng2025vibevoicetechnicalreport,
+   title = {VibeVoice Technical Report},
+   author = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
+             Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
+             Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
+   year = {2025},
+   eprint = {2508.19205},
+   archivePrefix = {arXiv},
+   primaryClass = {cs.CL},
+   url = {https://arxiv.org/abs/2508.19205}
+ }
+ ```