AEmotionStudio committed on
Commit 72d8020 · verified · 1 Parent(s): 91b0e88

docs: README — add tts-large/ subfolder + aoi-ot attribution

Files changed (1)
  1. README.md +21 -4
README.md CHANGED
@@ -27,11 +27,17 @@ It exists so MAESTRO's downloader can fetch each variant (and its dependencies)
 
  ```
  vibevoice-models/
- ├── tts-1.5b/ ← microsoft/VibeVoice-1.5B (5.4 GB)
  │ ├── config.json
  │ ├── preprocessor_config.json
  │ ├── model-0000{1..3}-of-00003.safetensors
  │ └── …
  ├── asr-7b/ ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
  │ ├── config.json
  │ ├── model-0000{1..8}-of-00008.safetensors
@@ -46,6 +52,15 @@ vibevoice-models/
  ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
  ```
 
  ### About `realtime-0.5b/voices/*.pt`
 
  These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
@@ -58,7 +73,8 @@ This is the legacy research variant of VibeVoice ASR — the one that runs clean
 
  | Variant | Task | Languages | Max length | Notes |
  |---|---|---|---|---|
- | `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags |
  | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
  | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
 
@@ -87,9 +103,10 @@ These mitigations are baked into the released weights and are preserved in this
 
  | Component | Source | License |
  |---|---|---|
- | Model weights | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
  | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
- | Inference code (TTS-1.5B) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 
  ## Citation
 
 
 
  ```
  vibevoice-models/
+ ├── tts-1.5b/ ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
  │ ├── config.json
  │ ├── preprocessor_config.json
  │ ├── model-0000{1..3}-of-00003.safetensors
  │ └── …
+ ├── tts-large/ ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
+ │ ├── config.json
+ │ ├── preprocessor_config.json
+ │ ├── configuration.json
+ │ ├── model-0000{1..10}-of-00010.safetensors
+ │ └── …
  ├── asr-7b/ ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
  │ ├── config.json
  │ ├── model-0000{1..8}-of-00008.safetensors
 
  ├── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
  ```
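
The brace ranges in the tree above (`model-0000{1..3}-of-00003.safetensors` and similar) are shorthand for the shard files in each folder. A minimal Python sketch of expanding that shorthand, assuming the usual five-digit zero-padded shard naming; `shard_names` is a hypothetical helper, not part of MAESTRO or this repo:

```python
def shard_names(total: int, stem: str = "model", ext: str = "safetensors") -> list[str]:
    """Return the expected shard filenames for a checkpoint split into `total` parts.

    Assumes the five-digit zero-padded pattern seen in the tree,
    e.g. model-00001-of-00003.safetensors.
    """
    if total < 1:
        raise ValueError("total must be at least 1")
    return [f"{stem}-{i:05d}-of-{total:05d}.{ext}" for i in range(1, total + 1)]


# Shard counts as listed in the tree: tts-1.5b = 3, tts-large = 10, asr-7b = 8.
for folder, count in {"tts-1.5b": 3, "tts-large": 10, "asr-7b": 8}.items():
    names = shard_names(count)
    print(folder, names[0], "...", names[-1])
```

Under this assumption the index stays five digits wide, so the last `tts-large/` shard would be `model-00010-of-00010.safetensors`.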
 
+ ### About `tts-large/`
+
+ This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft originally published it at `microsoft/VibeVoice-Large` with full MIT license, then **removed the repo on 2025-09-05** along with the demo scripts (same RAI cleanup that removed `modeling_vibevoice_inference.py`). The MIT license remains in force on the released weights — this mirror sources from [`aoi-ot/VibeVoice-Large`](https://huggingface.co/aoi-ot/VibeVoice-Large), a community preserve uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
+
+ **Differences from `tts-1.5b/`:**
+ - Same `Speaker N:` script format, same voice-cloning workflow (pass per-speaker reference clips)
+ - Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
+ - **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
+
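
The context and duration figures in the list above scale together: halving the context from 64K to 32K halves the stated maximum from ~90 min to ~45 min. A rough Python sketch of that proportionality, assuming output length scales linearly with context and that "64K" means 65,536 tokens (neither is stated in the README); `estimate_max_minutes` and `size_ratio` are hypothetical helpers:

```python
# Reference point taken from the README: tts-1.5b, 64K context, ~90 min max output.
REFERENCE_CONTEXT_TOKENS = 64 * 1024  # assumption: "64K" = 65,536 tokens
REFERENCE_MAX_MINUTES = 90.0


def estimate_max_minutes(context_tokens: int) -> float:
    """Scale the README's 64K -> ~90 min figure linearly to another context size."""
    return context_tokens * REFERENCE_MAX_MINUTES / REFERENCE_CONTEXT_TOKENS


def size_ratio(large_gb: float, small_gb: float) -> float:
    """Ratio of on-disk weight sizes, rounded to two decimals."""
    return round(large_gb / small_gb, 2)


print(estimate_max_minutes(32 * 1024))  # tts-large, 32K context
print(size_ratio(17.6, 5.4))            # tts-large vs tts-1.5b weights
```

Under these assumptions the 32K estimate is 45 minutes, matching the README, and the weight ratio comes out near 3.26, consistent with the "~3×" figure.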
  ### About `realtime-0.5b/voices/*.pt`
 
  These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then `torch.save`'d). They are **not flat tensor maps** so they cannot be converted to safetensors — they must be loaded with `torch.load(..., weights_only=False)`. Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
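
Because the presets are pickled dictionaries rather than flat tensor maps, loading one means opting out of PyTorch's restricted unpickler. A minimal sketch, assuming PyTorch is installed and the files come from a source you trust (unpickling can execute arbitrary code); `list_voice_presets` and `load_voice_preset` are hypothetical helper names, the directory path is illustrative, and the structure of the loaded object is not documented here:

```python
from pathlib import Path


def list_voice_presets(voices_dir: str) -> list[Path]:
    """Return the .pt preset files in a voices directory, sorted by name."""
    return sorted(Path(voices_dir).glob("*.pt"))


def load_voice_preset(path: Path):
    """Load one pickled KV-cache preset onto the CPU.

    weights_only=False permits arbitrary pickled objects, so only use this
    on files from a trusted source.
    """
    import torch  # imported lazily so listing presets works without PyTorch

    return torch.load(path, map_location="cpu", weights_only=False)


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the mirror was downloaded.
    presets = list_voice_presets("vibevoice-models/realtime-0.5b/voices")
    print(f"found {len(presets)} preset(s)")
    if presets:
        preset = load_voice_preset(presets[0])
        print(type(preset))
```

Loading to the CPU first keeps the sketch independent of any particular GPU setup; moving the cached tensors to a device is left to the inference code.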
 
 
  | Variant | Task | Languages | Max length | Notes |
  |---|---|---|---|---|
+ | `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
+ | `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
  | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
  | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
 
 
 
  | Component | Source | License |
  |---|---|---|
+ | Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
+ | Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
  | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
+ | Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 
  ## Citation