OpenMOSS-Team/MOSS-TTS-Local-Transformer


fdugyt committed on Feb 9

Commit a9555b0 · verified · 1 Parent(s): 94a9ba0

update readme

Files changed (1)

README.md +12 -11


@@ -423,17 +423,18 @@ with torch.no_grad():
-### Generation Hyperparameters
-| Parameter | Type | Default | Description |
-|---|---|---:|---|
-| `max_new_tokens` | `int` | — | Controls total generated audio tokens. Use duration rule: **1s ≈ 12.5 tokens**. |
-| `audio_temperature` | `float` | 1.7 | Higher values increase variation; lower values stabilize prosody. |
-| `audio_top_p` | `float` | 0.8 | Nucleus sampling cutoff. Lower values are more conservative. |
-| `audio_top_k` | `int` | 25 | Top-K sampling. Lower values tighten sampling space. |
-| `audio_repetition_penalty` | `float` | 1.0 | >1.0 discourages repeating patterns. |
-> Note: MOSS-TTS is a pretrained base model and is **sensitive to decoding hyperparameters**. See **Released Models** for recommended defaults.

+### Generation Hyperparameters (MOSS-TTS-Local)
+MOSS-TTS-Local uses `DelayGenerationConfig` to manage hierarchical sampling. Because it was trained with **Progressive Sequence Dropout**, the model supports variable-bitrate inference by adjusting the RVQ depth.
+| Parameter | Type | Recommended (Audio Layers) | Description |
+| :--- | :--- | :---: | :--- |
+| `max_new_tokens` | `int` | — | Controls total generated audio tokens. **1s ≈ 12.5 tokens**. |
+| `n_vq_for_inference` | `int` | 32 | **RVQ Inference Depth**: Controls the number of codebook layers generated. Higher values (max 32) improve audio fidelity but slow down inference; lower values speed up inference but reduce audio quality. |
+| `audio_temperature` | `float` | 1.0 | Temperature for audio token layers (Layer 1+). Lower values ensure more stable and consistent acoustic reconstruction. |
+| `audio_top_p` | `float` | 0.95 | Nucleus sampling cutoff for audio layers. |
+| `audio_top_k` | `int` | 50 | Top-K sampling filter for audio layers. |
+| `audio_repetition_penalty` | `float` | 1.1 | Discourages repeating acoustic patterns. Values > 1.0 help prevent artifacts in long-form synthesis. |
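
The duration rule and the recommended values in the table above can be collected in a small helper. This is a minimal sketch rather than the project's API: `tokens_for_duration` and `recommended_settings` are hypothetical names, and the resulting dictionary is plain Python. How these values are actually passed to `DelayGenerationConfig` or to the model's generation call is not shown in this diff, so that step is left out. The sketch also assumes the rule **1s ≈ 12.5 tokens** applies directly to `max_new_tokens`, as the table states.

```python
import math

# Values taken from the README table in this commit.
TOKENS_PER_SECOND = 12.5  # duration rule: 1 s of audio is about 12.5 tokens
MAX_RVQ_DEPTH = 32        # stated maximum for n_vq_for_inference


def tokens_for_duration(seconds: float) -> int:
    """Hypothetical helper: token budget for a target audio duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Round up so the budget never falls short of the requested duration.
    return math.ceil(seconds * TOKENS_PER_SECOND)


def recommended_settings(seconds: float, n_vq: int = MAX_RVQ_DEPTH) -> dict:
    """Hypothetical helper: the table's recommended values as a plain dict.

    A lower n_vq trades audio fidelity for faster inference.
    """
    if not 1 <= n_vq <= MAX_RVQ_DEPTH:
        raise ValueError(f"n_vq must be between 1 and {MAX_RVQ_DEPTH}")
    return {
        "max_new_tokens": tokens_for_duration(seconds),
        "n_vq_for_inference": n_vq,
        "audio_temperature": 1.0,
        "audio_top_p": 0.95,
        "audio_top_k": 50,
        "audio_repetition_penalty": 1.1,
    }


if __name__ == "__main__":
    # Budget for a 10-second clip at full RVQ depth.
    print(recommended_settings(10.0))
    # Same clip with a shallower RVQ depth for faster, lower-fidelity output.
    print(recommended_settings(10.0, n_vq=16))
```

Rounding up with `math.ceil` is a design choice of this sketch: it avoids cutting off the end of the audio when the target duration does not map to a whole number of tokens.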