Update README.md
README.md
CHANGED
|
@@ -27,67 +27,97 @@ Prompts were selected to intentionally stress known diffusion and caching weakne
|
|
| 27 |
|
| 28 |
# Test Prompts & Rationale
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
Why:
|
| 36 |
-
Stable Audio Open uses a timing-conditioned latent diffusion model to handle variable-length audio. Identity coefficients may fail to capture transient timing precisely when skipping steps, potentially causing rhythmic jitter or "beat drift".
|
| 37 |
-
Evaluation Criteria:
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
|
|
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
| 52 |
-
Why:
|
| 53 |
-
The model targets high-fidelity 44.1kHz output with energy in high frequencies. Identity coefficients may over-smooth these peaky regions, turning sharp "shards" into generic white-noise "whooshes."
|
| 54 |
-
Evaluation Criteria:
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
-
|
|
|
|
| 63 |
|
| 64 |
-
|
| 65 |
|
| 66 |
-
|
| 67 |
-
Why:
|
| 68 |
-
TeaCache reuses transformer residuals from previous steps to provide a 1.5x-2.0x speedup. Slow-moving signals may suffer from "ghosting" or rhythmic "wobbles" if the cache is updated too aggressively.
|
| 69 |
-
Evaluation Criteria:
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
| 80 |
|
| 81 |
-
|
| 82 |
-
Why:
|
| 83 |
-
Stable Audio Open specifically finds it challenging to generate prompts with "connectors" and may omit sounds or merge layers. This tests if caching exacerbates the "muddying" of foreground speech against background noise.
|
| 84 |
-
Evaluation Criteria:
|
| 85 |
|
| 86 |
-
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
|
| 90 |
-
Dynamic variance
|
| 91 |
|
| 92 |
---
|
| 93 |
|
|
|
|
| 27 |
|
| 28 |
# Test Prompts & Rationale
|
| 29 |
|
| 30 |
+
The following prompts were intentionally designed to stress different structural and spectral weaknesses under TeaCache acceleration.
|
| 31 |
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
## 1️⃣ Rhythmic "Jitter" Test
|
| 35 |
+
|
| 36 |
+
**Prompt:**
|
| 37 |
+
|
| 38 |
+
> "140 BPM sharp techno drum loop, heavy kick, crisp hi-hats, 44.1kHz."
|
| 39 |
+
|
| 40 |
+
**Why:**
|
| 41 |
+
|
| 42 |
+
Stable Audio Open uses a timing-conditioned latent diffusion model to generate variable-length audio.
|
| 43 |
+
When TeaCache skips updates, Identity coefficients may fail to preserve transient alignment precisely, potentially introducing rhythmic jitter or subtle beat drift.
|
| 44 |
+
|
| 45 |
+
**Evaluation Criteria:**
|
| 46 |
+
|
| 47 |
+
- BPM accuracy
|
| 48 |
+
- Timing jitter (ms)
|
| 49 |
+
- Peak consistency (kick uniformity)
|
| 50 |
+
- High-frequency crispness
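
The BPM accuracy and timing jitter criteria can both be derived from detected kick onsets. The following is a minimal sketch, assuming the onset times (in seconds) have already been extracted by some onset detector and that there is one onset per beat. The function name `bpm_and_jitter` and the choice of jitter as the standard deviation of inter-onset intervals are illustrative assumptions, not the evaluation code behind this README.

```python
from statistics import mean, pstdev


def bpm_and_jitter(onset_times):
    """Estimate tempo and timing jitter from onset times given in seconds.

    Assumes one onset per beat (for example a four-on-the-floor kick).
    Returns (bpm, jitter_ms), where jitter is the population standard
    deviation of the inter-onset intervals, in milliseconds.
    """
    if len(onset_times) < 3:
        raise ValueError("need at least three onsets")
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    bpm = 60.0 / mean(intervals)
    jitter_ms = 1000.0 * pstdev(intervals)
    return bpm, jitter_ms


# Hypothetical kick onsets: perfectly steady at 140 BPM.
steady = [i * 60.0 / 140.0 for i in range(8)]
steady_bpm, steady_jitter_ms = bpm_and_jitter(steady)

# Same grid, but every second hit lands 5 ms late (simulated beat jitter).
shaky = [t + (0.005 if i % 2 else 0.0) for i, t in enumerate(steady)]
shaky_bpm, shaky_jitter_ms = bpm_and_jitter(shaky)
```

Comparing the jitter value of a TeaCache run against the uncached baseline for the same seed would show whether skipped steps loosen the beat grid.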
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## 2️⃣ High-Frequency Clarity (Transient Test)
|
| 55 |
|
| 56 |
+
**Prompt:**
|
| 57 |
|
| 58 |
+
> "Glass shattering on a concrete floor, high-pitched shards, long reverb tail."
|
| 59 |
|
| 60 |
+
**Why:**
|
| 61 |
|
| 62 |
+
The model targets high-fidelity 44.1kHz stereo synthesis, where high-frequency energy is critical.
|
| 63 |
+
Identity coefficients may over-smooth sharp spectral peaks, turning crisp impacts into broadband noise-like "whooshes."
|
| 64 |
|
| 65 |
+
**Evaluation Criteria:**
|
| 66 |
|
| 67 |
+
- Transient detection count
|
| 68 |
+
- Average peak width (sharpness)
|
| 69 |
+
- Spectral high-frequency ratio
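
One way to read the spectral high-frequency ratio is as the share of signal energy at or above a cutoff frequency. The sketch below is an assumption about how such a ratio could be computed, not the metric definition used for these tests. The function name `high_freq_ratio` and the cutoff value are hypothetical, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def high_freq_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (0.0 to 1.0).

    Uses a naive O(n^2) DFT so the sketch needs only the standard library;
    for real audio an FFT library would be the practical choice.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # skip DC, stop at the Nyquist bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


sr = 8000  # deliberately low rate so the naive DFT stays fast
n = 256
low_tone = [math.sin(2 * math.pi * 250 * t / sr) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * t / sr) for t in range(n)]

low_ratio = high_freq_ratio(low_tone, sr, cutoff_hz=2000)
high_ratio = high_freq_ratio(high_tone, sr, cutoff_hz=2000)
```

A drop in this ratio between the baseline and the cached output would be consistent with the over-smoothing described above.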
|
| 70 |
|
| 71 |
+
---
|
| 72 |
|
| 73 |
+
## 3️⃣ Ghosting & Artifact (Drone Test)
|
| 74 |
|
| 75 |
+
**Prompt:**
|
| 76 |
|
| 77 |
+
> "Ambient drone, slow evolving cinematic pads, deep bass, mystical atmosphere."
|
| 78 |
|
| 79 |
+
**Why:**
|
| 80 |
|
| 81 |
+
TeaCache reuses transformer residuals from previous timesteps to achieve ~1.5×–2.0× speedup.
|
| 82 |
+
If the cache update threshold is too aggressive, slow-moving, low-variation signals may suffer from:

- Residual ghosting
- Rhythmic wobble
- Amplitude pumping
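
The skip decision behind this trade-off can be sketched as follows. This is a simplified illustration of threshold-based caching as it is commonly described for TeaCache, not the code used in this repository: the function name `teacache_schedule`, the input list of relative changes, and the rule that the first and last steps are always recomputed are assumptions made for the example. The default `rescale` is the identity mapping, which is what using identity coefficients amounts to here.

```python
def teacache_schedule(rel_changes, threshold, rescale=lambda x: x):
    """Decide per step whether to recompute (True) or reuse the cache (False).

    rel_changes[i] is the relative L1 change of the model input at step i
    versus step i - 1 (entry 0 is unused). `rescale` maps that input change
    to an estimated output change; the default is the identity mapping.
    """
    decisions = []
    accumulated = 0.0
    last = len(rel_changes) - 1
    for step, change in enumerate(rel_changes):
        if step == 0 or step == last:
            recompute = True  # assumption: always refresh at the boundaries
        else:
            accumulated += rescale(change)
            recompute = accumulated >= threshold
        if recompute:
            accumulated = 0.0  # cache is fresh again, restart the estimate
        decisions.append(recompute)
    return decisions


# Hypothetical slow-moving signal: the input changes by 5% at every step.
changes = [0.0, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
conservative = teacache_schedule(changes, threshold=0.04)
aggressive = teacache_schedule(changes, threshold=0.18)
```

With the lower threshold every step is recomputed, while the higher threshold reuses the cached residual for several consecutive steps. Long reuse runs like these are where ghosting or pumping would be expected to show up in a drone.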
|
| 89 |
|
| 90 |
+
**Evaluation Criteria:**
|
| 91 |
|
| 92 |
+
- Amplitude variance (pulsing detection)
|
| 93 |
+
- Spectral flatness
|
| 94 |
+
- High-frequency noise ratio
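
Amplitude variance for pulsing detection can be approximated by measuring how much the short-term loudness envelope fluctuates. The sketch below assumes a frame-wise RMS envelope summarized by its coefficient of variation. The function name `envelope_variation`, the frame length, and the synthetic test signals are illustrative choices rather than the settings used for these tests.

```python
import math
from statistics import mean, pstdev


def envelope_variation(samples, frame_len):
    """Coefficient of variation of the per-frame RMS envelope.

    Values near 0 mean a steady level; larger values indicate pulsing.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    rms = [math.sqrt(mean([x * x for x in frame])) for frame in frames]
    avg = mean(rms)
    return pstdev(rms) / avg if avg > 0 else 0.0


sr = 1000  # low rate keeps the synthetic example small
drone = [math.sin(2 * math.pi * 50 * t / sr) for t in range(2 * sr)]

# Same drone with a 2 Hz tremolo, standing in for cache-induced pumping.
pumped = [x * (1.0 + 0.5 * math.sin(2 * math.pi * 2 * t / sr))
          for t, x in enumerate(drone)]

steady_cv = envelope_variation(drone, frame_len=100)
pumped_cv = envelope_variation(pumped, frame_len=100)
```

A frame length that spans whole periods of the lowest expected pitch keeps the envelope of a clean drone flat, so any remaining variation points to level modulation.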
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
|
| 98 |
+
## 4️⃣ Complex Layering Test
|
| 99 |
|
| 100 |
+
**Prompt:**
|
| 101 |
|
| 102 |
+
> "A man speaking in a crowded cafe with clinking plates and background jazz."
|
| 103 |
|
| 104 |
+
**Why:**
|
| 105 |
|
| 106 |
+
Stable Audio Open is known to struggle with prompts that use connectors (words such as "and" or "with" that combine several sound sources in one prompt).
|
| 107 |
+
This test evaluates whether TeaCache acceleration exacerbates:
|
| 108 |
|
| 109 |
+
- Foreground/background blending
|
| 110 |
+
- Stereo image collapse
|
| 111 |
+
- Layer loss
|
| 112 |
+
|
| 113 |
+
**Evaluation Criteria:**
|
| 114 |
+
|
| 115 |
+
- Stereo width (correlation)
|
| 116 |
+
- Speech presence band (300Hz–3kHz energy)
|
| 117 |
+
- Dynamic variance (layer separation clarity)
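
Stereo width via correlation can be sketched as the Pearson correlation between the left and right channels: values near +1 indicate a collapsed, mono-like image, and values near 0 indicate a wide one. The function name `stereo_correlation` and the synthetic channels below are assumptions for illustration only.

```python
import math


def stereo_correlation(left, right):
    """Pearson correlation between two equally long channels.

    Returns 0.0 if either channel is constant (correlation undefined).
    """
    n = len(left)
    if n != len(right) or n < 2:
        raise ValueError("channels must have the same length (at least 2)")
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    if var_l == 0 or var_r == 0:
        return 0.0
    return cov / math.sqrt(var_l * var_r)


n = 400
left = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
right_mono = list(left)                                            # identical
right_wide = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]  # 90 degrees apart

mono_corr = stereo_correlation(left, right_mono)
wide_corr = stereo_correlation(left, right_wide)
```

If the cached output shows a markedly higher correlation than the baseline for the same seed, that would suggest the stereo image is collapsing.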
|
| 118 |
+
|
| 119 |
+
---
|
| 120 |
|
|
|
|
| 121 |
|
| 122 |
---
|
| 123 |
|