Akshat/TeaCache-flux-identity-StableAudioOpen

Text-to-Audio

English


Akshat committed on Feb 12

Commit 23d04f6 (verified) · 1 parent: d9b19fc

Update README.md

Files changed (1)

README.md +67 -37

@@ -27,67 +27,97 @@ Prompts were selected to intentionally stress known diffusion and caching weakne
 # Test Prompts & Rationale
-1️⃣ Rhythmic "Jitter" Test
-Prompt:
-    "140 BPM sharp techno drum loop, heavy kick, crisp hi-hats, 44.1kHz."
-    Why:
-    Stable Audio Open uses a timing-conditioned latent diffusion model to handle variable-length audio. Identity coefficients may fail to capture transient timing precisely when skipping steps, potentially causing rhythmic jitter or "beat drift".
-    Evaluation Criteria:
-    BPM accuracy
-    Timing jitter
-    Peak consistency
-    High-frequency crispness
-2️⃣ High-Frequency Clarity (Transient Test)
-Prompt:
-    "Glass shattering on a concrete floor, high-pitched shards, long reverb tail."
-    Why:
-    The model targets high-fidelity 44.1kHz output with energy in high frequencies. Identity coefficients may over-smooth these peaky regions, turning sharp "shards" into generic white-noise "whooshes."
-    Evaluation Criteria:
-    Transient detection count
-    Peak width (sharpness)
-    Spectral high-frequency ratio
-3️⃣ Ghosting & Artifact (Drone Test)
-Prompt:
-    "Ambient drone, slow evolving cinematic pads, deep bass, mystical atmosphere."
-    Why:
-    TeaCache reuses transformer residuals from previous steps to provide a 1.5x-2.0x speedup. Slow-moving signals may suffer from "ghosting" or rhythmic "wobbles" if the cache is updated too aggressively.
-    Evaluation Criteria:
-    Amplitude variance (Pulsing)
-    Spectral flatness
-    High-frequency noise
-4️⃣ Complex Layering Test
-Prompt:
-    "A man speaking in a crowded cafe with clinking plates and background jazz."
-    Why:
-    Stable Audio Open specifically finds it challenging to generate prompts with "connectors" and may omit sounds or merge layers. This tests if caching exacerbates the "muddying" of foreground speech against background noise.
-    Evaluation Criteria:
-    Stereo width
-    Speech presence band (300Hz–3kHz)
-    Dynamic variance
 ---

 # Test Prompts & Rationale
+The following prompts were intentionally designed to stress different structural and spectral weaknesses under TeaCache acceleration.
+---
+## 1️⃣ Rhythmic "Jitter" Test
+**Prompt:**
+> "140 BPM sharp techno drum loop, heavy kick, crisp hi-hats, 44.1kHz."
+**Why:**
+Stable Audio Open uses a timing-conditioned latent diffusion model to generate variable-length audio.
+When TeaCache skips updates, Identity coefficients may fail to preserve transient alignment precisely, potentially introducing rhythmic jitter or subtle beat drift.
+**Evaluation Criteria:**
+- BPM accuracy
+- Timing jitter (ms)
+- Peak consistency (kick uniformity)
+- High-frequency crispness
+---
+## 2️⃣ High-Frequency Clarity (Transient Test)
+**Prompt:**
+> "Glass shattering on a concrete floor, high-pitched shards, long reverb tail."
+**Why:**
+The model targets high-fidelity 44.1kHz stereo synthesis, where high-frequency energy is critical.
+Identity coefficients may over-smooth sharp spectral peaks, turning crisp impacts into broadband noise-like "whooshes."
+**Evaluation Criteria:**
+- Transient detection count
+- Average peak width (sharpness)
+- Spectral high-frequency ratio
+---
+## 3️⃣ Ghosting & Artifact (Drone Test)
+**Prompt:**
+> "Ambient drone, slow evolving cinematic pads, deep bass, mystical atmosphere."
+**Why:**
+TeaCache reuses transformer residuals from previous timesteps to achieve ~1.5×–2.0× speedup.
+If the cache update threshold is too aggressive, slow-moving, low-variation signals may suffer from:
+- Residual ghosting
+- Rhythmic wobble
+- Amplitude pumping
+**Evaluation Criteria:**
+- Amplitude variance (pulsing detection)
+- Spectral flatness
+- High-frequency noise ratio
+---
+## 4️⃣ Complex Layering Test
+**Prompt:**
+> "A man speaking in a crowded cafe with clinking plates and background jazz."
+**Why:**
+Stable Audio Open is known to struggle with prompts involving connectors (multiple simultaneous sound sources).
+This test evaluates whether TeaCache acceleration exacerbates:
+- Foreground/background blending
+- Stereo image collapse
+- Layer loss
+**Evaluation Criteria:**
+- Stereo width (correlation)
+- Speech presence band (300Hz–3kHz energy)
+- Dynamic variance (layer separation clarity)
+---
 ---
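
Several of the evaluation criteria listed in the README (timing jitter in ms, spectral high-frequency ratio, spectral flatness, amplitude variance, stereo correlation) can be written down as short functions. The following is a minimal sketch of one plausible definition for each, not the evaluation code used for this model card. All function names, the frame size, and the cutoff parameter are hypothetical, and it assumes onset times have already been detected elsewhere. It uses only the Python standard library with a naive O(n²) DFT, so it is suitable for short excerpts only.

```python
import cmath
import math
import statistics


def timing_jitter_ms(onsets_s, bpm):
    """RMS deviation (ms) of inter-onset intervals from the nominal beat period."""
    if len(onsets_s) < 2:
        raise ValueError("need at least two onsets")
    period = 60.0 / bpm  # seconds per beat at the requested tempo
    intervals = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    squared_ms = [((iv - period) * 1000.0) ** 2 for iv in intervals]
    return math.sqrt(sum(squared_ms) / len(squared_ms))


def power_spectrum(samples):
    """One-sided power spectrum via a naive DFT (bins 0 .. n/2)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))) ** 2
        for k in range(n // 2 + 1)
    ]


def high_frequency_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of spectral power at or above cutoff_hz."""
    spec = power_spectrum(samples)
    n = len(samples)
    total = sum(spec)
    high = sum(p for k, p in enumerate(spec) if k * sample_rate / n >= cutoff_hz)
    return high / total if total > 0 else 0.0


def spectral_flatness(samples, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (DC skipped)."""
    spec = [p + eps for p in power_spectrum(samples)[1:]]
    log_mean = sum(math.log(p) for p in spec) / len(spec)
    return math.exp(log_mean) / (sum(spec) / len(spec))


def rms_envelope_variance(samples, frame=1024):
    """Variance of per-frame RMS; larger values suggest pulsing or pumping."""
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, frame)]
    if not frames:
        raise ValueError("signal shorter than one frame")
    rms = [math.sqrt(sum(x * x for x in f) / frame) for f in frames]
    return statistics.pvariance(rms)


def stereo_correlation(left, right):
    """Pearson correlation of the two channels (+1 mono, near 0 wide, -1 inverted)."""
    ml, mr = sum(left) / len(left), sum(right) / len(right)
    cov = sum((l - ml) * (r - mr) for l, r in zip(left, right))
    den = math.sqrt(sum((l - ml) ** 2 for l in left)
                    * sum((r - mr) ** 2 for r in right))
    return cov / den if den > 0 else 0.0
```

Under these definitions, a comparison would run each function on the baseline output and on the TeaCache-accelerated output for the same prompt and seed, then report the difference. For full-length 44.1kHz clips, an FFT-based implementation would be needed in place of the naive DFT.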