Identity vs TangoFlux Coefficients with TeaCache Acceleration

Stable Audio Open 1.0 Evaluation (Tesla T4)

Abstract

This experiment evaluates Identity vs TangoFlux coefficients under TeaCache acceleration for Stable Audio Open 1.0.

Initial benchmarks showed that Identity provided the better raw speedup.
To investigate the quality trade-offs behind that speedup, we ran structured stress tests across rhythmic, spectral, temporal, and multi-layer scenarios.

Prompts were selected to intentionally stress known diffusion and caching weaknesses.

Each generation was rated by feeding the output audio to Gemini 3 (thinking) and asking it to score the result out of 10.

Test Prompts & Rationale

The following prompts were intentionally designed to stress different structural and spectral weaknesses under TeaCache acceleration.

1️⃣ Rhythmic "Jitter" Test

Prompt:

"140 BPM sharp techno drum loop, heavy kick, crisp hi-hats, 44.1kHz."

Why:

Stable Audio Open uses a timing-conditioned latent diffusion model to generate variable-length audio.
When TeaCache skips updates, Identity coefficients may fail to preserve transient alignment precisely, potentially introducing rhythmic jitter or subtle beat drift.

Evaluation Criteria:

BPM accuracy
Timing jitter (ms)
Peak consistency (kick uniformity)
High-frequency crispness
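The write-up does not say how BPM and timing jitter were computed. As one plausible reading, here is a minimal sketch that assumes the kick onset times (in seconds) have already been extracted by some onset detector. Tempo is taken from the median inter-onset interval, and jitter is the standard deviation of the intervals around that median. The function name `tempo_and_jitter` and the `beats_per_onset` parameter are hypothetical.

```python
import statistics

def tempo_and_jitter(onsets_s, beats_per_onset=1.0):
    """Estimate BPM and timing jitter (ms) from sorted onset times in seconds.

    Assumption: the onsets are nominally evenly spaced hits (for example
    four-on-the-floor kicks), each `beats_per_onset` beats apart. This is an
    illustration, not necessarily the metric behind the reported numbers.
    """
    if len(onsets_s) < 3:
        raise ValueError("need at least three onsets")
    intervals = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    period = statistics.median(intervals)      # robust to a few stray hits
    bpm = 60.0 * beats_per_onset / period
    # Jitter: spread of the intervals around the median period, in milliseconds.
    jitter_ms = 1000.0 * statistics.pstdev([i - period for i in intervals])
    return bpm, jitter_ms

# A perfectly regular 140 BPM kick pattern: one hit every 60/140 seconds.
grid = [n * 60.0 / 140.0 for n in range(16)]
bpm, jitter_ms = tempo_and_jitter(grid)
```

On a perfectly regular grid the jitter is essentially zero, so any value in the tens of milliseconds, as in the stress runs, reflects real timing spread.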

2️⃣ High-Frequency Clarity (Transient Test)

Prompt:

"Glass shattering on a concrete floor, high-pitched shards, long reverb tail."

Why:

The model targets high-fidelity 44.1kHz stereo synthesis, where high-frequency energy is critical.
Identity coefficients may over-smooth sharp spectral peaks, turning crisp impacts into broadband noise-like "whooshes."

Evaluation Criteria:

Transient detection count
Average peak width (sharpness)
Spectral high-frequency ratio
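A spectral high-frequency ratio can be sketched as the share of spectral energy above a cutoff. The 8 kHz cutoff and the function name `hf_energy_ratio` are assumptions for illustration; the source does not state which cutoff or windowing was used.

```python
import numpy as np

def hf_energy_ratio(signal, sample_rate, cutoff_hz=8000.0):
    """Fraction of spectral energy at or above `cutoff_hz` (0.0 to 1.0).

    Assumption: a single FFT over the whole clip, with no windowing.
    """
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0.0:
        return 0.0
    return float(power[freqs >= cutoff_hz].sum() / total)

sr = 44100
t = np.arange(sr) / sr                       # one second of samples
low = np.sin(2 * np.pi * 440.0 * t)          # energy well below the cutoff
high = np.sin(2 * np.pi * 12000.0 * t)       # energy well above the cutoff
ratio_low = hf_energy_ratio(low, sr)
ratio_mix = hf_energy_ratio(low + high, sr)
```

With equal-amplitude tones on either side of the cutoff, the mixed signal splits its energy roughly in half, which is the kind of shift an over-smoothed transient would pull back toward zero.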

3️⃣ Ghosting & Artifact (Drone Test)

Prompt:

"Ambient drone, slow evolving cinematic pads, deep bass, mystical atmosphere."

Why:

TeaCache reuses transformer residuals from previous timesteps to achieve ~1.5×–2.0× speedup.
If the cache update threshold is too aggressive, slow-moving, low-variation signals may suffer from:

Residual ghosting
Rhythmic wobble
Amplitude pumping
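The caching decision described above can be sketched as follows. This follows the general TeaCache idea as commonly described: a polynomial rescales the relative L1 change of the model input between steps, the rescaled values are accumulated, and the cached residual is reused until the sum crosses the threshold. It is a simplified illustration, not this repository's code. The assumption that "Identity" means a polynomial that returns its input unchanged (`[1.0, 0.0]`) is mine, and the TangoFlux coefficient values are not given in this document, so they are not reproduced.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) with Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def plan_steps(rel_l1_changes, coeffs, threshold):
    """Return one boolean per step: True = run the transformer, False = reuse cache.

    rel_l1_changes[i] is the relative L1 change of the input between step i
    and step i + 1. The first and last steps are always computed.
    """
    n_steps = len(rel_l1_changes) + 1
    decisions = [True]                       # first step: nothing cached yet
    accumulated = 0.0
    for i, change in enumerate(rel_l1_changes, start=1):
        accumulated += polyval(coeffs, change)
        if i == n_steps - 1 or accumulated >= threshold:
            decisions.append(True)           # recompute and reset the accumulator
            accumulated = 0.0
        else:
            decisions.append(False)          # skip: reuse the cached residual
    return decisions

IDENTITY = [1.0, 0.0]                        # assumed: rescaled value == raw value
changes = [0.07] * 9                         # toy 10-step run with constant change
conservative = plan_steps(changes, IDENTITY, threshold=0.2)
aggressive = plan_steps(changes, IDENTITY, threshold=0.6)
```

In this toy run the conservative threshold computes 4 of 10 steps and the aggressive one only 2, which shows why a higher threshold buys speed at the cost of longer stretches of reused residuals.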

Evaluation Criteria:

Amplitude variance (pulsing detection)
Spectral flatness
High-frequency noise ratio
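The first two criteria can be sketched with NumPy. Amplitude variance is taken here as the variance of a per-frame RMS envelope, and spectral flatness as the ratio of the geometric to the arithmetic mean of the power spectrum. The frame size, function names, and test signals are assumptions; the source does not specify how these values were computed.

```python
import numpy as np

def amplitude_variance(signal, frame=2048):
    """Variance of the per-frame RMS envelope; higher values suggest pumping."""
    n_frames = len(signal) // frame
    frames = np.reshape(signal[: n_frames * frame], (n_frames, frame))
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.var(rms))

def spectral_flatness(signal, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum.

    Near 0 for tonal content, closer to 1 for noise-like content.
    """
    power = np.abs(np.fft.rfft(signal)) ** 2 + eps   # eps avoids log(0)
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

sr = 44100
t = np.arange(2 * sr) / sr
steady = np.sin(2 * np.pi * 55.0 * t)                          # stable bass drone
pumping = steady * (1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t))   # 2 Hz level wobble
noise = np.random.default_rng(0).standard_normal(2 * sr)

steady_var = amplitude_variance(steady)
pumping_var = amplitude_variance(pumping)
tone_flatness = spectral_flatness(steady)
noise_flatness = spectral_flatness(noise)
```

The pumping drone yields a clearly larger envelope variance than the steady one, and the pure tone is far less flat than white noise, which is the direction of change the drone test watches for.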

4️⃣ Complex Layering Test

Prompt:

"A man speaking in a crowded cafe with clinking plates and background jazz."

Why:

Stable Audio Open is known to struggle with prompts that use connectors (such as "with" and "and") to combine multiple simultaneous sound sources.
This test evaluates whether TeaCache acceleration exacerbates:

Foreground/background blending
Stereo image collapse
Layer loss

Evaluation Criteria:

Stereo width (correlation)
Speech presence band (300Hz–3kHz energy)
Dynamic variance (layer separation clarity)
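As a rough illustration of the first two criteria, stereo width can be approximated by the inter-channel correlation (1.0 means effectively mono), and speech presence by the share of spectral energy in the 300 Hz to 3 kHz band. The function names and test tones are hypothetical, and the exact measurement used for the reported numbers is not stated.

```python
import numpy as np

def stereo_correlation(left, right):
    """Pearson correlation between channels: 1.0 = effectively mono, lower = wider."""
    return float(np.corrcoef(left, right)[0, 1])

def band_energy_ratio(signal, sample_rate, lo_hz=300.0, hi_hz=3000.0):
    """Fraction of spectral energy inside [lo_hz, hi_hz]."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0

sr = 44100
t = np.arange(sr) / sr
voice_like = np.sin(2 * np.pi * 1000.0 * t)   # stand-in for speech-band content
bass = np.sin(2 * np.pi * 80.0 * t)           # stand-in for out-of-band content

mono_corr = stereo_correlation(voice_like, voice_like)   # identical channels
wide_corr = stereo_correlation(voice_like, bass)         # unrelated channels
speech_ratio = band_energy_ratio(voice_like + bass, sr)
```

Identical channels give a correlation of about 1.0, matching the "Mono" reading in the results, while unrelated channels give a value near 0.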

Hardware & Configuration

GPU: Tesla T4 16GB
Total Steps: 50
Max Audio Length: 10 seconds
Precision: float16
TeaCache r1_threshold values tested:
- 0.2 (Stable / Conservative)
- 0.6 (Aggressive / Stress Test)
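The resulting test matrix (4 prompts × 2 thresholds × 2 coefficient sets) can be enumerated as below. The naming scheme is inferred from the file names in the results tables (for example `rhythm_02_identity`); the helper `run_name` and the constant names are hypothetical.

```python
from itertools import product

COEFFICIENTS = ["identity", "tangoflux"]
THRESHOLDS = [0.2, 0.6]
TESTS = ["rhythm", "transient", "drone", "layering"]

def run_name(test, threshold, coefficient):
    """Build a name like 'rhythm_02_identity' from a test, threshold, and coefficient set."""
    tag = f"{threshold:.1f}".replace(".", "")   # 0.2 -> '02', 0.6 -> '06'
    return f"{test}_{tag}_{coefficient}"

# Every combination of test, threshold, and coefficient set: 16 runs in total.
runs = [run_name(t, th, c) for t, th, c in product(TESTS, THRESHOLDS, COEFFICIENTS)]
```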

Results

Rhythm Test Results

File	BPM	Timing Jitter	Peak Consistency ↓	HF Crispness	Rating
rhythm_02_identity	140.0	21.2 ms	1066	62.8%	7/10
rhythm_02_tangoflux	140.5	23.1 ms	266	37.4%	9/10
rhythm_06_identity	140.1	6.9 ms	1164	53.1%	6/10
rhythm_06_tangoflux	143.7	40.1 ms	3663	66.1%	3/10

Analysis

Threshold 0.2

Identity (7/10)
On-grid tempo but high peak variance → inconsistent kick dynamics.
TangoFlux (9/10)
4× better peak consistency (266 vs 1066) → uniform professional thump.
Slight BPM drift (140.5) and marginally higher jitter (23.1 ms vs 21.2 ms), but far more even kick levels.

Threshold 0.6 (Stress)

Identity (6/10)
Stable timing but reduced crispness → muffled sound.
TangoFlux (3/10)
Severe instability: extra beats (143.7 BPM), 40.1 ms jitter, massive peak variance.
Audible rhythmic breakdown.

Rhythm Summary

TangoFlux excels at low thresholds.
Identity is more robust under aggressive caching.

Transient Test Results

File	Transients	Avg Peak Width ↓	HF Ratio	Rating
transient_02_identity	7	0.100 ms	37.5%	5/10
transient_02_tangoflux	3	0.076 ms	42.9%	8/10
transient_06_identity	3	0.060 ms	42.8%	7/10
transient_06_tangoflux	4	0.057 ms	53.4%	9/10

Analysis

Threshold 0.2

Identity: More peaks but blurred.
TangoFlux: Sharper and brighter impacts.

Threshold 0.6

Identity: Clean but simplified.
TangoFlux: Best sharpness and HF preservation.

Transient Summary

TangoFlux is clearly superior for high-frequency bite and sharp impacts.

Layering Test Results

File	Stereo Width (correlation) ↓	Speech Presence	Dynamic Variance	Rating
layering_02_identity	0.78	71.9%	0.081	8/10
layering_02_tangoflux	0.69	51.3%	0.040	6/10
layering_06_identity	0.54	65.2%	0.060	7/10
layering_06_tangoflux	1.00 (Mono)	40.1%	0.031	1/10

Analysis

Threshold 0.2

Identity preserves voice clarity and depth.
TangoFlux compresses layers.

Threshold 0.6

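Identity degrades gracefully: the stereo image and speech presence stay largely intact (0.54 correlation, 65.2% speech presence).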
TangoFlux collapses to mono and loses subject separation.

Layering Summary

Identity is significantly more reliable for complex environments.

Drone / Temporal Test Results

File	Amplitude Variance ↓	Spectral Flatness	HF Noise	Rating
drone_02_identity	0.038	0.081	14.2%	9/10
drone_02_tangoflux	0.045	0.076	13.5%	8/10
drone_06_identity	0.052	0.104	16.1%	7/10
drone_06_tangoflux	0.071	0.092	22.4%	4/10

Analysis

Threshold 0.2

Identity produces the most stable amplitude profile.

Threshold 0.6

TangoFlux introduces noticeable wobble and HF artifacts.

Drone Summary

Identity is superior for atmospheric and evolving pads.

Final Verdict (Tesla T4 Context)

Use Case	Recommended Coefficient
Sharp transients (glass, snares, impacts)	TangoFlux (r1=0.2)
Drum loops at conservative thresholds	TangoFlux (r1=0.2)
Multi-layered environments	Identity
Drones & atmospheric audio	Identity
Aggressive caching (r1=0.6)	Identity (more stable, except on the transient test)

Overall Rating: 8.5 / 10

The experiment demonstrates a clear trade-off:

TangoFlux preserves spectral sharpness and transient attack.
Identity provides better stability and graceful degradation under high caching thresholds.

On a Tesla T4 (limited compute), Identity offers the better balance of speed and stable quality for complex audio scenes.

Coefficient Comparison Summary

Feature	Identity	TangoFlux
Transient Response	Blurs sharp sounds	Preserves HF detail
Rhythmic Stability	Stable timing but uneven kick dynamics	Excellent at low threshold, unstable at high threshold
Layer Separation	Strong foreground/background separation	Risk of muddiness or mono collapse
Temporal Flow	Minimal pulsing	Prone to wobble artifacts

Conclusion

TangoFlux is best for high-frequency, sharp, single-subject sounds.

Identity is more reliable for:

Multi-layered environments
Atmospheric content
High-threshold TeaCache usage
T4-constrained deployments

This experiment highlights how coefficient selection directly impacts structural vs spectral fidelity in accelerated diffusion audio pipelines.

Model tree for Akshat/TeaCache-flux-identity-StableAudioOpen

Base model: stabilityai/stable-audio-open-1.0