Identity vs TangoFlux Coefficients with TeaCache Acceleration

Stable Audio Open 1.0 Evaluation (Tesla T4)


Abstract

This experiment evaluates Identity vs TangoFlux coefficients under TeaCache acceleration for Stable Audio Open 1.0.

Initial benchmarks showed that Identity provided better raw speedup.
However, to investigate quality trade-offs, we conducted structured stress tests across rhythmic, spectral, temporal, and multi-layer scenarios.

Prompts were selected to intentionally stress known diffusion and caching weaknesses.

Ratings were obtained by feeding each generated clip to Gemini 3 (thinking mode) and asking it to rate the generation.


Test Prompts & Rationale

The following prompts were intentionally designed to stress different structural and spectral weaknesses under TeaCache acceleration.


1️⃣ Rhythmic "Jitter" Test

Prompt:

"140 BPM sharp techno drum loop, heavy kick, crisp hi-hats, 44.1kHz."

Why:

Stable Audio Open uses a timing-conditioned latent diffusion model to generate variable-length audio.
When TeaCache skips updates, Identity coefficients may fail to preserve transient alignment precisely, potentially introducing rhythmic jitter or subtle beat drift.

Evaluation Criteria:

  • BPM accuracy
  • Timing jitter (ms)
  • Peak consistency (kick uniformity)
  • High-frequency crispness
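
One way the BPM and timing-jitter criteria could be computed is sketched below. The source does not state how these metrics were measured, so the function name `rhythm_metrics`, the choice of median interval for tempo, and the use of the standard deviation of inter-onset intervals as "jitter" are all assumptions. Onset detection is assumed to have been done beforehand.

```python
import numpy as np

def rhythm_metrics(onset_times_s):
    """Estimate tempo and timing jitter from beat onset times given in seconds.

    Hypothetical helper: the metric definitions are assumptions, not the
    procedure used to produce the tables in this card.
    """
    onsets = np.asarray(onset_times_s, dtype=float)
    intervals = np.diff(onsets)                 # inter-onset intervals (s)
    bpm = 60.0 / np.median(intervals)           # tempo from the median interval
    jitter_ms = np.std(intervals) * 1000.0      # spread of intervals, in ms
    return float(bpm), float(jitter_ms)

# A perfectly regular 140 BPM grid: about 428.6 ms between beats, no jitter.
grid = np.arange(8) * (60.0 / 140.0)
bpm, jitter = rhythm_metrics(grid)
```

On a real clip the intervals would come from an onset detector, and any drift or skipped beat shows up as a larger interval spread.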

2️⃣ High-Frequency Clarity (Transient Test)

Prompt:

"Glass shattering on a concrete floor, high-pitched shards, long reverb tail."

Why:

The model targets high-fidelity 44.1kHz stereo synthesis, where high-frequency energy is critical.
Identity coefficients may over-smooth sharp spectral peaks, turning crisp impacts into broadband noise-like "whooshes."

Evaluation Criteria:

  • Transient detection count
  • Average peak width (sharpness)
  • Spectral high-frequency ratio
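
The spectral high-frequency ratio can be illustrated with a short FFT-based sketch. The 8 kHz cutoff and the function name `hf_energy_ratio` are assumptions made for illustration; the source does not specify which band was treated as "high frequency".

```python
import numpy as np

def hf_energy_ratio(signal, sample_rate, cutoff_hz=8000.0):
    """Fraction of spectral energy at or above cutoff_hz.

    The cutoff is an assumed value, not taken from the original evaluation.
    """
    spectrum = np.abs(np.fft.rfft(signal)) ** 2          # power per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0.0:
        return 0.0
    return float(spectrum[freqs >= cutoff_hz].sum() / total)

sr = 44100
t = np.arange(sr) / sr                                   # one second of samples
low = np.sin(2 * np.pi * 200.0 * t)                      # energy only below cutoff
high = np.sin(2 * np.pi * 12000.0 * t)                   # energy only above cutoff
ratio_low = hf_energy_ratio(low, sr)
ratio_mix = hf_energy_ratio(low + high, sr)
```

A clip whose sharp shards have been smoothed into a "whoosh" would be expected to show a lower ratio than an unaccelerated reference.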

3️⃣ Ghosting & Artifact (Drone Test)

Prompt:

"Ambient drone, slow evolving cinematic pads, deep bass, mystical atmosphere."

Why:

TeaCache reuses transformer residuals from previous timesteps to achieve ~1.5×–2.0× speedup.
Slow-moving, low-variation signals may suffer from:

  • Residual ghosting
  • Rhythmic wobble
  • Amplitude pumping

if the cache update threshold is too aggressive.
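
The skip decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this experiment: the class `TeaCacheGate`, the helper `count_computed_steps`, and the synthetic inputs are hypothetical, and "Identity" is assumed to mean the rescaling polynomial p(x) = x. The TangoFlux coefficient values are not listed in this card, so they are not reproduced here.

```python
import numpy as np

class TeaCacheGate:
    """Decide per step whether to run the transformer or reuse the cached residual."""

    def __init__(self, coefficients, threshold):
        self.rescale = np.poly1d(coefficients)   # highest-order coefficient first
        self.threshold = threshold
        self.accumulated = 0.0
        self.previous = None

    def should_compute(self, modulated_input):
        current = np.asarray(modulated_input, dtype=float)
        if self.previous is None:                # first step: nothing cached yet
            self.previous = current
            return True
        # Relative L1 change of the input since the previous step.
        rel_l1 = np.abs(current - self.previous).mean() / (
            np.abs(self.previous).mean() + 1e-12
        )
        self.accumulated += float(self.rescale(rel_l1))
        self.previous = current
        if self.accumulated < self.threshold:
            return False                         # change is small: reuse the cache
        self.accumulated = 0.0                   # change is large: recompute, reset
        return True

IDENTITY = [1.0, 0.0]  # assumption: Identity means p(x) = x

def count_computed_steps(threshold, steps=50, seed=0):
    """Count full transformer evaluations on a slowly drifting synthetic input."""
    rng = np.random.default_rng(seed)
    gate = TeaCacheGate(IDENTITY, threshold)
    x = rng.normal(size=256)
    computed = 0
    for _ in range(steps):
        x = x + 0.05 * rng.normal(size=256)      # small per-step drift
        if gate.should_compute(x):
            computed += 1
    return computed

computed_conservative = count_computed_steps(0.2)
computed_aggressive = count_computed_steps(0.6)
```

With a higher threshold the gate lets more accumulated change pass before recomputing, so fewer of the 50 steps run the transformer. Slow-moving content such as a drone changes little per step, which is why it is the case most likely to be served stale residuals.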

Evaluation Criteria:

  • Amplitude variance (pulsing detection)
  • Spectral flatness
  • High-frequency noise ratio
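
A rough sketch of two of these criteria follows. The frame size, the use of a per-frame RMS envelope for amplitude variance, and the function names are assumptions; spectral flatness is computed with its usual definition (geometric mean over arithmetic mean of the power spectrum).

```python
import numpy as np

def amplitude_variance(signal, frame=2048):
    """Variance of the per-frame RMS envelope; higher values suggest pumping.

    The frame length is an assumed value.
    """
    n_frames = len(signal) // frame
    frames = np.reshape(signal[: n_frames * frame], (n_frames, frame))
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float(rms.var())

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum (near 0 = tonal)."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12     # floor avoids log(0)
    return float(np.exp(np.log(power).mean()) / power.mean())

sr = 44100
t = np.arange(2 * sr) / sr
steady = np.sin(2 * np.pi * 110.0 * t)                   # stable drone tone
pumping = steady * (1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t))  # 2 Hz amplitude wobble
noise = np.random.default_rng(0).normal(size=t.size)     # noise-like reference
```

Comparing `amplitude_variance(pumping)` with `amplitude_variance(steady)` shows how a slow wobble inflates the envelope variance even though the pitch content is unchanged.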

4️⃣ Complex Layering Test

Prompt:

"A man speaking in a crowded cafe with clinking plates and background jazz."

Why:

Stable Audio Open is known to struggle with prompts involving connectors (multiple simultaneous sound sources).
This test evaluates whether TeaCache acceleration exacerbates:

  • Foreground/background blending
  • Stereo image collapse
  • Layer loss

Evaluation Criteria:

  • Stereo width (correlation)
  • Speech presence band (300Hz–3kHz energy)
  • Dynamic variance (layer separation clarity)
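
The stereo-width and speech-presence criteria could be measured along these lines. The sketch assumes stereo width is reported as the Pearson correlation between the left and right channels (1.0 meaning effectively mono) and that speech presence is the share of spectral energy in the 300 Hz–3 kHz band; the function names are hypothetical.

```python
import numpy as np

def stereo_correlation(left, right):
    """Pearson correlation between channels; 1.0 means effectively mono."""
    return float(np.corrcoef(left, right)[0, 1])

def band_energy_ratio(signal, sample_rate, low_hz=300.0, high_hz=3000.0):
    """Share of spectral energy between low_hz and high_hz (speech band)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    total = spectrum.sum()
    return float(spectrum[in_band].sum() / total) if total > 0 else 0.0

sr = 44100
t = np.arange(sr) / sr
voice_like = np.sin(2 * np.pi * 1000.0 * t)   # stand-in for speech-band content
bass = np.sin(2 * np.pi * 60.0 * t)           # stand-in for out-of-band content
mono_corr = stereo_correlation(voice_like, voice_like)   # identical channels
wide_corr = stereo_correlation(voice_like, bass)         # unrelated channels
speech_ratio = band_energy_ratio(voice_like + bass, sr)
```

Identical channels give a correlation of 1.0, which is the "collapse to mono" failure this test is looking for.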


Hardware & Configuration

  • GPU: Tesla T4 16GB
  • Total Steps: 50
  • Max Audio Length: 10 seconds
  • Precision: float16
  • TeaCache r1_threshold values tested:
    • 0.2 (Stable / Conservative)
    • 0.6 (Aggressive / Stress Test)

Results


Rhythm Test Results

| File | BPM | Timing Jitter | Peak Consistency ↓ | HF Crispness | Rating |
| --- | --- | --- | --- | --- | --- |
| rhythm_02_identity | 140.0 | 21.2 ms | 1066 | 62.8% | 7/10 |
| rhythm_02_tangoflux | 140.5 | 23.1 ms | 266 | 37.4% | 9/10 |
| rhythm_06_identity | 140.1 | 6.9 ms | 1164 | 53.1% | 6/10 |
| rhythm_06_tangoflux | 143.7 | 40.1 ms | 3663 | 66.1% | 3/10 |

Analysis

Threshold 0.2

  • Identity (7/10)
    On-grid tempo but high peak variance → inconsistent kick dynamics.

  • TangoFlux (9/10)
    4× better peak consistency (266 vs 1066) → uniform professional thump.
    Slight BPM drift and marginally higher jitter (23.1 ms vs 21.2 ms), but the uniform kicks give a stronger sense of rhythmic lock.

Threshold 0.6 (Stress)

  • Identity (6/10)
    Stable timing but reduced crispness → muffled sound.

  • TangoFlux (3/10)
    Severe instability: extra beats (143.7 BPM), 40 ms jitter, massive peak variance.
    Audible rhythmic breakdown.

Rhythm Summary

TangoFlux excels at low thresholds.
Identity is more robust under aggressive caching.


Transient Test Results

| File | Transients | Avg Peak Width ↓ | HF Ratio | Rating |
| --- | --- | --- | --- | --- |
| transient_02_identity | 7 | 0.100 ms | 37.5% | 5/10 |
| transient_02_tangoflux | 3 | 0.076 ms | 42.9% | 8/10 |
| transient_06_identity | 3 | 0.060 ms | 42.8% | 7/10 |
| transient_06_tangoflux | 4 | 0.057 ms | 53.4% | 9/10 |

Analysis

Threshold 0.2

  • Identity: More detected transients (7 vs 3) but wider, blurred peaks.
  • TangoFlux: Sharper (0.076 ms vs 0.100 ms) and brighter impacts.

Threshold 0.6

  • Identity: Clean but simplified.
  • TangoFlux: Best sharpness and HF preservation.

Transient Summary

TangoFlux clearly superior for high-frequency bite and sharp impacts.


Layering Test Results

| File | Stereo Width (correlation) ↓ | Speech Presence | Dynamic Variance | Rating |
| --- | --- | --- | --- | --- |
| layering_02_identity | 0.78 | 71.9% | 0.081 | 8/10 |
| layering_02_tangoflux | 0.69 | 51.3% | 0.040 | 6/10 |
| layering_06_identity | 0.54 | 65.2% | 0.060 | 7/10 |
| layering_06_tangoflux | 1.00 (Mono) | 40.1% | 0.031 | 1/10 |

Analysis

Threshold 0.2

  • Identity preserves voice clarity and depth.
  • TangoFlux compresses layers.

Threshold 0.6

  • Identity degrades gracefully.
  • TangoFlux collapses to mono and loses subject separation.

Layering Summary

Identity is significantly more reliable for complex environments.


🌊 Drone / Temporal Test Results

| File | Amplitude Variance ↓ | Spectral Flatness | HF Noise | Rating |
| --- | --- | --- | --- | --- |
| drone_02_identity | 0.038 | 0.081 | 14.2% | 9/10 |
| drone_02_tangoflux | 0.045 | 0.076 | 13.5% | 8/10 |
| drone_06_identity | 0.052 | 0.104 | 16.1% | 7/10 |
| drone_06_tangoflux | 0.071 | 0.092 | 22.4% | 4/10 |

Analysis

Threshold 0.2

Identity produces the most stable amplitude profile.

Threshold 0.6

TangoFlux introduces noticeable wobble and HF artifacts.

Drone Summary

Identity is superior for atmospheric and evolving pads.


Final Verdict (Tesla T4 Context)

| Use Case | Recommended Coefficient |
| --- | --- |
| Sharp transients (glass, snares, impacts) | TangoFlux (r1=0.2) |
| Drum loops at conservative thresholds | TangoFlux (r1=0.2) |
| Multi-layered environments | Identity |
| Drones & atmospheric audio | Identity |
| Aggressive caching (r1=0.6+) | Identity (more stable) |

Overall Rating: 8.5 / 10

The experiment demonstrates a clear trade-off:

  • TangoFlux preserves spectral sharpness and transient attack.
  • Identity provides better stability and graceful degradation under high caching thresholds.

On a Tesla T4 (limited compute), Identity offers the better balance between speed and stable quality for complex audio scenes.
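
As a rough cross-check of this trade-off, the per-file ratings from the four results tables can be averaged per coefficient and threshold. The unweighted mean is an illustrative choice and was not part of the original evaluation; with only four prompts per cell it is a coarse summary at best.

```python
from statistics import mean

# Ratings (out of 10) copied from the results tables in this card,
# keyed by (coefficient, threshold); order: rhythm, transient, layering, drone.
RATINGS = {
    ("identity", 0.2): [7, 5, 8, 9],
    ("tangoflux", 0.2): [9, 8, 6, 8],
    ("identity", 0.6): [6, 7, 7, 7],
    ("tangoflux", 0.6): [3, 9, 1, 4],
}

averages = {key: mean(values) for key, values in RATINGS.items()}

for (coefficient, threshold), avg in sorted(averages.items()):
    print(f"{coefficient:<10} r1={threshold}: mean rating {avg:.2f}")
```

The means come out at 7.25 (Identity) and 7.75 (TangoFlux) at threshold 0.2, and 6.75 (Identity) and 4.25 (TangoFlux) at threshold 0.6. TangoFlux is slightly ahead under conservative caching, while Identity loses far less when the threshold is raised.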


Coefficient Comparison Summary

| Feature | Identity | TangoFlux |
| --- | --- | --- |
| Transient Response | Blurs sharp sounds | Preserves HF detail |
| Rhythmic Stability | Stable timing but inconsistent kick dynamics | Excellent at low threshold, unstable at high threshold |
| Layer Separation | Strong foreground/background separation | Risk of muddiness or mono collapse |
| Temporal Flow | Minimal pulsing | Prone to wobble artifacts |

Conclusion

TangoFlux is best for high-frequency, sharp, single-subject sounds.

Identity is more reliable for:

  • Multi-layered environments
  • Atmospheric content
  • High-threshold TeaCache usage
  • T4-constrained deployments

This experiment highlights how coefficient selection directly impacts structural vs spectral fidelity in accelerated diffusion audio pipelines.
