Add AIST-87M dual-audio footprint comparison
README.md
CHANGED
|
@@ -112,6 +112,35 @@ This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
|
|
| 112 |
Broad diagnostic runs contain many task families that are not part of this
|
| 113 |
release gate.
|
| 114 |
|
| 115 |
## Architecture
|
| 116 |
|
| 117 |
```text
|
|
|
|
| 112 |
Broad diagnostic runs contain many task families that are not part of this
|
| 113 |
release gate.
|
| 114 |
|
| 115 |
+
## Runtime Footprint vs Dual-Audio Tower
|
| 116 |
+
|
| 117 |
+
`AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
|
| 118 |
+
audio branches with one merged native `mn20_as` EfficientAT encoder. The result
|
| 119 |
+
is a smaller deployed path with the same 1280d output contract.
|
| 120 |
+
|
| 121 |
+
| Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
|
| 122 |
+
|---|---:|---:|---:|
|
| 123 |
+
| Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
|
| 124 |
+
| Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
|
| 125 |
+
| Audio encoders | 1 | 2 | removes Whisper branch |
|
| 126 |
+
| Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
|
| 127 |
+
| Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
|
| 128 |
+
| Audio projection input width | 1,280 | 2,304 | -44.4% |
|
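The Delta column in the footprint table can be rechecked with a few lines of arithmetic. The sketch below is illustrative and not part of the release tooling: the names `FOOTPRINT`, `relative_delta`, and `delta_column` are hypothetical, the values are copied from the table, and the formula `(new - baseline) / baseline` is an assumption about how the deltas were derived that happens to reproduce the listed percentages.

```python
# Values copied from the footprint table: (AIST-87M, AIST-95M dual-audio tower).
# The dictionary and function names are hypothetical, chosen for this sketch only.
FOOTPRINT = {
    "Loaded parameters": (87_118_774, 95_315_959),
    "Safetensors artifact (MB)": (348.9, 381.9),
    "Audio encoder parameters": (19_886_566, 26_117_671),
    "Audio path parameters incl. projection": (32_193_126, 40_390_311),
    "Audio projection input width": (1_280, 2_304),
}


def relative_delta(new: float, baseline: float) -> float:
    """Percent change of `new` relative to `baseline` (assumed Delta definition)."""
    return (new - baseline) / baseline * 100.0


def delta_column() -> dict[str, str]:
    """Format each row's relative change to one decimal place, with sign."""
    return {
        name: f"{relative_delta(new, base):+.1f}%"
        for name, (new, base) in FOOTPRINT.items()
    }


if __name__ == "__main__":
    for name, delta in delta_column().items():
        print(f"{name}: {delta}")
```

Under that assumption, the parameter-count and artifact-size rows both come out at roughly the same relative reduction, which is expected when the artifact size is dominated by the stored weights.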
| 129 |
+
|
| 130 |
+
Exact-gate tradeoff against the same dual-audio local baseline:
|
| 131 |
+
|
| 132 |
+
| 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
|
| 133 |
+
|---|---:|---:|---:|
|
| 134 |
+
| Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
|
| 135 |
+
| WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
|
| 136 |
+
| SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
|
| 137 |
+
| SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
|
| 138 |
+
|
| 139 |
+
These are footprint and local exact-gate measurements, not a universal latency
|
| 140 |
+
benchmark. Wall-clock speed still depends on runtime, device, batching, and
|
| 141 |
+
audio preprocessing, but the deployed model removes one audio encoder pass and
|
| 142 |
+
shrinks the audio projection path.
|
| 143 |
+
|
| 144 |
## Architecture
|
| 145 |
|
| 146 |
```text
|