gcoderw committed on
Commit b1401d9 · verified · 1 Parent(s): 331efd2

Add AIST-87M dual-audio footprint comparison

Files changed (1)
  1. README.md +29 -0
README.md CHANGED
@@ -112,6 +112,35 @@ This release is not presented as a generic MTEB/MIEB/MAEB leaderboard model.
  Broad diagnostic runs contain many task families that are not part of this
  release gate.

+ ## Runtime Footprint vs Dual-Audio Tower
+
+ `AIST-87M` replaces the dual-audio tower's separate EfficientAT + Whisper-Tiny
+ audio branches with one merged native `mn20_as` EfficientAT encoder. The result
+ is a smaller deployed path with the same 1280d output contract.
+
+ | Runtime surface | AIST-87M | AIST-95M dual-audio tower | Delta |
+ |---|---:|---:|---:|
+ | Loaded parameters | 87,118,774 | 95,315,959 | -8.6% |
+ | Safetensors artifact | 348.9 MB | 381.9 MB | -8.6% |
+ | Audio encoders | 1 | 2 | removes Whisper branch |
+ | Audio encoder parameters | 19,886,566 | 26,117,671 | -23.9% |
+ | Audio path parameters incl. projection | 32,193,126 | 40,390,311 | -20.3% |
+ | Audio projection input width | 1,280 | 2,304 | -44.4% |
+
+ Exact-gate tradeoff against the same dual-audio local baseline:
+
+ | 1280d exact-gate slice | AIST-87M | AIST-95M dual-audio tower | Delta |
+ |---|---:|---:|---:|
+ | Speech holdout audio-text R@1 avg | 0.724 | 0.582 | +0.142 |
+ | WavCaps FSD audio-text R@1 avg | 0.097 | 0.105 | -0.009 |
+ | SALT audio-text R@1 avg | 0.008 | 0.007 | flat |
+ | SALT image-audio R@1 avg | 0.138 | 0.148 | -0.010 |
+
+ These are footprint and local exact-gate measurements, not a universal latency
+ benchmark. Wall-clock speed still depends on runtime, device, batching, and
+ audio preprocessing, but the deployed model removes one audio encoder pass and
+ shrinks the audio projection path.
+
  ## Architecture

  ```text