AS-20M
AS-20M is a standalone audio + speech embedding encoder for
human-memory augmentation workloads. It uses a native mn20_as EfficientAT
backbone with the speech/audio LoRA training merged into the released weights,
so inference does not require loading a separate adapter.
Canonical name:
- AS = audio + speech
- 20M = 19,837,720 loaded parameters, rounded to integer millions
Runtime Contract
Input is mono audio resampled to 32 kHz. The expected preprocessing is the EfficientAT mel frontend used during training:
- sample rate: 32000
- FFT: 1024
- window length: 800
- hop size: 320
- mel bins: 128
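As a rough illustration of what these parameters imply, the sketch below stores them in a small config object and derives the window and hop durations plus the expected spectrogram shape for a clip. The `FrontendConfig` class and its method names are hypothetical and not part of the release, and the frame count assumes a centered STFT (one frame per hop, plus one), which the contract above does not state. It does not compute the mel filterbank itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontendConfig:
    """Hypothetical container for the frontend parameters listed above."""
    sample_rate: int = 32000
    n_fft: int = 1024
    win_length: int = 800
    hop_length: int = 320
    n_mels: int = 128

    @property
    def window_ms(self) -> float:
        # Analysis window duration in milliseconds.
        return 1000.0 * self.win_length / self.sample_rate

    @property
    def hop_ms(self) -> float:
        # Step between consecutive frames in milliseconds.
        return 1000.0 * self.hop_length / self.sample_rate

    def mel_shape(self, n_samples: int) -> tuple:
        # (mel bins, frames). ASSUMPTION: centered STFT framing,
        # i.e. 1 + n_samples // hop_length frames.
        return self.n_mels, 1 + n_samples // self.hop_length


cfg = FrontendConfig()
print(cfg.window_ms, cfg.hop_ms)             # 25.0 10.0
print(cfg.mel_shape(10 * cfg.sample_rate))   # (128, 1001)
```

Under these assumptions, a 10-second clip at 32 kHz yields roughly 1000 frames of 128 mel bins. The authoritative values are the ones in preprocessor_config.json.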
The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles, keep the leading dimensions and L2-renormalize:
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
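The truncate-and-renormalize recipe above can be sketched in plain Python as follows. The helper names `l2norm` and `matryoshka_profiles` are illustrative, and the random vector is only a stand-in for `model(audio)`, since loading the real model is out of scope here. In practice the same logic would more likely be written with NumPy or PyTorch tensors.

```python
import math
import random


def l2norm(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def matryoshka_profiles(raw, dims=(1280, 768, 512, 256, 128)):
    """Return {dim: unit-norm prefix} for each requested profile size."""
    z_full = l2norm(raw)
    if max(dims) > len(z_full):
        raise ValueError("requested profile is larger than the embedding")
    # Keep the leading `d` dimensions, then renormalize the prefix.
    return {d: l2norm(z_full[:d]) for d in dims}


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(1280)]  # stand-in for model(audio)
profiles = matryoshka_profiles(raw)
for dim, z in profiles.items():
    print(dim, len(z), round(math.sqrt(sum(x * x for x in z)), 6))
```

Because every profile is renormalized, a dot product between two embeddings of the same profile size is their cosine similarity. Embeddings from different profile sizes should not be compared with each other.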
Artifacts
- AS-20M.safetensors: standalone native EfficientAT embedding model
- config.json: release and architecture metadata
- preprocessor_config.json: waveform and mel frontend contract
- manifest.json: file hashes and source checkpoint lineage
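One way to sanity-check the weights file without loading a model is to read the safetensors header, which is an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes. The sketch below sums the element counts from that table. The function names are illustrative, and the total may differ from the 19,837,720 figure if the file also stores non-parameter buffers such as normalization statistics; that is an assumption to verify, not a documented property of this release.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the tensor table from a .safetensors file (metadata dropped)."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def count_elements(path):
    """Sum the number of elements across all stored tensors."""
    header = read_safetensors_header(path)
    # prod([]) == 1, so scalar tensors count as one element.
    return sum(prod(info["shape"]) for info in header.values())


# Example (requires the released file to be present locally):
# total = count_elements("AS-20M.safetensors")
# print(f"{total:,} stored elements")
```

Reading only the header keeps the check cheap, since no tensor data is loaded into memory.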
Training Summary
This checkpoint was continued from the balanced native mn20_as student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.
Source checkpoint:
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
Merged LoRA source:
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
Local Gate Metrics
The checkpoint-local held-out gate reported the following audio-side consistency metrics:
| Metric | Score |
|---|---|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
Internal training runs also tracked text-audio retrieval against a companion text embedding space. Those numbers are not reported here as standalone model capabilities because this release artifact does not include a text encoder.
MAEB Audio-Only Comparison
This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base mn20_as and Whisper-Tiny do not include a compatible text
encoder; no text adapters were invented for those baselines.
Validation: each run completed 20/20 tasks with exception_count=0.
| Model | Params | Native output | Mean primary |
|---|---|---|---|
| base mn20_as | 17.9M | 1920d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
| AS-20M | 19.8M | 1280d embedding | 0.4083 |
| Task | base mn20_as | Whisper-Tiny | AS-20M |
|---|---|---|---|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |
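Interpretation: AS-20M is slightly ahead on the 20-task audio-only mean
(0.4083 vs. 0.3977 for base mn20_as and 0.3320 for Whisper-Tiny), while base
mn20_as remains stronger on several music/general-audio tasks such as
GTZANGenre, GTZANAudioReranking, JamAltArtistA2ARetrieval, and BirdCLEF.
Whisper-Tiny is competitive on some speech/language-adjacent tasks (for
example, 0.5900 on VoxPopuliLanguageID), but it is not a general audio
embedding model and is weaker on broad environmental-audio coverage in this
comparison (for example, 0.0964 on FSD2019Kaggle).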
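The mean primary scores in the summary table can be rechecked from the per-task rows. The sketch below copies the per-task values from the table above, averages each column, and counts how many tasks each model leads. The variable names are illustrative.

```python
from collections import Counter
from statistics import fmean

MODELS = ("base mn20_as", "Whisper-Tiny", "AS-20M")

# Per-task primary scores, in MODELS order, copied from the table above.
SCORES = {
    "BeijingOpera": (0.8470, 0.5933, 0.8349),
    "BirdCLEF": (0.2070, 0.0730, 0.1730),
    "CREMADPairClassification": (0.5458, 0.5752, 0.5475),
    "CREMA_D": (0.2804, 0.2995, 0.3351),
    "CREMA_DClustering": (0.0229, 0.0955, 0.0943),
    "CommonLanguageAgeDetection": (0.1401, 0.2108, 0.1799),
    "FSD2019Kaggle": (0.5734, 0.0964, 0.6230),
    "GTZANAudioReranking": (0.8298, 0.6340, 0.7747),
    "GTZANGenre": (0.8260, 0.4550, 0.7300),
    "IEMOCAPGender": (0.7790, 0.5269, 0.7712),
    "JamAltArtistA2ARetrieval": (0.8981, 0.6786, 0.8490),
    "MInDS14": (0.0818, 0.1057, 0.0967),
    "MridinghamTonic": (0.3434, 0.3080, 0.3450),
    "NMSQAPairClassification": (0.4714, 0.4360, 0.5875),
    "SIBFLEURS": (0.1515, 0.1554, 0.1456),
    "VehicleSoundClustering": (0.0065, 0.1194, 0.0162),
    "VoxCelebSA": (0.2377, 0.1673, 0.2601),
    "VoxPopuliAccentPairClassification": (0.5158, 0.5196, 0.5235),
    "VoxPopuliGenderClustering": (0.0057, 0.0008, 0.0014),
    "VoxPopuliLanguageID": (0.1900, 0.5900, 0.2780),
}

# Unweighted mean over the 20 tasks for each model column.
means = {
    model: round(fmean(row[i] for row in SCORES.values()), 4)
    for i, model in enumerate(MODELS)
}

# Number of tasks on which each model has the highest score.
wins = Counter(MODELS[row.index(max(row))] for row in SCORES.values())

for model in MODELS:
    print(f"{model}: mean={means[model]:.4f} wins={wins[model]}")
# base mn20_as: mean=0.3977 wins=7
# Whisper-Tiny: mean=0.3320 wins=7
# AS-20M: mean=0.4083 wins=6
```

The task-win counts show that the lead on the mean does not come from AS-20M winning the most tasks; it comes from AS-20M staying close to the best model on most of them.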
Artifacts:
- triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md
- triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json
Limitations
AS-20M is an audio embedding model only. It does not transcribe speech,
classify audio events directly, or embed text. Text-audio retrieval requires
a separate compatible text encoder/head that is not included in this release
artifact.