AS-20M

AS-20M is a standalone audio + speech embedding encoder for human-memory augmentation workloads. It uses a native mn20_as EfficientAT backbone with the speech/audio LoRA training merged into the released weights, so inference does not require loading a separate adapter.

Canonical name:

  • AS = audio + speech
  • 20M = 19,837,720 loaded parameters, rounded to integer millions
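The parameter figure behind the name can be checked against the weights file without loading a deep-learning framework. The following is a minimal sketch that reads only the safetensors JSON header and sums tensor element counts. The helper names `count_safetensors_params` and `rounded_millions` are hypothetical. It also assumes that every tensor stored in the file counts as a "loaded parameter"; if the file stores non-trainable buffers, the total may differ from the figure quoted above.

```python
import json
import struct
from math import prod
from pathlib import Path


def count_safetensors_params(path: Path) -> int:
    """Sum element counts over all tensors listed in a safetensors header."""
    with path.open("rb") as handle:
        # The file starts with a little-endian unsigned 64-bit header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string map, not a tensor entry.
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )


def rounded_millions(param_count: int) -> int:
    """Round a raw parameter count to integer millions."""
    return round(param_count / 1_000_000)


# The count quoted in this card rounds to the "20M" in the model name.
print(rounded_millions(19_837_720))  # 20
```

Calling `count_safetensors_params(Path("AS-20M.safetensors"))` on a local copy of the release file should give a number comparable to the one above, subject to the buffer caveat.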

Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the EfficientAT mel frontend used during training:

  • sample rate: 32000
  • FFT: 1024
  • window length: 800
  • hop size: 320
  • mel bins: 128
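These settings fix the time and frequency geometry of the model input. The sketch below derives the implied figures with plain arithmetic. `MelFrontendConfig` and `frame_geometry` are hypothetical names, and the frame count assumes a centered, padded STFT (frames = 1 + samples // hop), which should be checked against the actual frontend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelFrontendConfig:
    # Values copied from the runtime contract above.
    sample_rate: int = 32000
    n_fft: int = 1024
    win_length: int = 800
    hop_length: int = 320
    n_mels: int = 128


def frame_geometry(cfg: MelFrontendConfig, num_samples: int) -> dict:
    """Derive timing and shape figures implied by the frontend settings."""
    return {
        "window_ms": 1000.0 * cfg.win_length / cfg.sample_rate,
        "hop_ms": 1000.0 * cfg.hop_length / cfg.sample_rate,
        "frames_per_second": cfg.sample_rate / cfg.hop_length,
        # A real-input FFT of size n_fft yields n_fft // 2 + 1 frequency bins.
        "stft_bins": cfg.n_fft // 2 + 1,
        # Assumption: centered STFT with padding, so frames = 1 + N // hop.
        "num_frames": 1 + num_samples // cfg.hop_length,
        "n_mels": cfg.n_mels,
    }


cfg = MelFrontendConfig()
geometry = frame_geometry(cfg, num_samples=10 * cfg.sample_rate)
print(geometry)
```

For a 10-second clip this gives a 25 ms window, a 10 ms hop, 100 frames per second, and, under the centering assumption, a 128 x 1001 mel input.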

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles, truncate and renormalize:

z1280 = l2norm(model(audio))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
z256  = l2norm(z1280[0:256])
z128  = l2norm(z1280[0:128])
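The truncate-and-renormalize recipe above can be sketched in dependency-free Python. `l2norm` follows the notation used above. `matryoshka_profiles` and `PROFILE_DIMS` are hypothetical names, and the input vector here is a synthetic stand-in rather than real model output.

```python
import math
from typing import Dict, List, Sequence

# Profile sizes listed in the runtime contract above.
PROFILE_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def matryoshka_profiles(raw: Sequence[float]) -> Dict[int, List[float]]:
    """Return one unit-length prefix embedding per profile dimension."""
    if len(raw) != 1280:
        raise ValueError(f"expected a 1280-d embedding, got {len(raw)}")
    z1280 = l2norm(raw)
    # Each profile keeps the leading `dim` coordinates, then renormalizes.
    return {dim: l2norm(z1280[:dim]) for dim in PROFILE_DIMS}


# Stand-in for model(audio); a real embedding would come from the encoder.
raw_embedding = [math.sin(i + 1) for i in range(1280)]
profiles = matryoshka_profiles(raw_embedding)
for dim, vec in profiles.items():
    print(dim, len(vec), round(math.sqrt(sum(x * x for x in vec)), 6))
```

Because every profile is unit length, the cosine similarity between two embeddings of the same profile reduces to a dot product. Embeddings from different profiles should not be compared with each other.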

Artifacts

  • AS-20M.safetensors: standalone native EfficientAT embedding model
  • config.json: release and architecture metadata
  • preprocessor_config.json: waveform and mel frontend contract
  • manifest.json: file hashes and source checkpoint lineage
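Since manifest.json records file hashes, a download can be checked before use. The sketch below is illustrative only: this card does not state the hash algorithm or the manifest layout, so SHA-256 and a flat mapping from file name to hex digest are both assumptions, and `sha256_of_file` and `verify_artifacts` are hypothetical helpers.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(root: Path, expected: Dict[str, str]) -> Dict[str, bool]:
    """Compare each file under `root` with its expected hex digest.

    Assumption: `expected` maps file name -> SHA-256 hex digest. The real
    manifest.json may use a different algorithm or nesting.
    """
    return {
        name: sha256_of_file(root / name) == digest.lower()
        for name, digest in expected.items()
    }
```

In practice `expected` would be built from whatever structure manifest.json actually uses, and any `False` entry should stop the weights from being loaded.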

Training Summary

This checkpoint was continued from the balanced native mn20_as student and trained on an audio-heavy mix of synthetic speech/audio alignment data. The published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Merged LoRA source:

triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Local Gate Metrics

The checkpoint-local held-out gate reported audio-side consistency metrics:

Metric               Score
audio cosine         0.8108
embedding Pearson    0.7953
similarity Pearson   0.8853

Internal training runs also tracked text-audio retrieval against a companion text embedding space. Those numbers are not reported here as standalone model capabilities because this release artifact does not include a text encoder.

MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded because base mn20_as and Whisper-Tiny do not include a compatible text encoder; no text adapters were invented for those baselines.

Validation: each run completed 20/20 tasks with exception_count=0.

Model                 Params                     Native output              Mean primary
base mn20_as          17.9M                      1920d audio feature        0.3977
Whisper-Tiny encoder  8.2M encoder / 37.8M full  384d pooled encoder state  0.3320
AS-20M                19.8M                      1280d embedding            0.4083

Task                               base mn20_as  Whisper-Tiny  AS-20M
BeijingOpera                       0.8470        0.5933        0.8349
BirdCLEF                           0.2070        0.0730        0.1730
CREMADPairClassification           0.5458        0.5752        0.5475
CREMA_D                            0.2804        0.2995        0.3351
CREMA_DClustering                  0.0229        0.0955        0.0943
CommonLanguageAgeDetection         0.1401        0.2108        0.1799
FSD2019Kaggle                      0.5734        0.0964        0.6230
GTZANAudioReranking                0.8298        0.6340        0.7747
GTZANGenre                         0.8260        0.4550        0.7300
IEMOCAPGender                      0.7790        0.5269        0.7712
JamAltArtistA2ARetrieval           0.8981        0.6786        0.8490
MInDS14                            0.0818        0.1057        0.0967
MridinghamTonic                    0.3434        0.3080        0.3450
NMSQAPairClassification            0.4714        0.4360        0.5875
SIBFLEURS                          0.1515        0.1554        0.1456
VehicleSoundClustering             0.0065        0.1194        0.0162
VoxCelebSA                         0.2377        0.1673        0.2601
VoxPopuliAccentPairClassification  0.5158        0.5196        0.5235
VoxPopuliGenderClustering          0.0057        0.0008        0.0014
VoxPopuliLanguageID                0.1900        0.5900        0.2780

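Interpretation: AS-20M has the highest 20-task audio-only mean (0.4083, versus 0.3977 for base mn20_as and 0.3320 for Whisper-Tiny), but its margin over base mn20_as is small, and base mn20_as remains stronger on several music/general-audio tasks such as GTZANGenre, GTZANAudioReranking, and JamAltArtistA2ARetrieval. Whisper-Tiny is competitive on some speech/language-adjacent tasks (for example VoxPopuliLanguageID and CommonLanguageAgeDetection), but it is not a general audio embedding model and is weaker on broad environmental-audio coverage in this comparison (for example FSD2019Kaggle and BirdCLEF).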
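The summary means can be cross-checked against the per-task rows. The sketch below recomputes the unweighted mean per model and counts how many tasks each model wins outright. The scores are copied from the table above. `mean_primary` and `task_wins` are hypothetical helpers, and treating "mean primary" as a plain unweighted average of the 20 task scores is an assumption that happens to reproduce the reported figures.

```python
from typing import Dict, Tuple

MODELS = ("base mn20_as", "Whisper-Tiny", "AS-20M")

# Per-task primary scores, in MODELS order, copied from the table above.
SCORES: Dict[str, Tuple[float, float, float]] = {
    "BeijingOpera": (0.8470, 0.5933, 0.8349),
    "BirdCLEF": (0.2070, 0.0730, 0.1730),
    "CREMADPairClassification": (0.5458, 0.5752, 0.5475),
    "CREMA_D": (0.2804, 0.2995, 0.3351),
    "CREMA_DClustering": (0.0229, 0.0955, 0.0943),
    "CommonLanguageAgeDetection": (0.1401, 0.2108, 0.1799),
    "FSD2019Kaggle": (0.5734, 0.0964, 0.6230),
    "GTZANAudioReranking": (0.8298, 0.6340, 0.7747),
    "GTZANGenre": (0.8260, 0.4550, 0.7300),
    "IEMOCAPGender": (0.7790, 0.5269, 0.7712),
    "JamAltArtistA2ARetrieval": (0.8981, 0.6786, 0.8490),
    "MInDS14": (0.0818, 0.1057, 0.0967),
    "MridinghamTonic": (0.3434, 0.3080, 0.3450),
    "NMSQAPairClassification": (0.4714, 0.4360, 0.5875),
    "SIBFLEURS": (0.1515, 0.1554, 0.1456),
    "VehicleSoundClustering": (0.0065, 0.1194, 0.0162),
    "VoxCelebSA": (0.2377, 0.1673, 0.2601),
    "VoxPopuliAccentPairClassification": (0.5158, 0.5196, 0.5235),
    "VoxPopuliGenderClustering": (0.0057, 0.0008, 0.0014),
    "VoxPopuliLanguageID": (0.1900, 0.5900, 0.2780),
}


def mean_primary(scores: Dict[str, Tuple[float, ...]]) -> Dict[str, float]:
    """Unweighted mean over tasks for each model, rounded to 4 decimals."""
    n = len(scores)
    return {
        model: round(sum(row[i] for row in scores.values()) / n, 4)
        for i, model in enumerate(MODELS)
    }


def task_wins(scores: Dict[str, Tuple[float, ...]]) -> Dict[str, int]:
    """Count the tasks on which each model has the highest score."""
    wins = {model: 0 for model in MODELS}
    for row in scores.values():
        wins[MODELS[row.index(max(row))]] += 1
    return wins


means = mean_primary(SCORES)
wins = task_wins(SCORES)
print(means)
print(wins)
```

On these rows the means match the summary table, and the win counts are 7 for base mn20_as, 7 for Whisper-Tiny, and 6 for AS-20M. AS-20M's lead on the mean therefore comes from being consistently close to the best score, not from winning the most tasks.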

Artifacts:

  • triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md
  • triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json

Limitations

AS-20M is an audio embedding model only. It does not transcribe speech, classify audio events directly, or embed text. Text-audio retrieval requires a separate compatible text encoder/head that is not included in this release artifact.
