AS-20M

AS-20M is a standalone audio + speech embedding encoder for human-memory augmentation workloads. It uses a native mn20_as EfficientAT backbone with the speech/audio LoRA training merged into the released weights, so inference does not require loading a separate adapter.

Canonical name:

AS = audio + speech
20M = 19,837,720 loaded parameters, rounded to integer millions

Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the EfficientAT mel frontend used during training:

sample rate: 32000 Hz
FFT size: 1024
window length: 800 samples (25 ms)
hop size: 320 samples (10 ms)
mel bins: 128
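
A minimal sketch of the frontend contract as a config object, which may help when checking a preprocessing pipeline against the numbers above. The class name `MelFrontendConfig` and its helper methods are hypothetical and not part of the release. The frame count is only approximate because edge padding depends on STFT settings that this card does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelFrontendConfig:
    """Frontend values copied from this card; the class itself is hypothetical."""

    sample_rate: int = 32000
    n_fft: int = 1024
    win_length: int = 800
    hop_length: int = 320
    n_mels: int = 128

    @property
    def window_ms(self) -> float:
        # Window duration in milliseconds at the configured sample rate.
        return 1000.0 * self.win_length / self.sample_rate

    @property
    def hop_ms(self) -> float:
        # Time step between consecutive mel frames in milliseconds.
        return 1000.0 * self.hop_length / self.sample_rate

    def approx_frames(self, num_samples: int) -> int:
        # Approximate frame count; ignores edge padding (an assumption).
        return num_samples // self.hop_length


cfg = MelFrontendConfig()
ten_seconds = 10 * cfg.sample_rate
print(cfg.window_ms, cfg.hop_ms, cfg.approx_frames(ten_seconds))
```

With these values a 10-second clip yields roughly 1000 mel frames of 128 bins each.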

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles, truncate and renormalize:

z1280 = l2norm(model(audio))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
z256  = l2norm(z1280[0:256])
z128  = l2norm(z1280[0:128])
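
The truncate-and-renormalize recipe above can be sketched in plain Python as follows. The function names `l2norm` and `matryoshka_profiles` are illustrative, and the input vector is a placeholder standing in for `model(audio)`, not real model output. In practice the same logic would likely be written with NumPy or PyTorch arrays.

```python
import math

MATRYOSHKA_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec):
    """Scale vec to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def matryoshka_profiles(raw_embedding):
    """Return {dim: unit-length prefix of the normalized embedding} per profile."""
    if len(raw_embedding) != 1280:
        raise ValueError("expected a 1280-dimensional embedding")
    z1280 = l2norm(raw_embedding)
    # Each profile keeps the first `dim` coordinates and is renormalized.
    return {dim: l2norm(z1280[:dim]) for dim in MATRYOSHKA_DIMS}


# Placeholder for model(audio): a deterministic vector, not real model output.
fake_embedding = [math.sin(i + 1) for i in range(1280)]
profiles = matryoshka_profiles(fake_embedding)
```

Renormalizing after truncation keeps dot products equal to cosine similarities at every profile size, so one index-building routine can serve all five widths.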

Artifacts

AS-20M.safetensors: standalone native EfficientAT embedding model
config.json: release and architecture metadata
preprocessor_config.json: waveform and mel frontend contract
manifest.json: file hashes and source checkpoint lineage
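
Since manifest.json carries file hashes, downloaded artifacts can be checked before loading. The sketch below assumes SHA-256 digests and a hypothetical layout of `{"files": {"<name>": "<hex digest>"}}`; the actual schema and hash algorithm of manifest.json are not described here, so the key names would need adjusting to match the real file. The demo runs against a throwaway directory rather than the release files.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Return names of files that are missing or whose hash differs.

    ASSUMPTION: manifest looks like {"files": {"<name>": "<sha256 hex>"}}.
    """
    root = Path(root)
    bad = []
    for name, expected in manifest["files"].items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            bad.append(name)
    return bad


# Self-contained demo using a temporary directory, not the real artifacts.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"demo": True}))
    demo_manifest = {"files": {"config.json": sha256_of(Path(tmp) / "config.json")}}
    demo_ok = verify_manifest(tmp, demo_manifest)
    demo_manifest["files"]["config.json"] = "0" * 64  # simulate a corrupted entry
    demo_bad = verify_manifest(tmp, demo_manifest)
```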

Training Summary

This checkpoint was continued from the balanced native mn20_as student and trained on an audio-heavy mix of synthetic speech/audio alignment data. The published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Merged LoRA source:

triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt

Local Gate Metrics

The checkpoint-local held-out gate reported the following audio-side consistency metrics:

Metric	Score
audio cosine	0.8108
embedding Pearson	0.7953
similarity Pearson	0.8853
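
The card does not define these three metrics, so the following is only one plausible reading, sketched for orientation: mean cosine between paired embeddings from two encoders, Pearson correlation over the flattened embedding coordinates, and Pearson correlation between the two sets of pairwise similarities. The function `gate_metrics`, its arguments, and the toy data are all assumptions; the actual gate may compute something different.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def pairwise_sims(embs):
    # Cosine similarity for every unordered pair of items (upper triangle).
    n = len(embs)
    return [cosine(embs[i], embs[j]) for i in range(n) for j in range(i + 1, n)]


def gate_metrics(candidate, reference):
    """ASSUMED definitions; the real gate is not specified in this card."""
    mean_cos = sum(cosine(c, r) for c, r in zip(candidate, reference)) / len(candidate)
    flat_c = [x for row in candidate for x in row]
    flat_r = [x for row in reference for x in row]
    return {
        "audio cosine": mean_cos,
        "embedding Pearson": pearson(flat_c, flat_r),
        "similarity Pearson": pearson(pairwise_sims(candidate), pairwise_sims(reference)),
    }


# Toy data: identical embedding sets score 1.0 on all three metrics.
toy = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.4, 0.4, 0.8]]
scores = gate_metrics(toy, toy)
```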

Internal training runs also tracked text-audio retrieval against a companion text embedding space. Those numbers are not reported here as standalone model capabilities because this release artifact does not include a text encoder.

MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded because base mn20_as and Whisper-Tiny do not include a compatible text encoder; no text adapters were invented for those baselines.

Validation: each run completed 20/20 tasks with exception_count=0.

Model	Params	Native output	Mean primary
base `mn20_as`	17.9M	1920d audio feature	0.3977
Whisper-Tiny encoder	8.2M encoder / 37.8M full	384d pooled encoder state	0.3320
`AS-20M`	19.8M	1280d embedding	0.4083

Task	base `mn20_as`	Whisper-Tiny	`AS-20M`
BeijingOpera	0.8470	0.5933	0.8349
BirdCLEF	0.2070	0.0730	0.1730
CREMADPairClassification	0.5458	0.5752	0.5475
CREMA_D	0.2804	0.2995	0.3351
CREMA_DClustering	0.0229	0.0955	0.0943
CommonLanguageAgeDetection	0.1401	0.2108	0.1799
FSD2019Kaggle	0.5734	0.0964	0.6230
GTZANAudioReranking	0.8298	0.6340	0.7747
GTZANGenre	0.8260	0.4550	0.7300
IEMOCAPGender	0.7790	0.5269	0.7712
JamAltArtistA2ARetrieval	0.8981	0.6786	0.8490
MInDS14	0.0818	0.1057	0.0967
MridinghamTonic	0.3434	0.3080	0.3450
NMSQAPairClassification	0.4714	0.4360	0.5875
SIBFLEURS	0.1515	0.1554	0.1456
VehicleSoundClustering	0.0065	0.1194	0.0162
VoxCelebSA	0.2377	0.1673	0.2601
VoxPopuliAccentPairClassification	0.5158	0.5196	0.5235
VoxPopuliGenderClustering	0.0057	0.0008	0.0014
VoxPopuliLanguageID	0.1900	0.5900	0.2780

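Interpretation: AS-20M has the highest 20-task audio-only mean (0.4083, versus 0.3977 for base mn20_as and 0.3320 for Whisper-Tiny), a margin of about 0.011 over the base model. Base mn20_as remains stronger on several music/general-audio tasks, including GTZANGenre (0.8260 vs 0.7300), JamAltArtistA2ARetrieval (0.8981 vs 0.8490), and BirdCLEF (0.2070 vs 0.1730). Whisper-Tiny is competitive on some speech/language-adjacent tasks and leads clearly on VoxPopuliLanguageID (0.5900), but it is not a general audio embedding model and is weaker on broad environmental-audio coverage in this comparison, for example FSD2019Kaggle (0.0964 vs 0.6230 for AS-20M).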
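
As a sanity check on the summary row, the "Mean primary" values can be recomputed from the per-task table. The sketch below copies the scores from that table and takes an unweighted mean over the 20 tasks, which is an assumption about how the mean was aggregated; it also tallies which model has the top score on each task.

```python
# Per-task primary scores copied from the table above:
# (base mn20_as, Whisper-Tiny, AS-20M)
SCORES = {
    "BeijingOpera": (0.8470, 0.5933, 0.8349),
    "BirdCLEF": (0.2070, 0.0730, 0.1730),
    "CREMADPairClassification": (0.5458, 0.5752, 0.5475),
    "CREMA_D": (0.2804, 0.2995, 0.3351),
    "CREMA_DClustering": (0.0229, 0.0955, 0.0943),
    "CommonLanguageAgeDetection": (0.1401, 0.2108, 0.1799),
    "FSD2019Kaggle": (0.5734, 0.0964, 0.6230),
    "GTZANAudioReranking": (0.8298, 0.6340, 0.7747),
    "GTZANGenre": (0.8260, 0.4550, 0.7300),
    "IEMOCAPGender": (0.7790, 0.5269, 0.7712),
    "JamAltArtistA2ARetrieval": (0.8981, 0.6786, 0.8490),
    "MInDS14": (0.0818, 0.1057, 0.0967),
    "MridinghamTonic": (0.3434, 0.3080, 0.3450),
    "NMSQAPairClassification": (0.4714, 0.4360, 0.5875),
    "SIBFLEURS": (0.1515, 0.1554, 0.1456),
    "VehicleSoundClustering": (0.0065, 0.1194, 0.0162),
    "VoxCelebSA": (0.2377, 0.1673, 0.2601),
    "VoxPopuliAccentPairClassification": (0.5158, 0.5196, 0.5235),
    "VoxPopuliGenderClustering": (0.0057, 0.0008, 0.0014),
    "VoxPopuliLanguageID": (0.1900, 0.5900, 0.2780),
}
MODELS = ("base mn20_as", "Whisper-Tiny", "AS-20M")

# Unweighted mean over tasks, rounded to four decimals like the summary table.
means = {
    model: round(sum(row[i] for row in SCORES.values()) / len(SCORES), 4)
    for i, model in enumerate(MODELS)
}

# Count how often each model has the highest score on a task.
wins = {model: 0 for model in MODELS}
for row in SCORES.values():
    wins[MODELS[row.index(max(row))]] += 1

print(means)
print(wins)
```

The recomputed means agree with the "Mean primary" column, which suggests that column is a plain average of the 20 task scores.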

Result artifacts:

triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md
triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json

Limitations

AS-20M is an audio embedding model only. It does not transcribe speech, classify audio events directly, or embed text. Text-audio retrieval requires a separate compatible text encoder/head that is not included in this release artifact.
