	---
	language:
	- en
	license: apache-2.0
	tags:
	- audio
	- speech
	- embedding
	- retrieval
	- feature-extraction
	- efficientat
	- matryoshka
	- memory-augmentation
	library_name: pytorch
	pipeline_tag: feature-extraction
	datasets:
	- custom
	---

	# AS-20M

	`AS-20M` is a standalone audio + speech embedding encoder for
	human-memory augmentation workloads. It uses a native `mn20_as` EfficientAT
	backbone with the speech/audio LoRA training merged into the released weights,
	so inference does not require loading a separate adapter.

	Canonical name:

	- `AS` = audio + speech
	- `20M` = 19,837,720 loaded parameters, rounded to integer millions
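The `20M` suffix and the `19.8M` figure used when comparing model sizes both follow from the same loaded parameter count. A minimal check of that rounding, with illustrative variable names that are not part of the release:

```python
# Loaded parameter count stated in this card.
LOADED_PARAMS = 19_837_720

millions = LOADED_PARAMS / 1_000_000

# Rounded to integer millions, as used in the model name.
name_suffix = f"{round(millions)}M"

# Rounded to one decimal place, as used when comparing model sizes.
size_label = f"{millions:.1f}M"

print(name_suffix, size_label)
```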

	## Runtime Contract

	Input is mono audio resampled to 32 kHz. The expected preprocessing is the
	EfficientAT mel frontend used during training:

	- sample rate: `32000`
	- FFT: `1024`
	- window length: `800`
	- hop size: `320`
	- mel bins: `128`
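The frontend values above can be captured in a small config object to make the derived timings explicit. This is a sketch only: the `MelFrontendConfig` class and its property names are hypothetical and do not describe the released `preprocessor_config.json` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelFrontendConfig:
    """Frontend values listed in this card; the class itself is illustrative."""

    sample_rate: int = 32000
    n_fft: int = 1024
    win_length: int = 800
    hop_length: int = 320
    n_mels: int = 128

    @property
    def window_ms(self) -> float:
        # Duration of one analysis window in milliseconds.
        return 1000.0 * self.win_length / self.sample_rate

    @property
    def hop_ms(self) -> float:
        # Time step between consecutive frames in milliseconds.
        return 1000.0 * self.hop_length / self.sample_rate

    @property
    def frames_per_second(self) -> float:
        return self.sample_rate / self.hop_length

    @property
    def n_freq_bins(self) -> int:
        # One-sided FFT bins available before the mel projection.
        return self.n_fft // 2 + 1


cfg = MelFrontendConfig()
print(cfg.window_ms, cfg.hop_ms, cfg.frames_per_second, cfg.n_freq_bins)
```

With these values the window spans 25 ms, frames advance every 10 ms (100 frames per second), and 513 linear-frequency bins feed the 128 mel bins. Because the window (800 samples) is shorter than the FFT size (1024), a standard STFT zero-pads each windowed frame before the transform.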

	The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
	keep the leading dimensions of the L2-normalized embedding and renormalize the
	truncated vector:

	```text
	z1280 = l2norm(model(audio))
	z768 = l2norm(z1280[0:768])
	z512 = l2norm(z1280[0:512])
	z256 = l2norm(z1280[0:256])
	z128 = l2norm(z1280[0:128])
	```
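The truncate-and-renormalize contract above can be sketched in dependency-free Python. The function names and the `fake_output` stand-in are assumptions for illustration; with PyTorch tensors the same prefix slicing and normalization would apply.

```python
import math
from typing import Dict, List, Sequence

FULL_DIM = 1280
PROFILE_DIMS = (1280, 768, 512, 256, 128)


def l2norm(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def matryoshka_profiles(raw: Sequence[float]) -> Dict[int, List[float]]:
    """Return unit-length prefixes of a 1280-d embedding, keyed by dimension."""
    if len(raw) != FULL_DIM:
        raise ValueError(f"expected {FULL_DIM} values, got {len(raw)}")
    z1280 = l2norm(raw)
    # Each profile keeps the leading `dim` values, then renormalizes them.
    return {dim: l2norm(z1280[:dim]) for dim in PROFILE_DIMS}


# Stand-in for model(audio); a real call would return the encoder output.
fake_output = [math.sin(i + 1) for i in range(FULL_DIM)]
profiles = matryoshka_profiles(fake_output)
```

Renormalizing matters because a prefix of a unit vector generally has norm below 1, so dot products between raw prefixes would no longer be cosine similarities.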

	## Artifacts

	- `AS-20M.safetensors`: standalone native EfficientAT embedding model
	- `config.json`: release and architecture metadata
	- `preprocessor_config.json`: waveform and mel frontend contract
	- `manifest.json`: file hashes and source checkpoint lineage
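Since `manifest.json` carries file hashes, a downloaded release can be checked before loading. The sketch below is hedged on two assumptions that this card does not confirm: that the hashes are SHA-256, and that the manifest has a top-level `"files"` object mapping file names to hex digests. Adjust both to the actual manifest.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(release_dir: Path) -> Dict[str, bool]:
    """Compare each listed file against its expected digest.

    ASSUMPTION: manifest.json has a top-level "files" object mapping
    file names to SHA-256 hex strings. The real schema may differ.
    """
    manifest = json.loads((release_dir / "manifest.json").read_text())
    return {
        name: sha256_of(release_dir / name) == expected.lower()
        for name, expected in manifest["files"].items()
    }
```

A call such as `verify_release(Path("AS-20M"))` would return one boolean per listed file; any `False` entry points at a corrupted or mismatched download.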

	## Training Summary

	This checkpoint was continued from the balanced native `mn20_as` student and
	trained on an audio-heavy mix of synthetic speech/audio alignment data. The
	published artifact contains merged weights, not a runtime LoRA adapter.

	Source checkpoint:

	```text
	triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
	```

	Merged LoRA source:

	```text
	triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
	```

	## Local Gate Metrics

	The checkpoint-local held-out gate reported the following audio-side
	consistency metrics:

	| Metric | Score |
	|---|---:|
	| audio cosine | 0.8108 |
	| embedding Pearson | 0.7953 |
	| similarity Pearson | 0.8853 |

	Internal training runs also tracked text-audio retrieval against a companion
	text embedding space. Those numbers are not reported here as standalone model
	capabilities because this release artifact does not include a text encoder.

	## MAEB Audio-Only Comparison

	This comparison uses the same 20 MAEB audio-only tasks for all three
	standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
	because base `mn20_as` and Whisper-Tiny do not include a compatible text
	encoder; no text adapters were invented for those baselines.

	Validation: each run completed 20/20 tasks with `exception_count=0`.

	| Model | Params | Native output | Mean primary score |
	|---|---:|---:|---:|
	| base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
	| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
	| `AS-20M` | 19.8M | 1280d embedding | 0.4083 |

	| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
	|---|---:|---:|---:|
	| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
	| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
	| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
	| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
	| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
	| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
	| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
	| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
	| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
	| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
	| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
	| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
	| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
	| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
	| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
	| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
	| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
	| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
	| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
	| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |

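	Interpretation: `AS-20M` has the highest 20-task audio-only mean (0.4083),
	narrowly ahead of base `mn20_as` (0.3977) and well ahead of Whisper-Tiny
	(0.3320). Base `mn20_as` remains stronger on several music/general-audio
	tasks, such as GTZANGenre, JamAltArtistA2ARetrieval, and BirdCLEF.
	Whisper-Tiny is competitive on some speech/language-adjacent tasks, notably
	VoxPopuliLanguageID, but it is not a general audio embedding model and is
	weaker on broad environmental-audio coverage in this comparison (for example,
	0.0964 on FSD2019Kaggle).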
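The summary means can be recomputed directly from the per-task table. The sketch below copies the published scores and also counts which model leads each task; the `ROWS`, `means`, and `wins` names are illustrative and do not reflect the structure of the result files.

```python
# Primary scores copied from the per-task table:
# (task, base mn20_as, Whisper-Tiny, AS-20M)
ROWS = [
    ("BeijingOpera", 0.8470, 0.5933, 0.8349),
    ("BirdCLEF", 0.2070, 0.0730, 0.1730),
    ("CREMADPairClassification", 0.5458, 0.5752, 0.5475),
    ("CREMA_D", 0.2804, 0.2995, 0.3351),
    ("CREMA_DClustering", 0.0229, 0.0955, 0.0943),
    ("CommonLanguageAgeDetection", 0.1401, 0.2108, 0.1799),
    ("FSD2019Kaggle", 0.5734, 0.0964, 0.6230),
    ("GTZANAudioReranking", 0.8298, 0.6340, 0.7747),
    ("GTZANGenre", 0.8260, 0.4550, 0.7300),
    ("IEMOCAPGender", 0.7790, 0.5269, 0.7712),
    ("JamAltArtistA2ARetrieval", 0.8981, 0.6786, 0.8490),
    ("MInDS14", 0.0818, 0.1057, 0.0967),
    ("MridinghamTonic", 0.3434, 0.3080, 0.3450),
    ("NMSQAPairClassification", 0.4714, 0.4360, 0.5875),
    ("SIBFLEURS", 0.1515, 0.1554, 0.1456),
    ("VehicleSoundClustering", 0.0065, 0.1194, 0.0162),
    ("VoxCelebSA", 0.2377, 0.1673, 0.2601),
    ("VoxPopuliAccentPairClassification", 0.5158, 0.5196, 0.5235),
    ("VoxPopuliGenderClustering", 0.0057, 0.0008, 0.0014),
    ("VoxPopuliLanguageID", 0.1900, 0.5900, 0.2780),
]
MODELS = ("base mn20_as", "Whisper-Tiny", "AS-20M")

# Unweighted mean of the primary score over all tasks, per model.
means = {
    model: sum(row[i + 1] for row in ROWS) / len(ROWS)
    for i, model in enumerate(MODELS)
}

# Count how many tasks each model leads on.
wins = {model: 0 for model in MODELS}
for _task, *scores in ROWS:
    wins[MODELS[scores.index(max(scores))]] += 1

for model in MODELS:
    print(f"{model}: mean={means[model]:.4f}, task wins={wins[model]}")
```

Run as written, this reproduces the means in the summary table. The per-task lead is split 7 (base `mn20_as`), 7 (Whisper-Tiny), and 6 (`AS-20M`), which suggests the higher `AS-20M` mean comes from staying close to the leader on most tasks rather than from leading more of them.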

	Comparison result files:

	- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
	- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

	## Limitations

	`AS-20M` is an audio embedding model only. It does not transcribe speech,
	classify audio events directly, or embed text. Text-audio retrieval requires
	a separate compatible text encoder/head that is not included in this release
	artifact.