---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---
# AS-20M
`AS-20M` is a standalone audio + speech embedding encoder for
human-memory augmentation workloads. It uses a native `mn20_as` EfficientAT
backbone with the speech/audio LoRA training merged into the released weights,
so inference does not require loading a separate adapter.
Canonical name:
- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to integer millions
## Runtime Contract
Input is mono audio resampled to 32 kHz. The expected preprocessing is the
EfficientAT mel frontend used during training:
- sample rate: `32000`
- FFT: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`
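The frontend parameters above can be captured in a small config sketch. The `num_frames` helper below assumes center-padded STFT framing (one frame per hop plus the initial frame); that framing convention is an assumption for illustration, not part of the stated contract:

```python
# Mel frontend parameters from the runtime contract above.
MEL_FRONTEND = {
    "sample_rate": 32000,   # mono input, resampled to 32 kHz
    "n_fft": 1024,          # FFT size
    "win_length": 800,      # analysis window length in samples
    "hop_length": 320,      # hop size in samples (10 ms at 32 kHz)
    "n_mels": 128,          # mel filterbank bins
}

def num_frames(num_samples: int, hop_length: int = 320) -> int:
    """Frame count under center-padded framing: one frame per hop,
    plus one for the initial frame. Illustrative only."""
    return num_samples // hop_length + 1
```

For example, one second of audio (32,000 samples) yields 101 frames under this convention.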
The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
truncate and renormalize:
```text
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
```
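A minimal pure-Python sketch of the truncate-and-renormalize step above (function and variable names are illustrative, not part of the release artifact):

```python
import math

def l2norm(v):
    # L2-normalize a vector; return it unchanged if the norm is zero.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def matryoshka_profiles(z1280, dims=(1280, 768, 512, 256, 128)):
    # Truncate the full 1280-d embedding to each profile size, then
    # renormalize so every truncated view is unit length again.
    return {d: l2norm(z1280[:d]) for d in dims}
```

Each truncated view remains a valid unit-norm embedding, so downstream cosine-similarity code needs no changes when switching profiles.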
## Artifacts
- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage
## Training Summary
This checkpoint was continued from the balanced native `mn20_as` student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.
Source checkpoint:
```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```
Merged LoRA source:
```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```
## Local Gate Metrics
The checkpoint-local heldout gate reported audio-side consistency metrics:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
Internal training runs also tracked text-audio retrieval against a companion
text embedding space. Those numbers are not reported here as standalone model
capabilities because this release artifact does not include a text encoder.
## MAEB Audio-Only Comparison
This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base `mn20_as` and Whisper-Tiny do not include a compatible text
encoder; no text adapters were invented for those baselines.
Validation: each run completed 20/20 tasks with `exception_count=0`.

| Model | Params | Native output | Mean primary |
|---|---:|---:|---:|
| base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
| `AS-20M` | 19.8M | 1280d embedding | 0.4083 |

| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
|---|---:|---:|---:|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |
Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean,
while base `mn20_as` remains stronger on several music/general-audio tasks.
Whisper-Tiny is competitive on some speech/language-adjacent tasks, but it is
not a general audio embedding model and is weaker on broad environmental-audio
coverage in this comparison.
Artifacts:
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`
## Limitations
`AS-20M` is an audio embedding model only. It does not transcribe speech,
classify audio events directly, or embed text. Text-audio retrieval requires
a separate compatible text encoder/head that is not included in this release
artifact.