---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AS-20M

`AS-20M` is a standalone audio + speech embedding encoder for
human-memory augmentation workloads. It uses a native `mn20_as` EfficientAT
backbone with the speech/audio LoRA training merged into the released weights,
so inference does not require loading a separate adapter.

Canonical name:

- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to the nearest million

## Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the
EfficientAT mel frontend used during training:

- sample rate: `32000`
- FFT: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`
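
To make the timing implied by these values concrete, here is a minimal sketch of the frame arithmetic. The constants are copied from the list above; the `center` padding behaviour is an assumption borrowed from common STFT implementations and is not stated in this card, and the function names are illustrative only.

```python
# Frontend constants copied from the list above.
SAMPLE_RATE = 32000
N_FFT = 1024
WIN_LENGTH = 800
HOP_SIZE = 320
N_MELS = 128


def window_ms() -> float:
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * WIN_LENGTH / SAMPLE_RATE


def hop_ms() -> float:
    """Time step between consecutive frames in milliseconds."""
    return 1000.0 * HOP_SIZE / SAMPLE_RATE


def num_frames(num_samples: int, center: bool = True) -> int:
    """Number of STFT frames for a mono clip of `num_samples` samples.

    center=True assumes the signal is padded by N_FFT // 2 on each side
    (an assumption, not something this card specifies).
    """
    if center:
        return 1 + num_samples // HOP_SIZE
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP_SIZE


if __name__ == "__main__":
    ten_seconds = 10 * SAMPLE_RATE
    print(window_ms(), hop_ms())             # 25.0 10.0
    print((N_MELS, num_frames(ten_seconds)))  # (128, 1001)
```

Under these assumptions each frame covers 25 ms of audio and frames advance by 10 ms, so a 10-second clip yields a mel spectrogram of roughly 128 x 1001.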

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
L2-normalize the full embedding, keep the leading dimensions of the chosen
profile, and L2-normalize again:

```text
z1280 = l2norm(model(audio))
z768  = l2norm(z1280[0:768])
z512  = l2norm(z1280[0:512])
z256  = l2norm(z1280[0:256])
z128  = l2norm(z1280[0:128])
```
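
The recipe above can be sketched in plain Python as follows. This is an illustration only: `raw` stands in for the model's 1280-dimensional output, and the helper names `l2norm` and `matryoshka_profiles` are hypothetical, not part of the release.

```python
import math
from typing import Sequence

# Profile sizes listed in the recipe above.
PROFILES = (1280, 768, 512, 256, 128)


def l2norm(v: Sequence[float], eps: float = 1e-12) -> list[float]:
    """Scale a vector to unit L2 norm (eps guards against an all-zero input)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def matryoshka_profiles(raw: Sequence[float]) -> dict[int, list[float]]:
    """Return one unit-norm embedding per profile from a raw 1280-d output."""
    if len(raw) != 1280:
        raise ValueError(f"expected 1280 dimensions, got {len(raw)}")
    z1280 = l2norm(raw)
    # Keep the leading d dimensions, then renormalize the truncated vector.
    return {d: l2norm(z1280[:d]) for d in PROFILES}


if __name__ == "__main__":
    raw = [math.sin(i) for i in range(1280)]  # stand-in for model(audio)
    for dim, z in matryoshka_profiles(raw).items():
        print(dim, len(z), round(math.sqrt(sum(x * x for x in z)), 6))
```

Renormalizing after truncation keeps every profile on the unit sphere, so a dot product between two embeddings of the same profile is their cosine similarity.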

## Artifacts

- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage
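
Because `manifest.json` carries file hashes, a download can be checked before loading. The sketch below is hedged: the manifest schema is not documented in this card, so it assumes the hashes are SHA-256 hex digests and that the caller has already extracted a `{filename: digest}` mapping from the manifest. Both `sha256_of` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected: dict[str, str], root: Path) -> list[str]:
    """Return names of files that are missing or whose digest differs.

    `expected` maps file name -> SHA-256 hex digest. That layout is an
    assumption about the manifest, not its documented schema.
    """
    bad = []
    for name, want in expected.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != want.lower():
            bad.append(name)
    return bad
```

An empty return value from `find_mismatches` means every listed file is present and matches its expected digest.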

## Training Summary

This checkpoint was continued from the balanced native `mn20_as` student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.
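
For readers unfamiliar with what "merged weights" means, the sketch below shows the standard LoRA merge identity on a tiny dense layer: the low-rank update is folded into the base matrix so the adapter path disappears at inference time. It is an illustration of the general technique, not the training code for this checkpoint. The rank, scaling factor, and the layers that carried adapters are not stated in this card, so all values here are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is (r x in) and B is (out x r)."""
    scale = alpha / len(a)  # r = number of rows of A
    delta = matmul(b, a)
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


def apply(w: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product W @ x."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in w]


if __name__ == "__main__":
    # Hypothetical toy values: 2 outputs, 3 inputs, rank 1, alpha 2.
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
    a = [[1.0, 2.0, 3.0]]
    b = [[0.5], [-1.0]]
    x = [1.0, 2.0, 3.0]

    merged = merge_lora(w, a, b, alpha=2.0)
    print(apply(merged, x))  # [21.0, -29.0]

    # Same result with the adapter kept separate: W x + scale * B (A x).
    base = apply(w, x)
    side = apply(b, apply(a, x))
    print([bv + 2.0 * sv for bv, sv in zip(base, side)])  # [21.0, -29.0]
```

Since the merged and unmerged paths produce the same output, a release that ships merged weights needs no adapter-loading step at inference time.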

Source checkpoint:

```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

Merged LoRA source:

```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

## Local Gate Metrics

The checkpoint-local held-out gate reported the following audio-side
consistency metrics:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
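
This card does not define which pairs of vectors or similarity values the gate compares, so the sketch below only illustrates the two underlying statistics named in the table, cosine similarity and Pearson correlation, on hypothetical inputs. It should not be read as the gate's implementation.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


if __name__ == "__main__":
    u = [1.0, 2.0, 3.0]
    v = [2.0, 3.0, 4.0]  # u shifted by a constant
    print(round(cosine(u, v), 4))   # 0.9926
    print(round(pearson(u, v), 4))  # 1.0
```

The two statistics differ in that Pearson correlation ignores a constant offset between the inputs while cosine similarity does not, which is one reason the two can disagree on the same data.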

Internal training runs also tracked text-audio retrieval against a companion
text embedding space. Those numbers are not reported here as standalone model
capabilities because this release artifact does not include a text encoder.

## MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base `mn20_as` and Whisper-Tiny do not include a compatible text
encoder, and no text adapters were added to those baselines for this
comparison.

Validation: each run completed 20/20 tasks with `exception_count=0`.

| Model | Params | Native output | Mean primary |
|---|---:|---:|---:|
| base `mn20_as` | 17.9M | 1920d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384d pooled encoder state | 0.3320 |
| `AS-20M` | 19.8M | 1280d embedding | 0.4083 |

| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
|---|---:|---:|---:|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |
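
As a quick consistency check, the sketch below recomputes the "Mean primary" value for the `AS-20M` and base `mn20_as` columns as an unweighted mean of the per-task scores in the table above. Treating the summary as an unweighted arithmetic mean is an assumption; the scores themselves are copied from the table in row order.

```python
# Per-task primary scores copied from the table above, in row order.
AS_20M = [
    0.8349, 0.1730, 0.5475, 0.3351, 0.0943, 0.1799, 0.6230, 0.7747,
    0.7300, 0.7712, 0.8490, 0.0967, 0.3450, 0.5875, 0.1456, 0.0162,
    0.2601, 0.5235, 0.0014, 0.2780,
]
BASE_MN20_AS = [
    0.8470, 0.2070, 0.5458, 0.2804, 0.0229, 0.1401, 0.5734, 0.8298,
    0.8260, 0.7790, 0.8981, 0.0818, 0.3434, 0.4714, 0.1515, 0.0065,
    0.2377, 0.5158, 0.0057, 0.1900,
]


def mean_primary(scores: list[float]) -> float:
    """Unweighted mean over tasks, rounded to four decimals like the table."""
    return round(sum(scores) / len(scores), 4)


if __name__ == "__main__":
    print(len(AS_20M), mean_primary(AS_20M))              # 20 0.4083
    print(len(BASE_MN20_AS), mean_primary(BASE_MN20_AS))  # 20 0.3977
```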

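Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean
(0.4083 versus 0.3977 for base `mn20_as`), while base `mn20_as` remains
stronger on several music/general-audio tasks, such as GTZANGenre (0.8260
versus 0.7300) and JamAltArtistA2ARetrieval (0.8981 versus 0.8490).
Whisper-Tiny is competitive on some speech/language-adjacent tasks, leading
on VoxPopuliLanguageID (0.5900), but it is not a general audio embedding model
and is weaker on broad environmental-audio coverage in this comparison, for
example FSD2019Kaggle (0.0964 versus 0.6230 for `AS-20M`).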

Result files:

- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

## Limitations

`AS-20M` is an audio embedding model only. It does not transcribe speech,
classify audio events directly, or embed text. Text-audio retrieval requires
a separate compatible text encoder/head that is not included in this release
artifact.