---
language:
- en
license: apache-2.0
tags:
- audio
- speech
- embedding
- retrieval
- feature-extraction
- efficientat
- matryoshka
- memory-augmentation
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---

# AS-20M

`AS-20M` is a standalone audio + speech embedding encoder for human-memory
augmentation workloads. It uses a native `mn20_as` EfficientAT backbone with
the speech/audio LoRA training merged into the released weights, so inference
does not require loading a separate adapter.

Canonical name:

- `AS` = audio + speech
- `20M` = 19,837,720 loaded parameters, rounded to the nearest million

## Runtime Contract

Input is mono audio resampled to 32 kHz. The expected preprocessing is the
EfficientAT mel frontend used during training (see the sketch after this
list):

- sample rate: `32000`
- FFT size: `1024`
- window length: `800`
- hop size: `320`
- mel bins: `128`
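
A minimal preprocessing sketch with the parameters above, assuming
`torchaudio`. The window function and log compression are assumptions; the
exact EfficientAT frontend (window type, mel scale, normalization) may differ
in detail.

```python
import torch
import torchaudio

SAMPLE_RATE = 32_000

# Mel frontend using the contract values above.
# Assumption: torchaudio's default Hann window and a simple log compression.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    win_length=800,
    hop_length=320,
    n_mels=128,
)

def preprocess(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Downmix to mono, resample to 32 kHz, and compute a log-mel spectrogram."""
    if waveform.dim() == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # stereo -> mono
    if orig_sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, orig_sr, SAMPLE_RATE)
    return torch.log(mel(waveform) + 1e-5)  # shape: (1, 128, n_frames)
```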

The model emits a 1280-dimensional embedding. For Matryoshka runtime profiles,
truncate and renormalize:

```text
z1280 = l2norm(model(audio))
z768 = l2norm(z1280[0:768])
z512 = l2norm(z1280[0:512])
z256 = l2norm(z1280[0:256])
z128 = l2norm(z1280[0:128])
```
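
The same recipe as a PyTorch sketch, assuming `model(audio)` returns a
`(batch, 1280)` tensor; the helper name `matryoshka_profiles` is illustrative,
not part of the release.

```python
import torch
import torch.nn.functional as F

def matryoshka_profiles(z1280: torch.Tensor) -> dict[int, torch.Tensor]:
    """Truncate a (batch, 1280) embedding to each profile and re-L2-normalize."""
    z1280 = F.normalize(z1280, dim=-1)
    return {d: F.normalize(z1280[..., :d], dim=-1) for d in (1280, 768, 512, 256, 128)}
```

For example, `matryoshka_profiles(model(audio))[256]` gives the 256-dimensional
profile corresponding to `z256` in the pseudocode above.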

## Artifacts

- `AS-20M.safetensors`: standalone native EfficientAT embedding model
- `config.json`: release and architecture metadata
- `preprocessor_config.json`: waveform and mel frontend contract
- `manifest.json`: file hashes and source checkpoint lineage
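
As a quick sanity check, the parameter count behind the `20M` name can be
reproduced from the released weights. This sketch only assumes the
`safetensors` package and the file name listed above.

```python
from safetensors.torch import load_file

# Load the released weights as a plain state dict (no model class required).
state_dict = load_file("AS-20M.safetensors")
n_params = sum(t.numel() for t in state_dict.values())
print(f"{n_params:,} parameters")  # expected: 19,837,720
```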

## Training Summary

This checkpoint was continued from the balanced native `mn20_as` student and
trained on an audio-heavy mix of synthetic speech/audio alignment data. The
published artifact contains merged weights, not a runtime LoRA adapter.

Source checkpoint:

```text
triembed/checkpoints/mn20_native_merged_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

Merged LoRA source:

```text
triembed/checkpoints/mn20_native_lora_aistmix_audioheavy100k175k175k_continue_from_balanced_20260426T143137Z/latest_model.pt
```

## Local Gate Metrics

The checkpoint-local heldout gate reported audio-side consistency metrics:

| Metric | Score |
|---|---:|
| audio cosine | 0.8108 |
| embedding Pearson | 0.7953 |
| similarity Pearson | 0.8853 |
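
The gate's exact metric definitions are not published with this card. As a
rough illustration of the metric family, mean cosine similarity and Pearson
correlation between paired embedding sets can be computed as below; the paired
tensors `a` and `b` are placeholders, not the actual gate data.

```python
import torch
import torch.nn.functional as F

def cosine_and_pearson(a: torch.Tensor, b: torch.Tensor) -> tuple[float, float]:
    """Mean cosine similarity and flattened Pearson correlation for paired (N, D) embeddings."""
    cos = F.cosine_similarity(a, b, dim=-1).mean().item()
    pearson = torch.corrcoef(torch.stack([a.flatten(), b.flatten()]))[0, 1].item()
    return cos, pearson
```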

Internal training runs also tracked text-audio retrieval against a companion
text embedding space. Those numbers are not reported here as standalone model
capabilities because this release artifact does not include a text encoder.

## MAEB Audio-Only Comparison

This comparison uses the same 20 MAEB audio-only tasks for all three
standalone audio encoders. Cross-modal text-audio MAEB tasks are excluded
because base `mn20_as` and Whisper-Tiny do not include a compatible text
encoder; no text adapters were invented for those baselines.

Validation: each run completed 20/20 tasks with `exception_count=0`.

| Model | Params | Native output | Mean primary metric |
|---|---:|---:|---:|
| base `mn20_as` | 17.9M | 1920-d audio feature | 0.3977 |
| Whisper-Tiny encoder | 8.2M encoder / 37.8M full | 384-d pooled encoder state | 0.3320 |
| `AS-20M` | 19.8M | 1280-d embedding | 0.4083 |

| Task | base `mn20_as` | Whisper-Tiny | `AS-20M` |
|---|---:|---:|---:|
| BeijingOpera | 0.8470 | 0.5933 | 0.8349 |
| BirdCLEF | 0.2070 | 0.0730 | 0.1730 |
| CREMADPairClassification | 0.5458 | 0.5752 | 0.5475 |
| CREMA_D | 0.2804 | 0.2995 | 0.3351 |
| CREMA_DClustering | 0.0229 | 0.0955 | 0.0943 |
| CommonLanguageAgeDetection | 0.1401 | 0.2108 | 0.1799 |
| FSD2019Kaggle | 0.5734 | 0.0964 | 0.6230 |
| GTZANAudioReranking | 0.8298 | 0.6340 | 0.7747 |
| GTZANGenre | 0.8260 | 0.4550 | 0.7300 |
| IEMOCAPGender | 0.7790 | 0.5269 | 0.7712 |
| JamAltArtistA2ARetrieval | 0.8981 | 0.6786 | 0.8490 |
| MInDS14 | 0.0818 | 0.1057 | 0.0967 |
| MridinghamTonic | 0.3434 | 0.3080 | 0.3450 |
| NMSQAPairClassification | 0.4714 | 0.4360 | 0.5875 |
| SIBFLEURS | 0.1515 | 0.1554 | 0.1456 |
| VehicleSoundClustering | 0.0065 | 0.1194 | 0.0162 |
| VoxCelebSA | 0.2377 | 0.1673 | 0.2601 |
| VoxPopuliAccentPairClassification | 0.5158 | 0.5196 | 0.5235 |
| VoxPopuliGenderClustering | 0.0057 | 0.0008 | 0.0014 |
| VoxPopuliLanguageID | 0.1900 | 0.5900 | 0.2780 |

Interpretation: `AS-20M` is slightly ahead on the 20-task audio-only mean,
while base `mn20_as` remains stronger on several music/general-audio tasks.
Whisper-Tiny is competitive on some speech/language-adjacent tasks, but it is
not a general audio embedding model and is weaker on broad environmental-audio
coverage in this comparison.

Artifacts:

- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.md`
- `triembed/results/maeb_audio_only_3model_final_20260505T215838Z.json`

## Limitations

`AS-20M` is an audio embedding model only. It does not transcribe speech,
classify audio events directly, or embed text. Text-audio retrieval requires
a separate compatible text encoder/head that is not included in this release
artifact.