---
language:
- en
license: apache-2.0
tags:
- multimodal
- embedding
- trimodal
- retrieval
- image-text-audio
- feature-extraction
library_name: pytorch
pipeline_tag: feature-extraction
datasets:
- custom
---
| |
| # ESS-AIST-81M Preview |
|
|
| `ESS-AIST-81M Preview` is the current Cortext trial checkpoint from the ESS line. |
|
|
| - release checkpoint: `ess_aist_full_v9_subjectfix_l4k/best_model.pt` |
| - exported checkpoint epoch: `3` |
| - text encoder: `MongoDB/mdbr-leaf-ir` |
| - image encoder: `mobilenetv4_conv_medium.e180_r384_in12k` |
| - audio encoder: native `mn20_as` EfficientAT LoRA audio backbone |
|
|
| This preview is the current bridge artifact for Cortext. It keeps the ESS |
| `semantic / subject / event` slice layout, but the `v9` dataset repair moved |
| the `subject` slice much closer to the entity signal Cortext actually needs. |
|
|
| GGUF quantizations for this exact release live in: |
|
|
| - `augmem/ESS-AIST-81M-preview-GGUF` |
|
|
| Tradeoff: |
|
|
| - held-out subject/entity separation is much stronger than the earlier `v7` preview |
| - speech and SALT retrieval are weaker than the earlier `v7` retrieval-max point |
|
|
| For Cortext, this is still the better preview because the entity-side signal is |
| materially stronger. |
|
|
| ## Embedding Layout |
|
|
| Output embedding: `1536d` |
|
|
| - `0:512` semantic |
| - `512:1024` subject |
| - `1024:1536` event |
|
|
| Recommended normalized runtime views: |
|
|
| - `semantic_key = l2norm(z[0:512])` |
| - `subject_key = l2norm(z[512:1024])` |
| - `event_key = l2norm(z[1024:1536])` |
| - `full_key = l2norm(z[0:1536])` |
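
The slice boundaries and normalized views above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the release's own runtime code: the helper names `l2norm`, `runtime_views`, and `cosine` are hypothetical, and the input vector is a synthetic stand-in for a real model output `z`.

```python
import math

# Slice boundaries exactly as listed in the layout above.
SLICES = {
    "semantic": (0, 512),
    "subject": (512, 1024),
    "event": (1024, 1536),
    "full": (0, 1536),
}


def l2norm(v):
    """Scale v to unit L2 length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def runtime_views(z):
    """Split one 1536d embedding into the four normalized runtime keys."""
    if len(z) != 1536:
        raise ValueError(f"expected 1536 dims, got {len(z)}")
    return {f"{name}_key": l2norm(z[a:b]) for name, (a, b) in SLICES.items()}


def cosine(u, v):
    """For unit-length keys, cosine similarity reduces to a dot product."""
    return sum(a * b for a, b in zip(u, v))


# Synthetic stand-in for a model output (assumption: not a real embedding).
z = [float(i % 7 - 3) for i in range(1536)]
views = runtime_views(z)
```

Each slice is normalized on its own, so `full_key` is not the concatenation of the three 512d keys: it is the whole 1536d vector rescaled once.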
|
|
| ## Exact Release Metrics |
|
|
| All numbers below are from the exact published checkpoint state exported from |
| `ess_aist_full_v9_subjectfix_l4k/best_model.pt` at checkpoint epoch `3`. |
|
|
| Evaluation scope note: |
|
|
| - `SALT` is no longer a clean out-of-training benchmark for this ESS line because SALT-derived rows were used during training. |
| - `speech holdout` is also train-adjacent rather than fully external because explicit speech/audio-text data was added back into the corpus. |
| - In this release, those two surfaces should be read as regression gates and in-domain checks, not as contamination-free generalization claims. |
| - A later full external sweep (`MTEB / MIEB / MAEB`) is still pending. |
|
|
| ### 512d Retrieval |
|
|
| Source: |
| - `retrieval_512_gt1030.json` |
|
|
| Speech holdout: |
|
|
| - `A->T_r1 = 0.3276` |
| - `T->A_r1 = 0.3202` |
| - `A->T_r5 = 0.6120` |
| - `T->A_r5 = 0.6046` |
|
|
| SALT: |
|
|
| - `I->T_r1 = 0.3179` |
| - `T->I_r1 = 0.3425` |
| - `A->T_r1 = 0.1226` |
| - `T->A_r1 = 0.1272` |
| - `I->A_r1 = 0.1970` |
| - `A->I_r1 = 0.2148` |
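
For readers unfamiliar with the `r1` / `r5` notation, the sketch below shows one common way recall@k is computed for paired cross-modal retrieval. It rests on assumptions this card does not confirm: one ground-truth candidate per query (query `i` matches candidate `i`) and ranking by similarity score. The function name `recall_at_k` is hypothetical; the actual protocol is whatever produced `retrieval_512_gt1030.json`.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose paired candidate ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j, and the
    correct match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (illustrative values only).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
```

Under this reading, `A->T_r1` would be the share of audio queries whose paired text is the top-ranked candidate, and `T->A_r1` the reverse direction.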
|
|
| ### Held-Out ESS Eval |
|
|
| Sources: |
|
|
| - `subject_eval.json` |
| - `event_eval.json` |
| - `prefix_eval.json` |
|
|
| Subject / entity surface: |
|
|
| - `subject_key` same/different AUC: `0.9881` |
| - `subject_key` same-topic-different-subject rejection AUC: `0.9881` |
|
|
| Event / disambiguation surface: |
|
|
| - `subject_key` event same/different AUC: `0.8855` |
| - `event_key` event same/different AUC: `0.8193` |
| - `subject_key` same-subject-different-event rejection AUC: `0.7381` |
| - `event_key` same-subject-different-event rejection AUC: `0.6807` |
| - `subject_key` topic-shift rejection AUC: `0.9513` |
| - `event_key` topic-shift rejection AUC: `0.8969` |
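
As a reading aid for the AUC figures above, here is a minimal sketch of a pairwise same/different AUC. It assumes each pair is scored by the similarity of the relevant key (for example the cosine of two `subject_key` vectors), with "same" pairs as positives and "different" pairs as negatives. The evaluation scripts are not part of this card, so the function name `pair_auc` and the exact scoring are assumptions.

```python
def pair_auc(pos_scores, neg_scores):
    """Probability that a random positive pair outscores a random negative pair.

    Ties count as half a win, which matches the usual rank-based
    (Mann-Whitney) definition of ROC AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy similarity scores (illustrative values only).
same_pairs = [0.9, 0.8, 0.4]
different_pairs = [0.5, 0.3, 0.1]
auc = pair_auc(same_pairs, different_pairs)
```

An AUC of `0.5` means the key cannot separate the two pair types and `1.0` means perfect separation, which is the scale on which the values above should be read.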
|
|
| Interpretation: |
|
|
| - the repaired `v9` held-out surface is no longer near-random on subject/entity |
| - the current `subject` slice is the strongest entity carrier in the model |
| - event structure is usable, but still entangled with subject |
| - this is the right bridge checkpoint for Cortext, not the final `semantic/entity` architecture |
|
|
| ## Architecture |
|
|
| This preview is a frozen-encoder / trainable-projector stack: |
|
|
| - text encoder params: `22,861,056` |
| - image encoder params: `8,434,512` |
| - audio encoder params: `20,639,974` |
| - image projection params: `9,975,296` |
| - audio projection params: `9,975,296` |
| - text projection params: `8,926,720` |
| - total exact loaded params: `80,812,854` |
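
The listed counts can be checked arithmetically. The short sketch below sums the per-component figures from the list above; the dictionary names are illustrative and are not taken from `parameter_breakdown.json`.

```python
# Parameter counts copied from the list above.
ENCODERS = {
    "text_encoder": 22_861_056,
    "image_encoder": 8_434_512,
    "audio_encoder": 20_639_974,
}
PROJECTIONS = {
    "image_projection": 9_975_296,
    "audio_projection": 9_975_296,
    "text_projection": 8_926_720,
}

encoder_total = sum(ENCODERS.values())
projection_total = sum(PROJECTIONS.values())
total = encoder_total + projection_total

print(f"encoders:    {encoder_total:,}")
print(f"projections: {projection_total:,}")
print(f"total:       {total:,}")
```

The six components sum to `80,812,854`, matching the stated total. The projection heads account for `28,877,312` of those parameters and the encoders for `51,935,542`.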
|
|
| The audio path is not the old dual-audio teacher path. It uses the native |
| audioheavy LoRA EfficientAT backbone. |
|
|
| ## Files |
|
|
| File | Purpose |
|---|---|
| `ESS-AIST-81M.safetensors` | Full preview release artifact |
| `export_metadata.json` | ESS export contract |
| `manifest.json` | Release manifest |
| `parameter_breakdown.json` | Exact parameter accounting |
| `ess_ait_86m_spec.yaml` | Training config used for the release line |
| `retrieval_512_gt1030.json` | Exact 512d retrieval eval for this checkpoint |
| `subject_eval.json` | Exact held-out subject eval for this checkpoint |
| `event_eval.json` | Exact held-out event eval for this checkpoint |
| `prefix_eval.json` | Prefix-level AUC summary |
|
|
| ## Caveats |
|
|
| - This is the current preview checkpoint, not the final Cortext model family. |
| - The current runtime slices are still named `semantic / subject / event`; the next family will move toward `semantic / entity`. |
| - Subject/entity is now strong on the repaired `v9` held-out surface, but event remains entangled with subject. The engine still needs attention over active anchors for weak-reference resolution.
| - Retrieval on `speech holdout` and `SALT` is lower than the earlier `v7` preview. |
| - `SALT` and `speech holdout` are useful release gates for this line, but they are no longer fully external benchmarks in the same way they were for the earlier pre-ESS artifacts. |
| - Use this for internal Cortext trials, not as the final memory-model release. |
|
|