sam-at-axiotic committed on
Commit 726ebef · verified · 1 Parent(s): bcd7c46

Upload README.md with huggingface_hub

  1. README.md +1 -1
README.md CHANGED
@@ -232,7 +232,7 @@ Ogma is a stronger feature extractor for **prompt injection detection** — the
 
  **Key design choices:**
 
- - **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. Recommended inference routes are **symmetric**: encode everything with `[SYM]`, or encode both queries and documents with `[QRY]` (the `[QRY]`/`[QRY]` route). The `[DOC]` token is exposed for downstream fine-tuning rather than as a recommended asymmetric query/document route at inference.
+ - **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. **Recommended inference route: `[QRY]`/`[QRY]`** encode both queries and documents with `[QRY]`; this benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. **We do not recommend `[DOC]` at inference time** — it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
  - **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
  - **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
  - **L2 normalisation:** All outputs are unit-normalised; cosine similarity == dot product == euclidean similarity (up to a constant), simplifying downstream usage.
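
The pooling, normalisation, and truncation bullets in this hunk can be illustrated with a small, dependency-free sketch. It is not Ogma's actual code: the helper names (`mean_pool`, `l2_normalise`, `truncate_matryoshka`) and the toy token vectors are hypothetical, and re-normalising after truncation is a common convention assumed here rather than something the README states. Task-token handling is left out because the tokenizer API is not shown in this diff. The final check spells out the "up to a constant" remark: for unit vectors, squared Euclidean distance equals `2 - 2 * dot`, so all three measures rank neighbours identically.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def truncate_matryoshka(vec, dim):
    """Keep the first `dim` components, then re-normalise (assumed convention)."""
    return l2_normalise(vec[:dim])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy token outputs (hypothetical): the last row of tokens_a is padding.
tokens_a = [[1.0, 2.0, 0.0, 1.0], [3.0, 0.0, 1.0, 1.0],
            [2.0, 1.0, 2.0, 1.0], [9.0, 9.0, 9.0, 9.0]]
mask_a = [1, 1, 1, 0]
tokens_b = [[0.0, 1.0, 1.0, 2.0], [2.0, 1.0, 3.0, 0.0]]
mask_b = [1, 1]

emb_a = l2_normalise(mean_pool(tokens_a, mask_a))
emb_b = l2_normalise(mean_pool(tokens_b, mask_b))

# With unit vectors, the dot product is the cosine similarity.
cosine = dot(emb_a, emb_b)

# Squared Euclidean distance is an affine function of the dot product.
assert math.isclose(squared_euclidean(emb_a, emb_b), 2.0 - 2.0 * cosine)

# Matryoshka-style truncation to a smaller sub-dimension.
small_a = truncate_matryoshka(emb_a, 2)
```

Because the padding row is masked out, its large values never reach the pooled embedding. Any sub-dimension passed to `truncate_matryoshka` here is purely illustrative; the dimensions the model actually supports are not listed in this hunk.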