Upload README.md with huggingface_hub

Files changed (1) hide show

README.md CHANGED Viewed

@@ -232,7 +232,7 @@ Ogma is a stronger feature extractor for **prompt injection detection** — the
 **Key design choices:**
-- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. Recommended inference routes are **symmetric**: encode everything with `[SYM]`, or encode both queries and documents with `[QRY]` (the `[QRY]`/`[QRY]` route). The `[DOC]` token is exposed for downstream fine-tuning rather than as a recommended asymmetric query/document route at inference.
 - **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
 - **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
 - **L2 normalisation:** All outputs are unit-normalised; cosine similarity == dot product == euclidean similarity (up to a constant), simplifying downstream usage.

 **Key design choices:**
+- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. **Recommended inference route: `[QRY]`/`[QRY]`** — encode both queries and documents with `[QRY]`; this benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. **We do not recommend `[DOC]` at inference time** — it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
 - **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
 - **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
 - **L2 normalisation:** All outputs are unit-normalised; cosine similarity == dot product == euclidean similarity (up to a constant), simplifying downstream usage.