---
license: cc-by-nc-4.0
language:
- en
tags:
- embeddings
- dense-retrieval
- matryoshka
- rag
- agents
- mteb
- sentence-similarity
- semantic-search
- text-embeddings
- text-embedding
- vector-search
- document-retrieval
- similarity-search
- classification
- clustering
- edge-ai
- on-device
- local-inference
- efficient-ai
- rag-retrieval
library_name: ogma
metrics:
- mteb
model-index:
- name: axiotic/ogma-base
  results:
  - task:
      type: sts
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cosine_spearman
      value: 86.49
  - task:
      type: classification
    dataset:
      name: MTEB AmazonPolarityClassification
      type: mteb/amazon_polarity
      split: test
    metrics:
    - type: accuracy
      value: 79.85
  - task:
      type: clustering
    dataset:
      name: MTEB RedditClustering
      type: mteb/reddit-clustering
      split: test
    metrics:
    - type: v_measure
      value: 44.67
  - task:
      type: pair-classification
    dataset:
      name: MTEB TwitterSemEval2015
      type: mteb/twittersemeval2015-pairclassification
      split: test
    metrics:
    - type: cos_sim_ap
      value: 70.79
  - task:
      type: reranking
    dataset:
      name: MTEB MindSmallReranking
      type: mteb/mind_small
      split: validation
    metrics:
    - type: map
      value: 30.62
  - task:
      type: retrieval
    dataset:
      name: MTEB MSMARCO
      type: mteb/msmarco
      split: dev
    metrics:
    - type: ndcg_at_10
      value: 35.86
  - task:
      type: summarization
    dataset:
      name: MTEB SummEval
      type: mteb/summeval
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 29.73
pipeline_tag: sentence-similarity
---
# ogma-base · 13.3M efficient text embedding model · MTEB 57.02

> High-quality English text embedding model for semantic search, RAG, vector search, retrieval, clustering, classification, STS, and agent memory. MTEB 57.02, 13.3M parameters, 1024-token context.

**Ogma Base** is the quality-first mid-size model in the Ogma family. At 13.3M parameters it scores **57.02** on MTEB (the canonical 66-task result from the Ogma paper) with about 59% of all-MiniLM-L6-v2's parameter count (13.3M vs 22.7M) and 4× longer inputs (1024 vs 256 tokens). It is the sweet spot for quality-first RAG pipelines and agent memory.
## Why the name Ogma?

The model is named after **Ogma** (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing **Ogham**, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.

---
## Use cases

ogma-base is the quality-first Ogma model for **semantic search**, **RAG retrieval**, **agent memory**, **vector databases**, **document retrieval**, **text classification**, **clustering**, **STS / sentence similarity**, and retrieval-heavy agent pipelines. It is aimed at users who want better quality than MiniLM-class models while keeping the model small enough for practical CPU deployment.

Good fits:

- **Production RAG and enterprise search** where retrieval quality matters but the model still needs to be lightweight.
- **Local or private embedding services** for teams that want to avoid external embedding APIs for sensitive text.
- **Agent memory systems** with long context chunks, symmetric query/document encoding (SYM everywhere, or QRY/QRY), and frequent retrieval.
- **Efficient CPU deployments** where 13.3M parameters is easier to host than larger embedding transformers.
- **Classification, clustering, and routing features** where embedding quality directly affects downstream decisions.

Choose ogma-base when you want the strongest general-purpose quality in the family short of the larger ogma-large.

---
## Highlights

- 🏆 **MTEB avg 57.02**: canonical Ogma paper result over 66/66 MTEB English tasks
- 📏 **1024-token context**: 4× longer than all-MiniLM-L6-v2 (256 tokens)
- 🔀 **Symmetric routing** via task tokens: encode everything with `[SYM]`, or use `[QRY]`/`[QRY]` for retrieval (queries and documents both encoded with `task="qry"`); benchmark both routes on your task
- 📐 **Matryoshka dims**: [256, 128, 64, 32] (one model, four embedding sizes)
- 🛡️ **+4.0 F1 points** on prompt injection detection vs all-MiniLM-L6-v2 (deepset/prompt-injections, MLP head; measured on the Ogma architecture family)

---
## Performance

### MTEB English: 66/66 tasks (category-averaged)

Benchmarked with [MTEB](https://github.com/embeddings-benchmark/mteb) v2.10.7 on the standard 66-task English benchmark using category averaging (same methodology as the MTEB leaderboard).

| Category | ogma-base | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | **67.74** | 62.62 | +5.12 |
| Clustering | 41.49 | **41.94** | -0.45 |
| PairClassification | **83.73** | 82.37 | +1.36 |
| Reranking | 51.25 | **58.04** | -6.79 |
| Retrieval | **42.36** | 41.95 | +0.41 |
| STS | **82.84** | 78.90 | +3.94 |
| Summarization | 29.73 | **30.81** | -1.08 |
| **Overall** | **57.02** | *56.09* | **+0.93** |

The MiniLM overall score (*56.09*) is the published figure from the Model2Vec results page, whereas the unweighted mean of its seven category scores above is 56.66. The overall Δ is therefore not a like-for-like category average.
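
As a quick check of what "category-averaged" means here, the sketch below recomputes the ogma-base overall score as the unweighted mean of its seven category scores. The dictionary and function names are illustrative only; the numbers are copied from the table.

```python
# Category scores for ogma-base, copied from the table above.
ogma_base = {
    "Classification": 67.74,
    "Clustering": 41.49,
    "PairClassification": 83.73,
    "Reranking": 51.25,
    "Retrieval": 42.36,
    "STS": 82.84,
    "Summarization": 29.73,
}


def category_average(scores: dict[str, float]) -> float:
    """Unweighted mean over categories: each category counts once,
    regardless of how many tasks it contains."""
    return sum(scores.values()) / len(scores)


overall = category_average(ogma_base)
print(f"{overall:.2f}")
```

Averaging per category rather than per task keeps categories with many tasks (such as Classification) from dominating the headline number.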
### Why choose Ogma Base?

ogma-base is the recommended choice when you want a strong quality/size trade-off without moving to larger transformer encoders (BGE, E5). On MTEB it sits in the same class as all-MiniLM-L6-v2 while being smaller (13.3M vs 22.7M parameters) and accepting 4× longer inputs.

### Safety: Toxicity & Prompt Injection Detection

Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted and then fed to a logistic regression (LR, shown as LogReg in the tables) or MLP classifier head; the embedding model itself is not fine-tuned. `all-MiniLM-L6-v2` is the baseline.
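
The probing setup described above (frozen embeddings, small trainable head) can be sketched as follows. This is a minimal illustration, not the evaluation code: the 2-d vectors are hypothetical stand-ins for Ogma embeddings, and the hand-written logistic regression stands in for a library classifier such as scikit-learn's `LogisticRegression`.

```python
import math

# Hypothetical stand-ins for frozen, pre-computed embeddings (real ones are 256-d).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [1, 1, 0, 0]  # 1 = flagged (e.g. injection), 0 = benign


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(xs, ys, lr=0.5, steps=200):
    """Fit weights and bias by batch gradient descent on the log loss.
    Only the classifier head is trained; the embeddings are never updated."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x) -> int:
    """Threshold the predicted probability at 0.5."""
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


w, b = train_logreg(features, labels)
preds = [predict(w, b, x) for x in features]
print(preds)
```

Because the encoder stays frozen, differences between the Ogma and MiniLM rows in the tables reflect the quality of the embeddings rather than task-specific fine-tuning.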
#### 1. Jigsaw Toxic Comment Classification

**Dataset:** `Arsive/toxicity_classification_jigsaw` (binary toxicity classification)

**Train:** 25,960 · **Test:** 6,490

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| **Ogma** | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | **91.71%** | **91.24%** | **90.13%** | **92.39%** | **97.16%** |

Ogma (LR) leads MiniLM (LR) by **+2.01 F1 points**. MiniLM (MLP) is the best configuration on this dataset; a likely explanation is that the larger training set (about 26K samples) lets the MLP compensate for MiniLM's slightly weaker base representations.
#### 2. Prompt Injection Detection: deepset/prompt-injections

**Dataset:** `deepset/prompt-injections` (binary injection detection)

**Train:** 546 · **Test:** 116 (low-data regime)

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 86.21% | 84.62% | **100.0%** | 73.33% | 97.77% |
| **Ogma** | MLP | **90.52%** | **90.27%** | 96.23% | **85.0%** | **98.1%** |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |

Ogma leads with both classifier heads: **+4.03 F1 points (MLP)** and **+4.23 F1 points (LogReg)**. Ogma's representations appear better separated in the low-data regime: with LogReg it reaches 100% precision, meaning zero false positives on the 116-example test set.
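
To see how accuracy, precision, recall, and F1 fit together in the Ogma LogReg row, the sketch below derives them from a confusion matrix. The counts (44 true positives, 0 false positives, 16 false negatives, 56 true negatives) are an inference: they are consistent with the reported percentages and the 116-example test set, but they are not published in this card.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Assumed counts, inferred from the reported percentages (not given in the source).
m = binary_metrics(tp=44, fp=0, fn=16, tn=56)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

With no false positives, precision is perfect, but the 16 missed injections pull recall and therefore F1 down, which is why the MLP head scores higher on F1 despite lower precision.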
#### 3. Prompt Injection Detection: neuralchemy/Prompt-injection-dataset

**Dataset:** `neuralchemy/Prompt-injection-dataset` (binary injection detection)

**Train:** 4,391 · **Test:** 942

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 95.22% | 95.93% | **95.84%** | 96.01% | 99.30% |
| **Ogma** | MLP | **95.44%** | **96.16%** | 94.89% | **97.46%** | **99.37%** |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |

Ogma leads on every metric when compared head for head: **+1.31 F1 points (MLP)** and **+0.55 F1 points (LR)**. Comparing each model's best configuration gives **+0.78** (96.16% vs 95.38%). Both models perform well at this scale; Ogma keeps its edge and reaches a higher AUC-ROC (99.37% vs 98.92%).
#### Summary

| Task | Ogma best F1 | MiniLM best F1 | Δ (F1 points) |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | **91.24% (MLP)** | −2.98 |
| deepset Injection | **90.27% (MLP)** | 86.24% (MLP) | **+4.03** |
| neuralchemy Injection | **96.16% (MLP)** | 95.38% (LR) | **+0.78** |

Ogma is the stronger feature extractor for **prompt injection detection**, the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.

---
## Architecture

| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (`d_model`) | 256 |
| Output dim (`d_output`) | 256 |
| Transformer layers | 12 |
| Attention heads | 4 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | **1,024 tokens** |
| Pooling | Mean pooling |
| Task tokens | `[QRY]` (query), `[DOC]` (document), `[SYM]` (symmetric) |
| Matryoshka dims | [32, 64, 128, 256] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 13.3M |
| Model file | `model.safetensors` (51 MB) |

**Key design choices:**

- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. **Recommended inference route: `[QRY]`/`[QRY]`**, i.e. encode both queries and documents with `[QRY]`; this route benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. **We do not recommend `[DOC]` at inference time**: it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
- **Matryoshka training:** The model is trained with Matryoshka Representation Learning, so embeddings truncated to any supported sub-dimension remain usable after re-normalisation, without retraining.
- **Mean pooling:** The sentence embedding is the average of all token outputs, excluding padding. In the Ogma architecture family this consistently outperforms CLS-token pooling.
- **L2 normalisation:** All outputs are unit-normalised, so cosine similarity equals the dot product and squared Euclidean distance equals 2 − 2·cos. All three therefore produce the same ranking, which simplifies downstream usage.
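
The pooling and normalisation steps above can be illustrated with a small pure-Python sketch. It is not the model's implementation: the 2-d token vectors and attention masks are made up for the example, and the helper names are hypothetical. It shows masked mean pooling, L2 normalisation, and the relation between dot product and squared Euclidean distance for unit vectors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors where mask == 1, ignoring padding positions."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d token outputs; the last position of each sequence is padding (mask 0).
a = l2_normalise(mean_pool([[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
b = l2_normalise(mean_pool([[0.0, 2.0], [0.0, 4.0], [5.0, 5.0]], [1, 1, 0]))

cosine = dot(a, b)  # for unit vectors, cosine similarity is just the dot product
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
print(f"cosine={cosine:.4f}  squared_distance={sq_dist:.4f}  2-2cos={2 - 2 * cosine:.4f}")
```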
---

## Usage

### Installation

```bash
pip install torch tokenizers transformers huggingface_hub
```
### Basic Encoding

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is needed because the repo ships custom model code.
model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
    "The capital of France is Paris",
]

emb = model.embed(sentences, task="sym", tokenizer=tok)
# emb has shape (3, 256): one L2-normalised 256-d vector per sentence

sim = (emb[0] @ emb[1]).item()  # cosine sim == dot product (L2-normalised)
print(f"paraphrase: {sim:.4f}")
```

`task="sym"` is a safe default for all similarity tasks (STS, clustering, classification) and for retrieval. Ogma is trained for **symmetric routing**: queries and documents are always encoded with the **same** task token. The two recommended routes are:

1. `[SYM]` for everything (the safe default above), or
2. `[QRY]`/`[QRY]`: encode both queries **and** documents with `task="qry"`.

Try both on your downstream task; either can win depending on the data, and `[QRY]`/`[QRY]` is the natural starting point when fine-tuning a classifier or retrieval head on top of the embeddings.

### Retrieval

Encode queries **and** documents with the **same** task token. The example below uses the `[QRY]`/`[QRY]` route, so both calls pass `task="qry"`. This is intentional (Ogma is symmetric, not asymmetric); swap in `task="sym"` to compare the SYM route on your data.

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

queries = ["What is knowledge distillation?"]
docs = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher.",
    "The Eiffel Tower is in Paris, France.",
]

q = model.embed(queries, task="qry", tokenizer=tok)  # shape (1, 256); symmetric: both sides use qry
d = model.embed(docs, task="qry", tokenizer=tok)     # shape (2, 256); not a typo, Ogma is symmetric

scores = (q @ d.T).squeeze(0)  # cosine sim (L2-normalised, so dot == cosine)
print(scores.tolist())         # the first doc (the relevant one) should score higher
```
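
One way to compare the two routes on your own data is to measure a simple retrieval metric under each task token. The sketch below computes recall@1 with a pluggable `embed` callable. Everything here is illustrative: `recall_at_1` and `toy_embed` are hypothetical helpers, and `toy_embed` is a stand-in encoder that ignores the task so the example runs without downloading the model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
EmbedFn = Callable[[Sequence[str], str], Sequence[Vector]]


def recall_at_1(embed: EmbedFn, task: str, queries, docs, gold) -> float:
    """Fraction of queries whose top-scoring document is the gold one.
    Queries and documents are encoded with the same task token."""
    q_vecs = embed(queries, task)
    d_vecs = embed(docs, task)
    hits = 0
    for q, gold_idx in zip(q_vecs, gold):
        scores = [sum(x * y for x, y in zip(q, d)) for d in d_vecs]
        hits += int(scores.index(max(scores)) == gold_idx)
    return hits / len(queries)


def toy_embed(texts, task):
    """Stand-in encoder (ignores `task`): keyword match mapped to a unit vector."""
    return [[1.0, 0.0] if "distill" in t.lower() else [0.0, 1.0] for t in texts]


queries = ["What is knowledge distillation?"]
docs = [
    "Knowledge distillation trains a smaller student model.",
    "The Eiffel Tower is in Paris.",
]
results = {task: recall_at_1(toy_embed, task, queries, docs, gold=[0]) for task in ("sym", "qry")}
print(results)
```

With the real model, a wrapper along the lines of `lambda texts, task: model.embed(texts, task=task, tokenizer=tok).tolist()` could be passed as `embed`, assuming `model.embed` returns a tensor as the snippets above suggest.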
### Matryoshka: Flexible Dimensionality

Ogma is trained with Matryoshka Representation Learning. Slice and re-normalise to any supported sub-dimension with no retraining:

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

emb = model.embed(["hello world"], task="sym", tokenizer=tok)  # full 256d, shape (1, 256)

for d in model.config.matryoshka_dims:
    # Keep the first d dimensions, then re-normalise to unit length.
    sub = F.normalize(emb[:, :d], dim=-1)
    print(f"{d}d norm={sub.norm(dim=-1).item():.4f}")
```
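
The practical payoff of truncation is storage. The sketch below estimates raw vector storage at each Matryoshka dimension. It assumes vectors are stored as float32 and uses a hypothetical corpus of one million vectors; index overhead and any quantisation are ignored.

```python
MATRYOSHKA_DIMS = [32, 64, 128, 256]  # from the architecture table
BYTES_PER_FLOAT32 = 4                 # assumption: vectors stored as float32


def index_size_mb(num_vectors: int, dim: int) -> float:
    """Raw vector storage in megabytes (10**6 bytes), excluding index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e6


num_vectors = 1_000_000  # hypothetical corpus size
for dim in MATRYOSHKA_DIMS:
    print(f"{dim:>3}d: {index_size_mb(num_vectors, dim):7.1f} MB")
```

Halving the dimension halves storage and dot-product cost, so it is worth measuring retrieval quality at each size on your own data before settling on one.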
## Model Family

| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **[ogma-large](https://huggingface.co/axiotic/ogma-large)** | 32.4M | 124 MB | **57.41** | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| **[ogma-base](https://huggingface.co/axiotic/ogma-base)** | 13.3M | 51 MB | 57.02 | 67.74 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| **[ogma-small](https://huggingface.co/axiotic/ogma-small)** | 8.6M | 33 MB | 56.32 | 66.49 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| **[ogma-mini](https://huggingface.co/axiotic/ogma-mini)** | 3.5M | 14 MB | 53.06 | 61.77 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| **[ogma-micro](https://huggingface.co/axiotic/ogma-micro)** | 2.3M | 8.9 MB | 52.18 | 59.53 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| *all-MiniLM-L6-v2* | 22.7M | 87 MB | *56.09* | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| *potion-base-32M* | 32.0M | 123 MB | *51.22* | 66.0 | 39.2 | 78.2 | 50.9 | 32.2 | 73.9 | 29.8 | 256 | inf |
| *potion-base-8M* | 7.6M | 29 MB | *50.03* | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |

- All Ogma models: MTEB 2.10.7, 66-task standard English set, category-averaged.
- MiniLM/Potion (italic scores): published scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).
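
The Size column tracks the parameter count closely. As a rough sanity check, the sketch below estimates weight-file size from the parameter count, assuming float32 weights (4 bytes each) and sizes expressed in units of 2**20 bytes. Neither assumption is stated in this card, and the parameter counts are themselves rounded, so small deviations are expected.

```python
def float32_size_mib(params_millions: float) -> float:
    """Approximate weight size: 4 bytes per parameter, in units of 2**20 bytes."""
    return params_millions * 1e6 * 4 / 2**20


# Parameter counts (in millions) copied from the table above.
family = [("ogma-large", 32.4), ("ogma-base", 13.3), ("ogma-small", 8.6)]
for name, params in family:
    print(f"{name}: ~{float32_size_mib(params):.0f} MB")
```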
---

## Training Details

| Property | Value |
|---|---|
| Teacher model | `jinaai/jina-embeddings-v5-text-small` (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |

---
## Limitations

- **No text generation.** Ogma is an encoder-only embedding model.
- **English only.** Training data and evaluation are English-only.
- **Slower than static models.** Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. The trade-off is contextual understanding, which static embeddings lack.
- **Non-commercial licence.** Because it is distilled from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- **Reranking gap.** Ogma lags behind all-MiniLM-L6-v2 on reranking tasks (category average Δ: −6.79). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.

---
## Licence & Attribution

This model is released under **[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)** (Creative Commons Attribution-NonCommercial 4.0 International).

**Required attribution (must be included in all uses):**

> This model was trained via knowledge distillation from
> `jina-embeddings-v5-text-small` (https://huggingface.co/jinaai/jina-embeddings-v5-text-small)
> by Jina AI, licensed under CC-BY-NC-4.0.

---
## Citation

```bibtex
@misc{ogma2026,
  title  = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
  author = {Axiotic AI},
  year   = {2026},
  url    = {https://huggingface.co/axiotic/ogma-base},
}
```