---
license: cc-by-nc-4.0
language:
- en
tags:
- embeddings
- dense-retrieval
- matryoshka
- rag
- agents
- mteb
- sentence-similarity
- semantic-search
- text-embeddings
- text-embedding
- vector-search
- document-retrieval
- similarity-search
- classification
- clustering
- edge-ai
- on-device
- local-inference
- efficient-ai
- rag-retrieval
library_name: ogma
metrics:
- mteb
model-index:
- name: axiotic/ogma-micro
  results:
  - task:
      type: sts
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cosine_spearman
      value: 77.82
  - task:
      type: classification
    dataset:
      name: MTEB AmazonPolarityClassification
      type: mteb/amazon_polarity
      split: test
    metrics:
    - type: accuracy
      value: 67.63
  - task:
      type: clustering
    dataset:
      name: MTEB RedditClustering
      type: mteb/reddit-clustering
      split: test
    metrics:
    - type: v_measure
      value: 37.83
  - task:
      type: pair-classification
    dataset:
      name: MTEB TwitterSemEval2015
      type: mteb/twittersemeval2015-pairclassification
      split: test
    metrics:
    - type: cos_sim_ap
      value: 60.03
  - task:
      type: reranking
    dataset:
      name: MTEB MindSmallReranking
      type: mteb/mind_small
      split: validation
    metrics:
    - type: map
      value: 30.1
  - task:
      type: retrieval
    dataset:
      name: MTEB MSMARCO
      type: mteb/msmarco
      split: dev
    metrics:
    - type: ndcg_at_10
      value: 21.78
  - task:
      type: summarization
    dataset:
      name: MTEB SummEval
      type: mteb/summeval
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 31.77
pipeline_tag: sentence-similarity
---
# ogma-micro  ·  2.3M-parameter efficient text embedding model  ·  MTEB 52.18
> Ultra-small English text embedding model for semantic search, RAG, vector search, clustering, classification, and agent memory — MTEB 52.18, 2.3M parameters, 128d output
**Ogma Micro** is the most compact model in the Ogma family. At 2.3M parameters and 8.9 MB it scores **52.18 MTEB** in our 66-task run while staying small enough to ship in browsers and on-device runtimes. It outputs 128-dimensional embeddings, half the width of the other Ogma models, which keeps vector indexes small. It targets latency-critical, edge, and browser workloads.
## Why the name Ogma?
Ogma is named after **Ogma** (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing **Ogham**, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.
---
## Use cases
ogma-micro is the smallest Ogma model, built for **on-device embedding**, **edge search**, **browser-side retrieval**, **local semantic search**, **agent memory**, **deduplication**, **classification**, **clustering**, and privacy-sensitive applications where sending text to an external embedding API is undesirable.
Good fits:
- **Mobile and desktop apps** that need local text embeddings without a large model download.
- **Browser, WebAssembly, and extension-style workflows** where package size and vector index size matter.
- **Serverless and high-fanout applications** that need many cheap embedding calls with predictable memory use.
- **Local-first search** over notes, messages, logs, support tickets, snippets, or small document collections.
- **Efficient vector databases** where 128-dimensional embeddings reduce storage, bandwidth, and ANN latency.
Choose ogma-micro when footprint matters more than absolute benchmark quality. Move up to **ogma-mini** or **ogma-small** when you can spend more memory for stronger representations.
---
## Highlights
- 🏆 **MTEB avg 52.18** at 2.3M parameters (category-averaged over 66 English tasks, the figure reported in the Ogma paper)
- 📦 **8.9 MB** — smallest in the family
- 📐 **128-dim output** — half the index size of other Ogma models
- 📏 **1024-token context** — 4× longer than all-MiniLM-L6-v2 (256 tokens)
- 🔀 **Symmetric routing** via task tokens — encode everything with `[SYM]`, or use `[QRY]`/`[QRY]` for retrieval (queries and documents both encoded with `task="qry"`); benchmark both routes on your task
- 📐 **Matryoshka dims**: [128, 64, 32] — compress to 32d for ultra-low memory indexing
---
## Performance
### MTEB English — 66/66 tasks (category-averaged)
Benchmarked with [MTEB](https://github.com/embeddings-benchmark/mteb) v2.10.7 on the standard 66-task English benchmark using category averaging (same methodology as the MTEB leaderboard).
| Category | ogma-micro | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | **59.53** | 62.62 | -3.09 |
| Clustering | **36.88** | 41.94 | -5.06 |
| PairClassification | **78.62** | 82.37 | -3.75 |
| Reranking | **49.74** | 58.04 | -8.30 |
| Retrieval | **33.09** | 41.95 | -8.86 |
| STS | **75.63** | 78.90 | -3.27 |
| Summarization | **31.77** | 30.81 | +0.96 |
| **Overall** | **52.18** | *56.09* | **-3.91** |
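The overall figure is consistent with an unweighted mean of the seven category scores, rather than a mean over the 66 individual tasks. A minimal check, using only the ogma-micro column of the table above:

```python
# Category scores for ogma-micro, copied from the table above.
category_scores = {
    "Classification": 59.53,
    "Clustering": 36.88,
    "PairClassification": 78.62,
    "Reranking": 49.74,
    "Retrieval": 33.09,
    "STS": 75.63,
    "Summarization": 31.77,
}

# Category averaging: each category counts once, regardless of how many
# tasks it contains.
overall = sum(category_scores.values()) / len(category_scores)
print(f"overall = {overall:.2f}")  # overall = 52.18
```

Because every category has equal weight, a single-task category such as Summarization moves the overall score as much as a category with many tasks.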
### Why choose Ogma Micro?
ogma-micro is for cases where footprint is the hard constraint: it scores 52.18 on MTEB, 3.91 points below all-MiniLM-L6-v2, with roughly one-tenth of the parameters (2.3M vs 22.7M). Its 128-dim output also halves vector index size relative to the other Ogma models (256-dim). Use **ogma-mini** if you can afford 3.5M parameters.
### Safety — Toxicity & Prompt Injection Detection
These evaluations use the Ogma transformer architecture (same model family). Embeddings are extracted and then fed to a logistic regression (LR) or MLP classifier head; the embedding model itself is not fine-tuned. `all-MiniLM-L6-v2` is the baseline.
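The "frozen embeddings plus a small classifier head" setup can be sketched as follows. This is an illustration only, not the evaluation code behind the tables: the feature vectors are random toy stand-ins for embeddings, and `train_logreg`, `predict`, and `blob` are hypothetical helpers written for this example.

```python
import math
import random

def train_logreg(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on fixed vectors with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the head assigns probability > 0.5 to the positive class."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

# Toy stand-ins for frozen embeddings: two Gaussian blobs in 8 dimensions.
# With a real model, these rows would be embedding vectors of labelled texts.
rng = random.Random(0)

def blob(center, count):
    return [[rng.gauss(center, 0.3) for _ in range(8)] for _ in range(count)]

train_x = blob(0.5, 50) + blob(-0.5, 50)
train_y = [1] * 50 + [0] * 50
test_x = blob(0.5, 20) + blob(-0.5, 20)
test_y = [1] * 20 + [0] * 20

# Only the head is trained; the vectors themselves never change.
weights, bias = train_logreg(train_x, train_y)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"held-out accuracy: {accuracy:.2f}")
```

The embedding model never receives gradients in this setup, so the scores measure how separable the classes already are in embedding space.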
#### 1. Jigsaw Toxic Comment Classification
**Dataset:** `Arsive/toxicity_classification_jigsaw` — Binary toxicity classification
**Train:** 25,960 · **Test:** 6,490
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| **Ogma** | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | **91.71%** | **91.24%** | **90.13%** | **92.39%** | **97.16%** |
With logistic regression, Ogma leads MiniLM by **+2.01 F1 points** (88.26% vs 86.25%). With an MLP head, MiniLM leads on this dataset (91.24% vs 87.98% F1), plausibly because the larger training set (25,960 samples) lets the MLP compensate for MiniLM's slightly weaker base representations.
#### 2. Prompt Injection Detection — deepset/prompt-injections
**Dataset:** `deepset/prompt-injections` — Binary injection detection
**Train:** 546 · **Test:** 116 (low-data regime)
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 86.21% | 84.62% | **100.0%** | 73.33% | 97.77% |
| **Ogma** | MLP | **90.52%** | **90.27%** | 96.23% | **85.0%** | **98.1%** |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |
Ogma leads with both classifiers: **+4.03 F1 points (MLP)**, **+4.23 F1 points (LogReg)**. Ogma's representations separate the classes better in this low-data regime: with LogReg it reaches 100% precision, meaning zero false positives on the 116-example test set.
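To make the relationship between these metrics concrete, the sketch below recomputes the Ogma + LogReg row of the table above from a confusion matrix. The counts are not reported in the source; they are inferred here as one set of counts on a 116-example test set that reproduces the published precision, recall, accuracy, and F1.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Inferred counts (an assumption, not published data): 60 injections and
# 56 benign prompts, with 16 injections missed and no benign prompt flagged.
metrics = binary_metrics(tp=44, fp=0, fn=16, tn=56)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
# precision: 100.00%
# recall: 73.33%
# f1: 84.62%
# accuracy: 86.21%
```

Under these counts, every flagged prompt is a true injection, but roughly one in four injections passes undetected. That is the trade-off behind the LogReg row's perfect precision and lower recall.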
#### 3. Prompt Injection Detection — neuralchemy/Prompt-injection-dataset
**Dataset:** `neuralchemy/Prompt-injection-dataset` — Binary injection detection
**Train:** 4,391 · **Test:** 942
| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 95.22% | 95.93% | **95.84%** | 96.01% | 99.30% |
| **Ogma** | MLP | **95.44%** | **96.16%** | 94.89% | **97.46%** | **99.37%** |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |
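Ogma leads on every metric when classifiers are matched: **+1.31 F1 points (MLP)**, **+0.55 F1 points (LR)**. Comparing each model's best configuration, the gap is +0.78 F1 points (96.16% vs 95.38%). Both models perform well at this scale; Ogma keeps its edge and reaches a higher AUC-ROC (99.37% vs 98.92%).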
#### Summary
| Task | Ogma best F1 | MiniLM best F1 | Δ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | **90.27% (MLP)** | 86.24% (MLP) | **+4.03%** |
| neuralchemy Injection | **96.16% (MLP)** | 95.38% (LR) | **+0.78%** |
Ogma is a stronger feature extractor for **prompt injection detection** — the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.
---
## Architecture
| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (`d_model`) | 128 |
| Output dim (`d_output`) | 128 |
| Transformer layers | 2 |
| Attention heads | 2 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | **1,024 tokens** |
| Pooling | Mean pooling |
| Task tokens | `[QRY]` (query), `[DOC]` (document), `[SYM]` (symmetric) |
| Matryoshka dims | [32, 64, 128] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 2.3M |
| Model file | `model.safetensors` (8.9 MB) |
**Key design choices:**
- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. **Recommended inference route: `[QRY]`/`[QRY]`** — encode both queries and documents with `[QRY]`; this benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. **We do not recommend `[DOC]` at inference time** — it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
- **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
- **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
- **L2 normalisation:** All outputs are unit-normalised, so cosine similarity equals the dot product, and squared Euclidean distance is a monotone function of it (‖a − b‖² = 2 − 2 a·b). All three produce the same ranking, which simplifies downstream usage.
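The pooling and normalisation steps above can be illustrated on toy data. This is a sketch under stated assumptions, not the model's implementation: the 4-dimensional "token outputs" are made up, and `mean_pool`, `l2_normalize`, and `dot` are hypothetical helpers written for this example.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors where mask == 1; padding positions are ignored."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy token outputs. The last position of sequence A is padding (mask 0),
# so its large values must not leak into the pooled vector.
tokens_a = [[1.0, 0.0, 2.0, 1.0], [3.0, 2.0, 0.0, 1.0], [9.0, 9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[0.0, 1.0, 1.0, 2.0], [2.0, 1.0, 1.0, 0.0], [2.0, 3.0, 1.0, 2.0]]
mask_b = [1, 1, 1]

a = l2_normalize(mean_pool(tokens_a, mask_a))
b = l2_normalize(mean_pool(tokens_b, mask_b))

cosine = dot(a, b)  # for unit vectors, cosine similarity is the dot product
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
print(f"dot={cosine:.6f}  dist^2={squared_distance:.6f}  2-2*dot={2 - 2 * cosine:.6f}")
```

Since the squared distance decreases as the dot product increases, a vector index configured for inner product, cosine, or L2 distance returns the same neighbours in the same order for these embeddings.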
---
## Usage
### Installation
```bash
pip install torch tokenizers transformers huggingface_hub
```
### Basic Encoding
```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("axiotic/ogma-micro", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-micro", trust_remote_code=True)
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
    "The capital of France is Paris",
]
emb = model.embed(sentences, task="sym", tokenizer=tok)
# emb has shape (3, 128): one L2-normalised 128-d vector per sentence
sim = (emb[0] @ emb[1]).item() # cosine sim == dot product (L2-normalised)
print(f"paraphrase: {sim:.4f}")
```
`task="sym"` is a safe default for all similarity tasks (STS, clustering,
classification) and for retrieval. Ogma is trained for **symmetric routing**:
queries and documents are always encoded with the **same** task token. The two
recommended routes are:
1. `[SYM]` for everything (the safe default above), or
2. `[QRY]`/`[QRY]` — encode both queries **and** documents with `task="qry"`.
Try both on your downstream task; either can win depending on the data, and
`[QRY]`/`[QRY]` is the natural starting point when fine-tuning a classifier or
retrieval head on top of the embeddings.
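One way to compare the two routes on your own data is to embed a small labelled query/document set under each route and compute recall@k. The sketch below shows only the scoring side. The vectors are made-up toy stand-ins for the outputs of `model.embed(..., task="sym")` and `model.embed(..., task="qry")`, `recall_at_k` is a hypothetical helper, and the resulting numbers say nothing about which route is actually better.

```python
def recall_at_k(query_embs, doc_embs, relevant_doc, k=1):
    """Fraction of queries whose relevant document ranks in the top k by dot product."""
    hits = 0
    for query, relevant in zip(query_embs, relevant_doc):
        scores = [sum(a * b for a, b in zip(query, doc)) for doc in doc_embs]
        top_k = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)[:k]
        hits += relevant in top_k
    return hits / len(query_embs)

# Toy embeddings per route (illustrative values only, not model outputs).
route_embeddings = {
    "sym": {"queries": [[1.0, 0.0], [0.0, 1.0]], "docs": [[0.9, 0.1], [0.2, 0.8]]},
    "qry": {"queries": [[1.0, 0.0], [0.6, 0.4]], "docs": [[0.9, 0.1], [0.2, 0.8]]},
}
relevant_doc = [0, 1]  # index of the relevant document for each query

results = {
    route: recall_at_k(embs["queries"], embs["docs"], relevant_doc, k=1)
    for route, embs in route_embeddings.items()
}
print(results)  # {'sym': 1.0, 'qry': 0.5}
```

With real embeddings, use a few hundred labelled pairs from the target domain so that differences between routes are larger than sampling noise.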
### Retrieval
Encode queries **and** documents with the **same** task token. Below we show the `[QRY]`/`[QRY]` route — both calls use `task="qry"`. This is intentional (Ogma is symmetric, not asymmetric); swap in `task="sym"` to compare the SYM route on your data.
```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("axiotic/ogma-micro", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-micro", trust_remote_code=True)
queries = ["What is knowledge distillation?"]
docs = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher.",
    "The Eiffel Tower is in Paris, France.",
]
q = model.embed(queries, task="qry", tokenizer=tok) # (128,) per query — symmetric: both sides use qry
d = model.embed(docs, task="qry", tokenizer=tok) # (128,) per doc — not a typo; Ogma is symmetric
scores = (q @ d.T).squeeze(0) # cosine sim (L2-normalised, dot == cosine)
print(scores.tolist()) # [higher, lower] — first doc is relevant
```
### Matryoshka — Flexible Dimensionality
Ogma is trained with Matryoshka Representation Learning. Slice and re-normalise
to any supported sub-dimension with no retraining:
```python
import torch, torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("axiotic/ogma-micro", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("axiotic/ogma-micro", trust_remote_code=True)
emb = model.embed(["hello world"], task="sym", tokenizer=tok) # full 128d
for d in model.config.matryoshka_dims:
    sub = F.normalize(emb[:, :d], dim=-1)  # truncate, then re-normalise to unit length
    print(f"{d}d norm={sub.norm(dim=-1).item():.4f}")
```
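The storage effect of truncation is simple arithmetic. The sketch below assumes a flat index of float32 values (4 bytes each) and ignores any ANN graph or metadata overhead; `index_size_mb` is a hypothetical helper, not part of the Ogma API.

```python
def index_size_mb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in megabytes for a flat index (float32 by default)."""
    return num_vectors * dim * bytes_per_value / 1e6

for dim in (128, 64, 32):
    size = index_size_mb(1_000_000, dim)
    print(f"{dim:>3}d: {size:.0f} MB per million vectors")
# 128d: 512 MB per million vectors
#  64d: 256 MB per million vectors
#  32d: 128 MB per million vectors
```

Each halving of the dimension halves raw storage, so the 32d setting uses a quarter of the full 128d footprint. Retrieval quality at each dimension should be checked on the target task before committing to a smaller size.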
## Model Family
| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **[ogma-large](https://huggingface.co/axiotic/ogma-large)** | 32.4M | 124 MB | **57.41** | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| **[ogma-base](https://huggingface.co/axiotic/ogma-base)** | 13.3M | 51 MB | 57.02 | 67.74 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| **[ogma-small](https://huggingface.co/axiotic/ogma-small)** | 8.6M | 33 MB | **56.32** | 66.49 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| **[ogma-mini](https://huggingface.co/axiotic/ogma-mini)** | 3.5M | 14 MB | 53.06 | 61.77 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| **[ogma-micro](https://huggingface.co/axiotic/ogma-micro)** | 2.3M | 8.9 MB | 52.18 | 59.53 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| *all-MiniLM-L6-v2* | 22.7M | 87 MB | *56.09* | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| *potion-base-32M* | 32.0M | 123 MB | *51.22* | 66.0 | 39.2 | 78.2 | 50.9 | 32.2 | 73.9 | 29.8 | 256 | inf |
| *potion-base-8M* | 7.6M | 29 MB | *50.03* | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |
All Ogma: MTEB 2.10.7, 66-task standard English set, category-averaged.
MiniLM/Potion: published scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).
---
## Training Details
| Property | Value |
|---|---|
| Teacher model | `jinaai/jina-embeddings-v5-text-small` (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |
---
## Limitations
- **No text generation.** Ogma is an encoder-only embedding model.
- **English only.** Training data and evaluation are English-only.
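- **Slower than static models.** Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. The trade-off is contextual understanding. The 4× context advantage applies relative to all-MiniLM-L6-v2 (256 tokens), not to static models, which have no fixed context limit.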
- **Non-commercial licence.** Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- **Reranking gap.** Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -8.3). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.
---
## Licence & Attribution
This model is released under **[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)** (Creative Commons Attribution-NonCommercial 4.0 International).
**Required attribution (must be included in all uses):**
> This model was trained via knowledge distillation from
> `jina-embeddings-v5-text-small` (https://huggingface.co/jinaai/jina-embeddings-v5-text-small)
> by Jina AI, licensed under CC-BY-NC-4.0.
---
## Citation
```bibtex
@misc{ogma2026,
title = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
author = {Axiotic AI},
year = {2026},
url = {https://huggingface.co/axiotic/ogma-micro},
}
```