	---
	license: cc-by-nc-4.0
	language:
	- en
	tags:
	- embeddings
	- dense-retrieval
	- matryoshka
	- rag
	- agents
	- mteb
	- sentence-similarity
	- semantic-search
	- text-embeddings
	- text-embedding
	- vector-search
	- document-retrieval
	- similarity-search
	- classification
	- clustering
	- edge-ai
	- on-device
	- local-inference
	- efficient-ai
	- rag-retrieval
	library_name: ogma
	metrics:
	- mteb
	model-index:
	- name: axiotic/ogma-base
	  results:
	  - task:
	      type: sts
	    dataset:
	      name: MTEB STSBenchmark
	      type: mteb/stsbenchmark-sts
	      split: test
	      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
	    metrics:
	    - type: cosine_spearman
	      value: 86.49
	  - task:
	      type: classification
	    dataset:
	      name: MTEB AmazonPolarityClassification
	      type: mteb/amazon_polarity
	      split: test
	    metrics:
	    - type: accuracy
	      value: 79.85
	  - task:
	      type: clustering
	    dataset:
	      name: MTEB RedditClustering
	      type: mteb/reddit-clustering
	      split: test
	    metrics:
	    - type: v_measure
	      value: 44.67
	  - task:
	      type: pair-classification
	    dataset:
	      name: MTEB TwitterSemEval2015
	      type: mteb/twittersemeval2015-pairclassification
	      split: test
	    metrics:
	    - type: cos_sim_ap
	      value: 70.79
	  - task:
	      type: reranking
	    dataset:
	      name: MTEB MindSmallReranking
	      type: mteb/mind_small
	      split: validation
	    metrics:
	    - type: map
	      value: 30.62
	  - task:
	      type: retrieval
	    dataset:
	      name: MTEB MSMARCO
	      type: mteb/msmarco
	      split: dev
	    metrics:
	    - type: ndcg_at_10
	      value: 35.86
	  - task:
	      type: summarization
	    dataset:
	      name: MTEB SummEval
	      type: mteb/summeval
	      split: test
	    metrics:
	    - type: cos_sim_spearman
	      value: 29.73
	pipeline_tag: sentence-similarity
	---

	# ogma-base  ·  13.3M efficient text embedding model  ·  MTEB 57.02

	> High-quality English text embedding model for semantic search, RAG, vector search, retrieval, clustering, classification, STS, and agent memory — MTEB 57.02, 13.3M parameters, 1024-token context

	Ogma Base is the quality-first mid-size model in the Ogma family. At 13.3M parameters it scores 57.02 on MTEB (the canonical 66-task result from the Ogma paper), with about 59% of the parameters of all-MiniLM-L6-v2 (13.3M vs 22.7M) and 4× longer input sequences (1024 vs 256 tokens). That makes it the sweet spot for quality-first RAG pipelines and agent memory.

	## Why the name Ogma?

	Ogma is named after Ogma (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing Ogham, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.

	---

	## Use cases

	ogma-base is the quality-first Ogma model for semantic search, RAG retrieval, agent memory, vector databases, document retrieval, text classification, clustering, STS / sentence similarity, and retrieval-heavy agent pipelines. It is aimed at users who want better quality than MiniLM-class models while keeping the model small enough for practical CPU deployment.

	Good fits:

	- Production RAG and enterprise search where retrieval quality matters but the model still needs to be lightweight.
	- Local or private embedding services for teams that want to avoid external embedding APIs for sensitive text.
	- Agent memory systems with long context chunks, symmetric query/document encoding (SYM everywhere, or QRY/QRY), and frequent retrieval.
	- Efficient CPU deployments where 13.3M parameters is easier to host than larger embedding transformers.
	- Classification, clustering, and routing features where embedding quality directly affects downstream decisions.

	Choose ogma-base when you want the strongest general-purpose Ogma model before moving into larger, accuracy-first territory.

	---

	## Highlights

	- 🏆 MTEB avg 57.02 — canonical Ogma paper result over 66/66 MTEB English tasks
	- 📏 1024-token context — 4× longer than all-MiniLM-L6-v2 (256 tokens)
	- 🔀 Symmetric routing via task tokens — encode everything with `[SYM]`, or use `[QRY]`/`[QRY]` for retrieval (queries and documents both encoded with `task="qry"`); benchmark both routes on your task
	- 📐 Matryoshka dims: [256, 128, 64, 32] — one model, any precision
	- 🛡️ +4.0 F1 points on prompt injection detection vs MiniLM (deepset/prompt-injections, MLP head; evaluated on the same Ogma architecture family)

	---

	## Performance

	### MTEB English — 66/66 tasks (category-averaged)

	Benchmarked with [MTEB](https://github.com/embeddings-benchmark/mteb) v2.10.7 on the standard 66-task English benchmark using category averaging (same methodology as the MTEB leaderboard).

	\| Category \| ogma-base \| all-MiniLM-L6-v2 \| Δ vs MiniLM \|
	\|---\|---\|---\|---\|
	\| Classification \| 67.74 \| 62.62 \| +5.12 \|
	\| Clustering \| 41.49 \| 41.94 \| -0.45 \|
	\| PairClassification \| 83.73 \| 82.37 \| +1.36 \|
	\| Reranking \| 51.25 \| 58.04 \| -6.79 \|
	\| Retrieval \| 42.36 \| 41.95 \| +0.41 \|
	\| STS \| 82.84 \| 78.90 \| +3.94 \|
	\| Summarization \| 29.73 \| 30.81 \| -1.08 \|
	\| Overall \| 57.02 \| 56.09 \| +0.93 \|
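
Category averaging gives every category equal weight, however many tasks it contains. The overall ogma-base score can be reproduced from the category rows above with a few lines of Python. This is only a worked check of the arithmetic; the helper name `category_average` is illustrative and not part of MTEB.

```python
# Category-level scores for ogma-base, copied from the table above.
OGMA_BASE = {
    "Classification": 67.74,
    "Clustering": 41.49,
    "PairClassification": 83.73,
    "Reranking": 51.25,
    "Retrieval": 42.36,
    "STS": 82.84,
    "Summarization": 29.73,
}


def category_average(category_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-category scores (hypothetical helper)."""
    return sum(category_scores.values()) / len(category_scores)


overall = category_average(OGMA_BASE)
print(f"{overall:.2f}")  # 57.02
```

Because each category counts once, a small category such as Summarization moves the overall score as much as Retrieval does.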

	### Why choose Ogma Base?

	ogma-base is the recommended choice when you want a strong quality/size tradeoff without going to full transformer scale (bge, e5). In the external comparison it sits in the same quality class as MiniLM while being smaller and supporting a longer context.

	### Safety — Toxicity & Prompt Injection Detection

	Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted then fed to a logistic regression (LR) or MLP classifier head — the embedding model itself is not fine-tuned. Evaluated against `all-MiniLM-L6-v2` as baseline.

	#### 1. Jigsaw Toxic Comment Classification

	Dataset: `Arsive/toxicity_classification_jigsaw` — Binary toxicity classification
	Train: 25,960 · Test: 6,490

	\| Model \| Classifier \| Accuracy \| F1 \| Precision \| Recall \| AUC-ROC \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| Ogma \| LogReg \| 89.12% \| 88.26% \| 89.09% \| 87.44% \| 95.74% \|
	\| Ogma \| MLP \| 88.91% \| 87.98% \| 89.14% \| 86.85% \| 95.92% \|
	\| MiniLM \| LogReg \| 87.32% \| 86.25% \| 87.46% \| 85.07% \| 94.96% \|
	\| MiniLM \| MLP \| 91.71% \| 91.24% \| 90.13% \| 92.39% \| 97.16% \|

	Ogma (LR) leads MiniLM (LR) by +2.01 F1 points. MiniLM (MLP) is the best configuration on this dataset; a plausible explanation is that the larger training set (25,960 samples) lets the MLP compensate for MiniLM's slightly weaker base representations.

	#### 2. Prompt Injection Detection — deepset/prompt-injections

	Dataset: `deepset/prompt-injections` — Binary injection detection
	Train: 546 · Test: 116 (low-data regime)

	\| Model \| Classifier \| Accuracy \| F1 \| Precision \| Recall \| AUC-ROC \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| Ogma \| LogReg \| 86.21% \| 84.62% \| 100.0% \| 73.33% \| 97.77% \|
	\| Ogma \| MLP \| 90.52% \| 90.27% \| 96.23% \| 85.0% \| 98.1% \|
	\| MiniLM \| LogReg \| 82.76% \| 80.39% \| 97.62% \| 68.33% \| 94.52% \|
	\| MiniLM \| MLP \| 87.07% \| 86.24% \| 95.92% \| 78.33% \| 93.96% \|

	Ogma leads with both classifiers: +4.03 F1 points (MLP) and +4.23 F1 points (LogReg). Its representations appear better separated in the low-data regime: with LogReg it reaches 100% precision, meaning zero false positives on the 116-example test set.
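
The metrics in the table are tied together through the confusion matrix. As a worked check, the sketch below recomputes the Ogma LogReg row from confusion counts. The counts themselves are not published: they are reconstructed from the rounded percentages, assuming 60 injection and 56 benign prompts in the 116-example test set. The function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 (in percent) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f1": round(100 * f1, 2),
    }


# Assumed counts (reconstructed, not published): 44 injections caught,
# 16 missed, 0 benign prompts flagged, 56 benign prompts passed.
ogma_logreg = binary_metrics(tp=44, fp=0, fn=16, tn=56)
print(ogma_logreg)
```

Under these assumed counts, precision is perfect because no benign prompt is flagged, while the 16 missed injections hold recall at 73.33%. That is why the MLP head reaches a higher F1 despite its lower precision.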

	#### 3. Prompt Injection Detection — neuralchemy/Prompt-injection-dataset

	Dataset: `neuralchemy/Prompt-injection-dataset` — Binary injection detection
	Train: 4,391 · Test: 942

	\| Model \| Classifier \| Accuracy \| F1 \| Precision \| Recall \| AUC-ROC \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| Ogma \| LogReg \| 95.22% \| 95.93% \| 95.84% \| 96.01% \| 99.30% \|
	\| Ogma \| MLP \| 95.44% \| 96.16% \| 94.89% \| 97.46% \| 99.37% \|
	\| MiniLM \| LogReg \| 94.59% \| 95.38% \| 95.46% \| 95.29% \| 98.92% \|
	\| MiniLM \| MLP \| 93.95% \| 94.85% \| 94.59% \| 95.11% \| 98.92% \|

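	Ogma leads on every metric with both classifiers: +1.31 F1 points (MLP) and +0.55 F1 points (LR) like for like, or +0.78 when each model's best head is compared. Both models perform well at this scale; Ogma keeps its edge and reaches a higher AUC-ROC (99.37% vs 98.92%).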

	#### Summary

	\| Task \| Ogma best F1 \| MiniLM best F1 \| Δ \|
	\|---\|---\|---\|---\|
	\| Jigsaw Toxicity \| 88.26% (LR) \| 91.24% (MLP) \| −2.98% \|
	\| deepset Injection \| 90.27% (MLP) \| 86.24% (MLP) \| +4.03% \|
	\| neuralchemy Injection \| 96.16% (MLP) \| 95.38% (LR) \| +0.78% \|
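
The Δ column compares each model's best classifier head, not the same head on both sides. The short check below recomputes it from the F1 columns of the three tables above; the dictionary layout is just one convenient way to hold the numbers.

```python
# F1 scores (%) copied from the three tables above, as (LogReg, MLP) per model.
F1 = {
    "Jigsaw Toxicity": {"Ogma": (88.26, 87.98), "MiniLM": (86.25, 91.24)},
    "deepset Injection": {"Ogma": (84.62, 90.27), "MiniLM": (80.39, 86.24)},
    "neuralchemy Injection": {"Ogma": (95.93, 96.16), "MiniLM": (95.38, 94.85)},
}

# Best-vs-best difference in F1 points for each task.
deltas = {
    task: round(max(scores["Ogma"]) - max(scores["MiniLM"]), 2)
    for task, scores in F1.items()
}

for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f}")
```

For neuralchemy, the +0.78 therefore pairs Ogma's MLP head with MiniLM's LogReg head.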

	Ogma is a stronger feature extractor for prompt injection detection, the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.

	---

	## Architecture

	\| Property \| Value \|
	\|---\|---\|
	\| Architecture \| Custom Transformer \|
	\| Internal dim (`d_model`) \| 256 \|
	\| Output dim (`d_output`) \| 256 \|
	\| Transformer layers \| 12 \|
	\| Attention heads \| 4 \|
	\| Vocabulary \| 30,000 (SentencePiece / AlbertTokenizer) \|
	\| Max sequence length \| 1,024 tokens \|
	\| Pooling \| Mean pooling \|
	\| Task tokens \| `[QRY]` (query), `[DOC]` (document), `[SYM]` (symmetric) \|
	\| Matryoshka dims \| [32, 64, 128, 256] \|
	\| Output normalisation \| L2 (unit sphere) \|
	\| Parameters \| 13.3M \|
	\| Model file \| `model.safetensors` (51 MB) \|

	Key design choices:

	- Task token prepend: A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. Recommended inference route: `[QRY]`/`[QRY]` — encode both queries and documents with `[QRY]`; this benchmarked highest on MTEB. `[SYM]` everywhere is the next-best symmetric alternative. We do not recommend `[DOC]` at inference time — it is exposed for downstream fine-tuning, not as an asymmetric query/document route.
	- Matryoshka training: The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
	- Mean pooling: The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
	- L2 normalisation: All outputs are unit-normalised, so cosine similarity equals the dot product and squared Euclidean distance equals 2 − 2·cosine. All three therefore produce the same ranking, which simplifies downstream usage.
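
The pooling and normalisation steps in the list above can be illustrated without the model. The sketch below is plain Python on toy 2-d vectors. It does not use the Ogma code, and the helper names are illustrative. It shows padding-aware mean pooling, L2 normalisation, and how dot product, cosine and Euclidean distance relate for unit vectors.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy "token outputs"; the last position of each sequence is padding.
a = l2_normalise(masked_mean_pool([[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
b = l2_normalise(masked_mean_pool([[0.0, 2.0], [4.0, 0.0], [7.0, 7.0]], [1, 1, 0]))

cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors the dot product is the cosine similarity ...
print(math.isclose(cosine, dot(a, b)))
# ... and the squared Euclidean distance is 2 - 2 * cosine.
print(math.isclose(sq_euclid, 2 - 2 * dot(a, b)))
```

In practice this means a vector index configured for inner product, cosine or L2 distance should return the same neighbours for these embeddings.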

	---

	## Usage

	### Installation

	```bash
	pip install torch tokenizers transformers huggingface_hub
	```

	### Basic Encoding

	```python
	from transformers import AutoModel, AutoTokenizer

	model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
	tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

	sentences = [
	    "The quick brown fox jumps over the lazy dog",
	    "A fast auburn vulpine leaps over an idle canine",
	    "The capital of France is Paris",
	]
	emb = model.embed(sentences, task="sym", tokenizer=tok)
	# emb holds one L2-normalised 256-d vector per sentence (indexed as emb[i])

	sim = (emb[0] @ emb[1]).item() # cosine sim == dot product (L2-normalised)
	print(f"paraphrase: {sim:.4f}")
	```

	`task="sym"` is a safe default for all similarity tasks (STS, clustering,
	classification) and for retrieval. Ogma is trained for symmetric routing —
	queries and documents are always encoded with the same task token. The two
	recommended routes are:

	1. `[SYM]` for everything (the safe default above), or
	2. `[QRY]`/`[QRY]` — encode both queries and documents with `task="qry"`.

	Try both on your downstream task; either can win depending on the data, and
	`[QRY]`/`[QRY]` is the natural starting point when fine-tuning a classifier or
	retrieval head on top of the embeddings.

	### Retrieval

	Encode queries and documents with the same task token. Below we show the `[QRY]`/`[QRY]` route — both calls use `task="qry"`. This is intentional (Ogma is symmetric, not asymmetric); swap in `task="sym"` to compare the SYM route on your data.

	```python
	from transformers import AutoModel, AutoTokenizer

	model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
	tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

	queries = ["What is knowledge distillation?"]
	docs = [
	    "Knowledge distillation trains a smaller student model to mimic a larger teacher.",
	    "The Eiffel Tower is in Paris, France.",
	]

	q = model.embed(queries, task="qry", tokenizer=tok) # (256,) per query — symmetric: both sides use qry
	d = model.embed(docs, task="qry", tokenizer=tok) # (256,) per doc — not a typo; Ogma is symmetric

	scores = (q @ d.T).squeeze(0) # cosine sim (L2-normalised, dot == cosine)
	print(scores.tolist()) # [higher, lower] — first doc is relevant
	```
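
To compare the `qry` and `sym` routes on your own data, a small harness that scores each route on labelled query/document pairs is usually enough. The sketch below is one hedged way to do that. `recall_at_1` and `toy_embed` are hypothetical helpers, and `toy_embed` is a keyword-count stand-in so the example runs without downloading the model.

```python
import math


def recall_at_1(embed, queries, docs, relevant, task):
    """Fraction of queries whose top-scoring document is the labelled one.

    `embed(texts, task)` must return one L2-normalised vector per text;
    `relevant[i]` is the index in `docs` of the correct document for query i.
    """
    q_vecs = embed(queries, task)
    d_vecs = embed(docs, task)  # same task token on both sides (symmetric)
    hits = 0
    for q, gold in zip(q_vecs, relevant):
        scores = [sum(x * y for x, y in zip(q, d)) for d in d_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == gold)
    return hits / len(queries)


def toy_embed(texts, task):
    """Stand-in encoder: keyword counts, unit-normalised. Ignores `task`."""
    keywords = ("distillation", "eiffel")
    vectors = []
    for text in texts:
        counts = [float(text.lower().count(word)) for word in keywords]
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        vectors.append([c / norm for c in counts])
    return vectors


queries = ["What is knowledge distillation?", "Where is the Eiffel Tower?"]
docs = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher.",
    "The Eiffel Tower is in Paris, France.",
]
relevant = [0, 1]

for task in ("qry", "sym"):
    print(task, recall_at_1(toy_embed, queries, docs, relevant, task))
```

With the real model, `embed` could wrap `model.embed(texts, task=task, tokenizer=tok).tolist()`, assuming it returns a 2-D tensor as the examples above suggest. The stand-in ignores the task token, so both routes score identically here; only a real encoder will show a difference.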

	### Matryoshka — Flexible Dimensionality

	Ogma is trained with Matryoshka Representation Learning. Slice and re-normalise
	to any supported sub-dimension with no retraining:

	```python
	import torch.nn.functional as F
	from transformers import AutoModel, AutoTokenizer

	model = AutoModel.from_pretrained("axiotic/ogma-base", trust_remote_code=True).eval()
	tok = AutoTokenizer.from_pretrained("axiotic/ogma-base", trust_remote_code=True)

	emb = model.embed(["hello world"], task="sym", tokenizer=tok) # full 256d

	# Truncate to each supported size, then re-normalise back to unit length.
	for d in model.config.matryoshka_dims:
	    sub = F.normalize(emb[:, :d], dim=-1)
	    print(f"{d}d norm={sub.norm(dim=-1).item():.4f}")
	```
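
Re-normalising matters because the prefix of a unit vector is shorter than unit length, so a dot product on raw prefixes no longer equals cosine similarity. The model-free sketch below shows this on a synthetic vector and also lists the storage cost per vector. The float32 storage size is an assumption, and the vector is a deterministic stand-in, not model output.

```python
import math

MATRYOSHKA_DIMS = [256, 128, 64, 32]  # supported sizes listed in this README
BYTES_PER_VALUE = 4                   # assumption: vectors stored as float32


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def truncate_and_renormalise(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = vec[:dim]
    norm = l2_norm(head)
    return [x / norm for x in head]


# Deterministic stand-in for a unit-length 256-d embedding (not model output).
full = truncate_and_renormalise([math.sin(i + 1) for i in range(256)], 256)

rows = []
for dim in MATRYOSHKA_DIMS:
    raw_norm = l2_norm(full[:dim])             # at most 1; below 1 for dim < 256
    sub = truncate_and_renormalise(full, dim)  # back on the unit sphere
    rows.append((dim, raw_norm, l2_norm(sub), dim * BYTES_PER_VALUE))

for dim, raw_norm, norm, size in rows:
    print(f"{dim:>3}d  raw norm={raw_norm:.4f}  renormalised={norm:.4f}  bytes={size}")
```

Under the float32 assumption, going from 256 to 32 dimensions cuts storage per vector by a factor of eight. How much quality is lost at each size should be measured on your own task.
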

	---

	## Model Family

	\| Model \| Params \| Size \| MTEB Avg \| Class \| Clust \| PairClass \| Rerank \| Ret \| STS \| Summ \| d_out \| Context \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| [ogma-large](https://huggingface.co/axiotic/ogma-large) \| 32.4M \| 124 MB \| 57.41 \| 68.6 \| 41.6 \| 84.0 \| 53.1 \| 43.7 \| 83.7 \| 30.9 \| 256 \| 1024 \|
	\| [ogma-base](https://huggingface.co/axiotic/ogma-base) \| 13.3M \| 51 MB \| 57.02 \| 67.74 \| 41.49 \| 83.73 \| 51.25 \| 42.36 \| 82.84 \| 29.73 \| 256 \| 1024 \|
	\| [ogma-small](https://huggingface.co/axiotic/ogma-small) \| 8.6M \| 33 MB \| 56.32 \| 66.49 \| 40.69 \| 82.91 \| 50.51 \| 42.05 \| 82.00 \| 29.59 \| 256 \| 1024 \|
	\| [ogma-mini](https://huggingface.co/axiotic/ogma-mini) \| 3.5M \| 14 MB \| 53.06 \| 61.77 \| 37.38 \| 79.66 \| 47.39 \| 36.21 \| 77.71 \| 31.33 \| 256 \| 1024 \|
	\| [ogma-micro](https://huggingface.co/axiotic/ogma-micro) \| 2.3M \| 8.9 MB \| 52.18 \| 59.53 \| 36.88 \| 78.62 \| 49.74 \| 33.09 \| 75.63 \| 31.77 \| 128 \| 1024 \|
	\| all-MiniLM-L6-v2 \| 22.7M \| 87 MB \| 56.09 \| 62.62 \| 41.94 \| 82.37 \| 58.04 \| 41.95 \| 78.90 \| 30.81 \| 384 \| 256 \|
	\| potion-base-32M \| 32.0M \| 123 MB \| 51.22 \| 66.0 \| 39.2 \| 78.2 \| 50.9 \| 32.2 \| 73.9 \| 29.8 \| 256 \| inf \|
	\| potion-base-8M \| 7.6M \| 29 MB \| 50.03 \| 64.44 \| 32.93 \| 76.62 \| 49.73 \| 31.71 \| 73.24 \| 29.28 \| 256 \| inf \|

	All Ogma: MTEB 2.10.7, 66-task standard English set, category-averaged.
	MiniLM/Potion: published scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).

	---

	## Training Details

	\| Property \| Value \|
	\|---\|---\|
	\| Teacher model \| `jinaai/jina-embeddings-v5-text-small` (CC-BY-NC-4.0) \|
	\| Training paradigm \| Knowledge distillation from cached teacher embeddings \|
	\| Training data \| ~7M curated English sentence pairs \|
	\| Tokenizer \| AlbertTokenizer (SentencePiece, vocab=30,000) \|
	\| Embedding initialisation \| PCA of teacher embeddings (128d) projected to d_model \|
	\| Loss \| Distillation + contrastive (balanced schedule) \|
	\| Evaluation framework \| MTEB 2.10.7 \|

	---

	## Limitations

	- No text generation. Ogma is an encoder-only embedding model.
	- English only. Training data and evaluation are English-only.
	- Slower than static models. Transformer inference is 40-100× slower than static models (Potion, Model2Vec) on CPU. In exchange you get contextual understanding. Note that the 4× context advantage quoted above is relative to all-MiniLM-L6-v2 (256 tokens), not to static models.
	- Non-commercial licence. Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
	- Reranking gap. Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -6.8). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.

	---

	## Licence & Attribution

	This model is released under [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/) (Creative Commons Attribution-NonCommercial 4.0 International).

	Required attribution (must be included in all uses):

	> This model was trained via knowledge distillation from
	> `jina-embeddings-v5-text-small` (https://huggingface.co/jinaai/jina-embeddings-v5-text-small)
	> by Jina AI, licensed under CC-BY-NC-4.0.

	---

	## Citation

	```bibtex
	@misc{ogma2026,
	  title  = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
	  author = {Axiotic AI},
	  year   = {2026},
	  url    = {https://huggingface.co/axiotic/ogma-base},
	}
	```