Aptivra-Base-110M

A compact (110M-parameter) sentence-embedding model for skill routing and semantic retrieval, fine-tuned from intfloat/e5-base-v2. It encodes a user request and a catalog of skill/tool documents into 768-dim vectors so an agent, router, or MCP server can retrieve the most relevant candidates.

Aptivra-Base-110M is an embedding model, not a chat model.

✅ Use it for	❌ Do not use it for
query / document embeddings	chat completion
skill routing	instruction following
semantic retrieval	text generation
vector search	(it produces vectors, not text)
candidate ranking

Shipped in multiple runtimes — PyTorch (safetensors), ONNX, OpenVINO, and GGUF (in the companion repo raghunath1/Aptivra-Base-110M-GGUF) — all producing the same 768-dim embedding.

⚠️ Experimental — validate before production use

This is a research preview, not a validated/production-certified system. What is proven: the base router beats the intfloat/e5-base-v2 baseline at the embedding level on the clean routing eval (Recall@1 0.820 → 0.958, Δ +0.138, 95% CI [0.102, 0.175]). What is NOT validated: per-pack routing quality (domain packs are experimental, unvalidated), high-stakes packs (medical/legal/finance/… are research-only), and the governance layer (not in the served path). A wrong route can cause downstream harm even though this model does not produce the final answer — validate behavior and evidence for your use case, and apply downstream validation, policy, safety, and permission checks before acting on a route. Full disclaimer, evidence, and gate status in the source repository.

Input format (important)

Unlike vanilla e5, this fine-tune was trained and evaluated on plain text — no query: / passage: prefix. Feed the raw request and the raw skill/document text. The model applies mean pooling and returns L2-normalized vectors; compare them with cosine similarity (a dot product, since they are normalized). Adding e5-style prefixes will not reproduce the validated numbers below.
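The pooling and normalization described above can be illustrated with a small dependency-free sketch. It is not the model's own code: the 3-dim token vectors and the skill vector are made-up numbers, and the helper names (`mean_pool`, `l2_normalize`, `dot`, `cosine`) are hypothetical. It only shows why a plain dot product is enough once vectors are L2-normalized.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask == 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    count = max(len(kept), 1)  # guard against an all-padding row
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dim "token embeddings" (made-up numbers); the last token is padding.
tokens = [[1.0, 2.0, 0.0], [3.0, 0.0, 4.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

query_vec = l2_normalize(mean_pool(tokens, mask))
skill_vec = l2_normalize([2.0, 1.0, 1.0])

# For unit-length vectors the dot product and cosine similarity coincide.
score = dot(query_vec, skill_vec)
```

In practice the backends listed below do this for you; the sketch is only meant to make the scoring rule concrete.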

Available formats

This repo (PyTorch + ONNX + OpenVINO):

Path	Backend	Precision	Size	Best for
`model.safetensors`	PyTorch / sentence-transformers	fp32	438 MB	reference; training; GPU
`onnx/model.onnx`	ONNX Runtime	fp32	436 MB	portable CPU/GPU inference
`onnx/model_O3.onnx`	ONNX Runtime	fp32 (graph-opt O3)	436 MB	fastest fp32 on CPU
`onnx/model_qint8_avx2.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX2)
`onnx/model_qint8_avx512.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX-512)
`onnx/model_qint8_avx512_vnni.onnx`	ONNX Runtime	int8	110 MB	x86 CPU (AVX-512 VNNI, e.g. Ice Lake+)
`onnx/model_qint8_arm64.onnx`	ONNX Runtime	int8	110 MB	ARM CPU (Apple Silicon, AWS Graviton)
`openvino/openvino_model.xml` (+`.bin`)	OpenVINO	fp32	436 MB	Intel CPU/iGPU/NPU
`openvino/openvino_model_qint8.xml` (+`.bin`)	OpenVINO	int8 (weight-only)	110 MB	Intel CPU, smaller

GGUF (llama.cpp) lives in the companion repo raghunath1/Aptivra-Base-110M-GGUF: F16, Q8_0, Q4_K_M.

Backend parity (Recall@1)

Identical 400-query routing eval, ~5,897-skill corpus, plain text, top-1 retrieval. Every backend receives the same (512-token-truncated) input, so deltas are pure backend/quantization effect. safetensors is the reference.

Backend	Precision	Fidelity vs fp32 (mean cosine)	Recall@1
safetensors (PyTorch)	fp32	1.00000 (reference)	0.958
ONNX (`model.onnx`, `model_O3.onnx`)	fp32	1.00000	≡ reference
ONNX int8 (`_qint8_`)	int8	0.98957 ¹	≈ reference
OpenVINO (`openvino_model`)	fp32	0.99986	≡ reference
OpenVINO int8 (`openvino_model_qint8`)	int8 (weight-only)	0.99938	≡ reference
GGUF F16	F16	0.99999	≡ reference
GGUF Q8_0	Q8_0	0.99984	≡ reference
GGUF Q4_K_M	Q4_K_M	0.98618	≈ reference

Method: each backend encodes the identical plain-text routing eval; fidelity is the mean cosine similarity between its embeddings and the fp32 reference embeddings. When fidelity is ≈ 1.0, the rankings, and therefore Recall@1, match the reference. The ONNX int8 and GGUF Q4_K_M rows perturb embeddings by roughly 1.0% to 1.4% (1 minus the mean cosine), trading a little accuracy for size and speed.

¹ Measured on the arm64 int8 build (Apple Silicon). The avx2 / avx512 / avx512_vnni files use the same dynamic-int8 recipe for the respective x86 instruction sets.

OpenVINO int8 is weight-only quantization (weights int8, activations fp) — chosen because static activation PTQ collapses this embedding model. Weight-only preserves fidelity (0.99938) at the int8 size (110 MB).

int8 / Q4 are size-and-speed tradeoffs. The table is the honest record of what each precision costs in retrieval accuracy — pick the smallest one whose Recall@1 you can live with.
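If you want to repeat the fidelity check on your own texts, the metric described in the Method note can be sketched as follows. This is a minimal illustration, not the script used for the table: `mean_cosine_fidelity` is a hypothetical helper name, and the 2-dim rows are made-up stand-ins for real fp32 and quantized embeddings of the same inputs.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_cosine_fidelity(reference, candidate):
    """Mean cosine between row i of the reference and row i of the candidate."""
    if len(reference) != len(candidate):
        raise ValueError("both backends must encode the same texts in the same order")
    return sum(cosine(r, c) for r, c in zip(reference, candidate)) / len(reference)

# Made-up 2-dim embeddings standing in for fp32 and quantized outputs.
fp32_rows = [[1.0, 0.0], [0.0, 1.0]]
int8_rows = [[0.99, 0.05], [0.02, 1.01]]

fidelity = mean_cosine_fidelity(fp32_rows, int8_rows)
perturbation = 1.0 - fidelity  # assumed reading of the "perturbation" percentages
```

A fidelity close to 1.0 on your data is a reasonable first signal, but it does not replace measuring Recall@1 on your own routing queries.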

Usage

Works the same on Windows, Linux, and macOS unless noted. Examples assume Python 3.9+.

1. Python — sentence-transformers (safetensors, recommended reference)

pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("raghunath1/Aptivra-Base-110M")          # PyTorch
queries = ["set up a browser automation task"]
skills  = ["Automate a web browser to click, type, and navigate pages."]

q = model.encode(queries, normalize_embeddings=True)
s = model.encode(skills,  normalize_embeddings=True)
print(q @ s.T)   # cosine similarity (vectors are L2-normalized)

2. ONNX Runtime

Via sentence-transformers (picks the right CPU kernel automatically):

pip install "sentence-transformers[onnx]"        # CPU
# pip install "sentence-transformers[onnx-gpu]"  # NVIDIA GPU

from sentence_transformers import SentenceTransformer

# fp32:
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="onnx")

# int8 — choose the file matching your CPU:
#   x86 (most Intel/AMD): onnx/model_qint8_avx512_vnni.onnx  (or _avx2 on older CPUs)
#   ARM (Apple Silicon, Graviton): onnx/model_qint8_arm64.onnx
model = SentenceTransformer(
    "raghunath1/Aptivra-Base-110M", backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx",
                  "provider": "CPUExecutionProvider"},   # on macOS, pin CPU to avoid CoreML EP
)
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Raw onnxruntime (no sentence-transformers), with manual mean-pool + L2-norm:

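import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo = "raghunath1/Aptivra-Base-110M"
tok = AutoTokenizer.from_pretrained(repo)
# Download the ONNX file from the Hub (or pass the path of a local copy instead).
onnx_path = hf_hub_download(repo_id=repo, filename="onnx/model.onnx")
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
enc = tok(["semantic search query"], padding=True, truncation=True, max_length=512, return_tensors="np")
# Feed only the inputs the tokenizer produced; output 0 is taken to be the per-token hidden states.
out = sess.run(None, {k: enc[k] for k in ("input_ids","attention_mask","token_type_ids") if k in enc})[0]
mask = enc["attention_mask"][..., None].astype(out.dtype)      # 1 for real tokens, 0 for padding
emb = (out * mask).sum(1) / np.clip(mask.sum(1), 1e-9, None)   # mean pool over real tokens
emb /= np.linalg.norm(emb, axis=1, keepdims=True)             # L2 normalize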

Windows: onnxruntime ships prebuilt wheels; the avx512_vnni int8 file is fastest on recent Intel.
Linux: same; on AWS Graviton / ARM servers use the arm64 int8 file.
macOS (Apple Silicon): use the arm64 int8 file and pin CPUExecutionProvider (the CoreML EP can fail to build this graph).
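The platform notes above can be turned into a small selection helper. This is a sketch under assumptions: `pick_int8_file` and `linux_cpu_flags` are hypothetical names, the flag strings follow the Linux `/proc/cpuinfo` spelling, and falling back to the AVX2 file on any x86-64 CPU without AVX-512 assumes the CPU supports AVX2. On Windows and macOS the flag set stays empty, so pass the flags yourself if you know them.

```python
import platform

def pick_int8_file(machine, cpu_flags=frozenset()):
    """Map a CPU architecture (and optional x86 feature flags) to an int8 ONNX file."""
    arch = machine.lower()
    if arch in ("arm64", "aarch64"):
        return "onnx/model_qint8_arm64.onnx"
    if arch in ("x86_64", "amd64"):
        if "avx512_vnni" in cpu_flags:
            return "onnx/model_qint8_avx512_vnni.onnx"
        if "avx512f" in cpu_flags:
            return "onnx/model_qint8_avx512.onnx"
        return "onnx/model_qint8_avx2.onnx"  # conservative x86 default (assumes AVX2)
    return "onnx/model.onnx"  # unknown architecture: fall back to fp32

def linux_cpu_flags(path="/proc/cpuinfo"):
    """Best-effort read of x86 feature flags; returns an empty set elsewhere."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return frozenset(line.split(":", 1)[1].split())
    except OSError:
        pass
    return frozenset()

file_name = pick_int8_file(platform.machine(), linux_cpu_flags())
```

The resulting `file_name` can be passed as `model_kwargs={"file_name": file_name}` in the sentence-transformers example earlier in this section.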

3. Transformers.js (browser / Node.js)

import { pipeline } from '@huggingface/transformers';
const extractor = await pipeline('feature-extraction', 'raghunath1/Aptivra-Base-110M',
  { dtype: 'q8' });                       // uses the ONNX int8 weights
const emb = await extractor(['semantic search query'],
  { pooling: 'mean', normalize: true });

Runs in-browser (WebAssembly/WebGPU) and in Node on Windows/Linux/macOS — same code.

4. OpenVINO (Intel CPU / iGPU / NPU)

pip install "sentence-transformers[openvino]"

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="openvino")
# int8: model_kwargs={"file_name": "openvino_model_qint8.xml"}
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Best on Intel hardware (Core/Xeon, Arc, NPU). The Python API is identical on Windows/Linux/macOS; the OpenVINO runtime wheel is installed automatically.

5. llama.cpp / GGUF

GGUF builds are in raghunath1/Aptivra-Base-110M-GGUF. Use embedding mode with mean pooling + L2 normalize:

llama-embedding -m Aptivra-Base-110M-Q8_0.gguf -p "semantic search query" \
  --pooling mean --embd-normalize 2

LM Studio / Ollama caveat: these tools are built around chat/completion models. This is an embedding model — use it only through an embeddings endpoint (e.g. llama-server → POST /v1/embeddings, or Ollama's /api/embeddings), not the chat UI. It will not generate text.
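As an illustration of the embeddings-endpoint route, here is a standard-library sketch of a client for an OpenAI-style `POST /v1/embeddings` endpoint. Several things are assumptions rather than facts from this repo: the server address `http://localhost:8080`, the placeholder model name `aptivra`, that the server was started in embedding mode with mean pooling, and that the response follows the OpenAI embeddings shape (`data[i].index`, `data[i].embedding`). The helper names are hypothetical.

```python
import json
import urllib.request

def build_embeddings_request(texts, base_url="http://localhost:8080", model="aptivra"):
    """Build (but do not send) an OpenAI-style POST /v1/embeddings request."""
    body = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings_response(payload):
    """Extract vectors from an OpenAI-style response, ordered by their index."""
    rows = sorted(payload["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]

def embed(texts, **kwargs):
    """Send the request; needs a running server started in embedding mode."""
    with urllib.request.urlopen(build_embeddings_request(texts, **kwargs)) as resp:
        return parse_embeddings_response(json.load(resp))

# Building the request does not touch the network; embed(...) does.
request = build_embeddings_request(["semantic search query"])
```

Check whether your server already L2-normalizes the returned vectors before comparing them with a dot product.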

Evaluation

The proven claim (embedding-level lift over baseline). Measured with the reranker OFF — isolating the embedding's own contribution — against intfloat/e5-base-v2, paired bootstrap, on the current corpus:

Eval set (rerank OFF)	baseline	Aptivra	Δ Recall@1	95% CI
Human routing	0.820	0.958	+0.138	[0.102, 0.175]
Weak-pair (noisy)	0.654	0.701	+0.047	[0.040, 0.054]
Adversarial	0.784	0.817	+0.033	[-0.006, 0.071] (not significant)
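For readers unfamiliar with the paired bootstrap behind the confidence intervals above, here is a minimal sketch. It is one common variant (percentile intervals over resampled per-query differences), not necessarily the exact procedure used for the table: the resample count, the seed, and the helper name `paired_bootstrap_delta` are assumptions, and the hit lists are made-up.

```python
import random

def paired_bootstrap_delta(baseline_hits, model_hits, n_resamples=2000, seed=0):
    """Return (delta, ci_low, ci_high) for the difference in Recall@1.

    Each list holds one 0/1 entry per query (1 = correct skill ranked first),
    aligned so index i refers to the same query in both lists.
    """
    n = len(baseline_hits)
    if n == 0 or n != len(model_hits):
        raise ValueError("need two aligned, non-empty hit lists")
    diffs = [m - b for b, m in zip(baseline_hits, model_hits)]
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample queries with replacement; pairing is preserved because
        # we resample the per-query differences, not each system separately.
        deltas.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    deltas.sort()
    low = deltas[int(0.025 * (n_resamples - 1))]
    high = deltas[int(0.975 * (n_resamples - 1))]
    return sum(diffs) / n, low, high

# Made-up outcomes for 10 queries (not the eval data from the table).
baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
model    = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
delta, ci_low, ci_high = paired_bootstrap_delta(baseline, model)
```

An interval that contains 0, as in the adversarial row, is what the table labels "not significant".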

Full-pipeline numbers (rerank ON) — higher, but the adversarial gains are carried by a lexical reranker that is NOT part of this download:

Evaluation (rerank ON)	Metric
Human routing Recall@1	0.985
Adversarial Recall@1	0.973
Hard-negative pairwise accuracy	0.994

⚠️ These artifacts are the embedding model only. The reranker that produces the rerank-ON numbers lives in the source repository and is not included here. With this model alone you get the rerank-OFF (embedding-only) results — e.g. adversarial Recall@1 ≈ 0.82, not 0.97. The rerank-ON numbers are measured on the routing eval fixtures, and the reranker's structural rules are tuned to those fixture patterns — treat them as in-distribution diagnostics, not a guarantee of open-world generalization.

Metrics are pinned to the current ~5,897-skill corpus snapshot and must be re-derived after corpus changes. Per-pack/domain routing quality is not included here (experimental, unvalidated). Evidence reports (docs/reports/phase-1/) are in the source repository.
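Re-deriving Recall@1 after a corpus change comes down to a top-1 nearest-neighbour check per query. The sketch below assumes you already have L2-normalized embeddings for your skills and queries plus a gold skill index per query; the vectors, labels, and helper names (`top1_index`, `recall_at_1`) are made-up for illustration and are not the repository's eval harness.

```python
def top1_index(query_vec, skill_vecs):
    """Index of the skill with the highest dot product (vectors assumed L2-normalized)."""
    scores = [sum(q * s for q, s in zip(query_vec, skill)) for skill in skill_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

def recall_at_1(query_vecs, skill_vecs, gold_indices):
    """Fraction of queries whose top-ranked skill is the labelled one."""
    hits = sum(
        top1_index(q, skill_vecs) == gold
        for q, gold in zip(query_vecs, gold_indices)
    )
    return hits / len(gold_indices)

# Made-up unit vectors standing in for encoded skills and queries.
skills = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
queries = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]
gold = [2, 1, 1]  # hypothetical labels; the last one is deliberately a miss

score = recall_at_1(queries, skills, gold)
```

For a corpus of a few thousand skills a brute-force scan like this is usually adequate; a vector index only becomes necessary at larger scale.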

How these artifacts were produced

ONNX / OpenVINO: exported from the canonical safetensors with sentence-transformers (export_optimized_onnx_model for O3; export_dynamic_quantized_onnx_model for the int8 variants; backend="openvino" export plus weight-only int8 quantization for OpenVINO, since static activation PTQ collapses this model).
GGUF: converted with llama.cpp convert_hf_to_gguf.py (F16), then llama-quantize for Q8_0 / Q4_K_M.
Every derived/quantized artifact is gated by the backend parity table above before release.

Training Data

Tuned on curated skill-routing data derived from a local skill corpus — positive skill-query pairs, hard negatives, human-style routing queries, and adversarial routing examples.

Limitations

The model is only as good as the skill corpus and routing data used with it. Use it as a retrieval/ranking component, not as the final authority for tool execution. Production systems should keep top-k candidates, apply reranking, and enforce policy, safety, and permission checks downstream. Medical, legal, financial, or safety-critical use is research-only — validate on your own cases first.
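One way to read the advice above is "retrieve several candidates, then gate them before anything runs". The sketch below shows that shape only; the skill names, the `allowed` set, the `min_score` threshold, and the `route` helper are all hypothetical, and a real system would add reranking and richer policy checks between retrieval and execution.

```python
def route(query_vec, skills, allowed, k=3, min_score=0.0):
    """Return up to k (name, score) candidates the caller is permitted to run.

    skills: list of (name, unit_vector) pairs; allowed: set of permitted names.
    """
    scored = [
        (name, sum(q * s for q, s in zip(query_vec, vec)))
        for name, vec in skills
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    top_k = scored[:k]  # keep several candidates, not only the best one
    # Policy/permission gate runs after retrieval and before any execution.
    return [(n, s) for n, s in top_k if n in allowed and s >= min_score]

# Made-up catalog with 2-dim unit vectors in place of real embeddings.
catalog = [
    ("browser_automation", [1.0, 0.0]),
    ("delete_files",       [0.8, 0.6]),
    ("send_email",         [0.0, 1.0]),
]
candidates = route([0.8, 0.6], catalog, allowed={"browser_automation", "send_email"}, k=2)
```

Here the highest-scoring skill is dropped because the caller lacks permission for it, which is the point: the embedding score proposes, and downstream checks decide.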

License

MIT. Fine-tuned from intfloat/e5-base-v2 (MIT).
