Aptivra-Base-110M

A compact (110M-parameter) sentence-embedding model for skill routing and semantic retrieval, fine-tuned from intfloat/e5-base-v2. It encodes a user request and a catalog of skill/tool documents into 768-dim vectors so an agent, router, or MCP server can retrieve the most relevant candidates.

Aptivra-Base-110M is an embedding model, not a chat model.

[Figure: Aptivra skill-routing architecture]

| ✅ Use it for | ❌ Do not use it for |
|---|---|
| query / document embeddings | chat completion |
| skill routing | instruction following |
| semantic retrieval | text generation |
| vector search | (it produces vectors, not text) |
| candidate ranking | |

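Shipped in multiple runtimes: PyTorch (safetensors), ONNX, OpenVINO, and GGUF (in the companion repo raghunath1/Aptivra-Base-110M-GGUF). All of them produce 768-dim embeddings in the same space; the quantized builds match the fp32 reference approximately rather than bit-for-bit (see Backend parity).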

⚠️ Experimental: validate before production use

This is a research preview, not a validated/production-certified system. What is proven: the base router beats the intfloat/e5-base-v2 baseline at the embedding level on the clean routing eval (Recall@1 0.820 → 0.958, Δ +0.138, 95% CI [0.102, 0.175]). What is NOT validated: per-pack routing quality (domain packs are experimental, unvalidated), high-stakes packs (medical/legal/finance/… are research-only), and the governance layer (not in the served path). A wrong route can cause downstream harm even though this model does not produce the final answer, so validate behavior and evidence for your use case, and apply downstream validation, policy, safety, and permission checks before acting on a route. The full disclaimer, evidence, and gate status are in the source repository.

Input format (important)

Unlike vanilla e5, this fine-tune was trained and evaluated on plain text, with no query: / passage: prefix. Feed the raw request and the raw skill/document text. The model applies mean pooling and returns L2-normalized vectors; compare them with cosine similarity (a dot product, since they are normalized). Adding e5-style prefixes will not reproduce the validated numbers below.
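The claim that cosine similarity reduces to a dot product for L2-normalized vectors can be checked with a minimal NumPy sketch. The random vectors below are stand-ins for real embeddings (an assumption for illustration, not model output), and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against all-zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Explicit cosine similarity that does not assume normalized inputs."""
    num = a @ b.T
    den = np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :]
    return num / den

# Random stand-ins for 768-dim query and skill embeddings.
rng = np.random.default_rng(0)
queries = l2_normalize(rng.normal(size=(2, 768)))
skills = l2_normalize(rng.normal(size=(5, 768)))

dot_scores = queries @ skills.T              # what you would compute in practice
cos_scores = cosine_matrix(queries, skills)  # explicit cosine, for comparison
print(np.allclose(dot_scores, cos_scores))
```

In practice this means a plain matrix product over normalized embeddings is enough for scoring, which is also what inner-product vector indexes compute.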

Available formats

This repo (PyTorch + ONNX + OpenVINO):

| Path | Backend | Precision | Size | Best for |
|---|---|---|---|---|
| model.safetensors | PyTorch / sentence-transformers | fp32 | 438 MB | reference; training; GPU |
| onnx/model.onnx | ONNX Runtime | fp32 | 436 MB | portable CPU/GPU inference |
| onnx/model_O3.onnx | ONNX Runtime | fp32 (graph-opt O3) | 436 MB | fastest fp32 on CPU |
| onnx/model_qint8_avx2.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX2) |
| onnx/model_qint8_avx512.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX-512) |
| onnx/model_qint8_avx512_vnni.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX-512 VNNI, e.g. Ice Lake+) |
| onnx/model_qint8_arm64.onnx | ONNX Runtime | int8 | 110 MB | ARM CPU (Apple Silicon, AWS Graviton) |
| openvino/openvino_model.xml (+.bin) | OpenVINO | fp32 | 436 MB | Intel CPU/iGPU/NPU |
| openvino/openvino_model_qint8.xml (+.bin) | OpenVINO | int8 (weight-only) | 110 MB | Intel CPU, smaller |

GGUF (llama.cpp) lives in the companion repo raghunath1/Aptivra-Base-110M-GGUF: F16, Q8_0, Q4_K_M.

Backend parity (Recall@1)

Identical 400-query routing eval, ~5,897-skill corpus, plain text, top-1 retrieval. Every backend receives the same (512-token-truncated) input, so any delta is purely a backend/quantization effect. safetensors is the reference.

| Backend | Precision | Fidelity vs fp32 (mean cosine) | Recall@1 |
|---|---|---|---|
| safetensors (PyTorch) | fp32 | 1.00000 (reference) | 0.958 |
| ONNX (model.onnx, model_O3.onnx) | fp32 | 1.00000 | ≡ reference |
| ONNX int8 (*_qint8_*) | int8 | 0.98957 ¹ | ≈ reference |
| OpenVINO (openvino_model) | fp32 | 0.99986 | ≡ reference |
| OpenVINO int8 (openvino_model_qint8) | int8 (weight-only) | 0.99938 | ≡ reference |
| GGUF F16 | F16 | 0.99999 | ≡ reference |
| GGUF Q8_0 | Q8_0 | 0.99984 | ≡ reference |
| GGUF Q4_K_M | Q4_K_M | 0.98618 | ≈ reference |

Method: each backend encodes the identical plain-text routing eval; fidelity = mean cosine of its embeddings to the fp32 reference. When fidelity is ≈ 1.0, the top-1 ranking, and therefore Recall@1, is expected to match the reference; the int8 / Q4 rows perturb embeddings by ~1–1.4% in cosine terms (trading a little accuracy for size/speed).

¹ Measured on the arm64 int8 build (Apple Silicon). The avx2 / avx512 / avx512_vnni files use the same dynamic-int8 recipe for the respective x86 instruction sets.

OpenVINO int8 is weight-only quantization (weights int8, activations kept in floating point), chosen because static activation PTQ collapses this embedding model. Weight-only preserves fidelity (0.99938) at the int8 size (110 MB).

int8 / Q4 are size-and-speed tradeoffs. The table is the honest record of what each precision costs in retrieval accuracy: pick the smallest one whose Recall@1 you can live with.
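To check a backend on your own texts, the fidelity metric described in the Method note (mean cosine to the fp32 reference) can be sketched as follows. The function name and the synthetic "quantization noise" are assumptions for illustration; in real use both arrays would come from encoding the same texts, in the same order, with two backends.

```python
import numpy as np

def mean_cosine_fidelity(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Mean row-wise cosine between reference and candidate embeddings."""
    if reference.shape != candidate.shape:
        raise ValueError("both backends must encode the same texts in the same order")
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    cand = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    return float(np.mean(np.sum(ref * cand, axis=1)))

# Synthetic stand-ins: "fp32" embeddings and a slightly perturbed copy.
rng = np.random.default_rng(0)
fp32 = rng.normal(size=(100, 768))
perturbed = fp32 + 0.1 * rng.normal(size=fp32.shape)

print(mean_cosine_fidelity(fp32, fp32))       # identical embeddings
print(mean_cosine_fidelity(fp32, perturbed))  # slightly below the identical case
```

Fidelity alone does not guarantee identical rankings, so it is worth pairing this check with a Recall@1 run on your own eval set.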

Usage

Works the same on Windows, Linux, and macOS unless noted. Examples assume Python 3.9+.

1. Python: sentence-transformers (safetensors, recommended reference)

# Install (shell): pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("raghunath1/Aptivra-Base-110M")          # PyTorch
queries = ["set up a browser automation task"]
skills  = ["Automate a web browser to click, type, and navigate pages."]

q = model.encode(queries, normalize_embeddings=True)
s = model.encode(skills,  normalize_embeddings=True)
print((q @ s.T))   # cosine similarity

2. ONNX Runtime

Via sentence-transformers (picks the right CPU kernel automatically):

# Install (shell): pip install "sentence-transformers[onnx]"        (CPU)
#              or: pip install "sentence-transformers[onnx-gpu]"    (NVIDIA GPU)
from sentence_transformers import SentenceTransformer

# fp32:
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="onnx")

# int8: choose the file matching your CPU:
#   x86 with AVX-512 VNNI (e.g. Ice Lake+): onnx/model_qint8_avx512_vnni.onnx
#   other x86: onnx/model_qint8_avx512.onnx, or onnx/model_qint8_avx2.onnx on older CPUs
#   ARM (Apple Silicon, Graviton): onnx/model_qint8_arm64.onnx
model = SentenceTransformer(
    "raghunath1/Aptivra-Base-110M", backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx",
                  "provider": "CPUExecutionProvider"},   # on macOS, pin CPU to avoid CoreML EP
)
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Raw onnxruntime (no sentence-transformers), with manual mean-pool + L2-norm:

import numpy as np, onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("raghunath1/Aptivra-Base-110M")
sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
enc = tok(["semantic search query"], padding=True, truncation=True, max_length=512, return_tensors="np")
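# Feed only the inputs this graph declares (token_type_ids may or may not be one of them).
feed = {i.name: enc[i.name] for i in sess.get_inputs() if i.name in enc}
out = sess.run(None, feed)[0]   # token embeddings: (batch, seq_len, 768)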
mask = enc["attention_mask"][..., None]
emb = (out * mask).sum(1) / np.clip(mask.sum(1), 1e-9, None)   # mean pool
emb /= np.linalg.norm(emb, axis=1, keepdims=True)             # L2 normalize
  • Windows: onnxruntime ships prebuilt wheels; the avx512_vnni int8 file is fastest on recent Intel.
  • Linux: same; on AWS Graviton / ARM servers use the arm64 int8 file.
  • macOS (Apple Silicon): use the arm64 int8 file and pin CPUExecutionProvider (the CoreML EP can fail to build this graph).
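The per-platform advice above can be folded into a small selection helper. This is a sketch under stated assumptions: `pick_int8_onnx_file` is a hypothetical name, the caller supplies the x86 feature flags (for example parsed from /proc/cpuinfo on Linux, where they appear as `avx512f` and `avx512_vnni`), and an x86 CPU without reported flags is assumed to support at least AVX2.

```python
import platform

def pick_int8_onnx_file(machine: str, cpu_flags: frozenset = frozenset()) -> str:
    """Map a CPU architecture (plus optional x86 feature flags) to an int8 file."""
    arch = machine.lower()
    if arch in ("arm64", "aarch64"):
        return "onnx/model_qint8_arm64.onnx"
    if arch in ("x86_64", "amd64"):
        if "avx512_vnni" in cpu_flags:
            return "onnx/model_qint8_avx512_vnni.onnx"
        if "avx512f" in cpu_flags:
            return "onnx/model_qint8_avx512.onnx"
        return "onnx/model_qint8_avx2.onnx"   # conservative x86 default (assumes AVX2)
    return "onnx/model.onnx"                  # unknown architecture: fall back to fp32

# platform.machine() reports e.g. "x86_64", "AMD64", "arm64", or "aarch64".
print(pick_int8_onnx_file(platform.machine()))
```

The returned path can be passed as the `file_name` entry in `model_kwargs`, as in the sentence-transformers example earlier in this section.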

3. Transformers.js (browser / Node.js)

import { pipeline } from '@huggingface/transformers';
const extractor = await pipeline('feature-extraction', 'raghunath1/Aptivra-Base-110M',
  { dtype: 'q8' });                       // uses the ONNX int8 weights
const emb = await extractor(['semantic search query'],
  { pooling: 'mean', normalize: true });

Runs in-browser (WebAssembly/WebGPU) and in Node on Windows/Linux/macOS with the same code.

4. OpenVINO (Intel CPU / iGPU / NPU)

# Install (shell): pip install "sentence-transformers[openvino]"
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="openvino")
# int8: model_kwargs={"file_name": "openvino_model_qint8.xml"}
emb = model.encode(["semantic search query"], normalize_embeddings=True)

Best on Intel hardware (Core/Xeon, Arc, NPU). The Python API is identical on Windows/Linux/macOS; the OpenVINO runtime wheel is installed automatically.

5. llama.cpp / GGUF

GGUF builds are in raghunath1/Aptivra-Base-110M-GGUF. Use embedding mode with mean pooling + L2 normalize:

llama-embedding -m Aptivra-Base-110M-Q8_0.gguf -p "semantic search query" \
  --pooling mean --embd-normalize 2

LM Studio / Ollama caveat: these tools are built around chat/completion models. This is an embedding model, so use it only through an embeddings endpoint (e.g. llama-server → POST /v1/embeddings, or Ollama's /api/embeddings), not the chat UI. It will not generate text.
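As an illustration of the embeddings-endpoint route, here is a minimal standard-library client for an OpenAI-style POST /v1/embeddings endpoint. It is a sketch, assuming a llama-server instance started in embedding mode on http://localhost:8080 and a response of the form {"data": [{"index": ..., "embedding": [...]}]}; the helper names and the "aptivra" model label are hypothetical.

```python
import json
import urllib.request

def build_embeddings_request(base_url: str, texts: list, model: str = "aptivra") -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/embeddings endpoint."""
    payload = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings_response(body: bytes) -> list:
    """Return one embedding per input text, ordered by the reported index."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d.get("index", 0))]

def embed(base_url: str, texts: list) -> list:
    """Send texts to the server and return their embeddings (needs a running server)."""
    request = build_embeddings_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embeddings_response(response.read())

# Example call, assuming a local server is running:
# vectors = embed("http://localhost:8080", ["semantic search query"])
```

Whether the returned vectors are already pooled and normalized depends on how the server was started, so it is safest to L2-normalize them again before computing dot products.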

Evaluation

The proven claim (embedding-level lift over baseline). Measured with the reranker OFF, isolating the embedding's own contribution, against intfloat/e5-base-v2, with a paired bootstrap, on the current corpus:

| Eval set (rerank OFF) | Baseline | Aptivra | Δ Recall@1 | 95% CI |
|---|---|---|---|---|
| Human routing | 0.820 | 0.958 | +0.138 | [0.102, 0.175] |
| Weak-pair (noisy) | 0.654 | 0.701 | +0.047 | [0.040, 0.054] |
| Adversarial | 0.784 | 0.817 | +0.033 | [-0.006, 0.071] (not significant) |
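For readers who want to compute a comparable interval on their own eval set, a generic paired bootstrap over per-query hits can be sketched as below. This is not the exact procedure behind the table (the resample count, seed, and percentile method are assumptions), and the hit vectors are synthetic.

```python
import numpy as np

def paired_bootstrap_ci(base_hits, model_hits, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-query difference in hits (model minus baseline)."""
    base = np.asarray(base_hits, dtype=float)
    model = np.asarray(model_hits, dtype=float)
    if base.shape != model.shape:
        raise ValueError("both systems must be scored on the same queries")
    diff = model - base                       # paired: one difference per query
    rng = np.random.default_rng(seed)
    # Resample query indices with replacement, keeping each pair together.
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    deltas = diff[idx].mean(axis=1)
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)

# Synthetic top-1 hit indicators for 400 queries (not the real eval data).
rng = np.random.default_rng(1)
base_hits = rng.random(400) < 0.82
model_hits = rng.random(400) < 0.96
delta, lo, hi = paired_bootstrap_ci(base_hits, model_hits)
print(f"delta={delta:+.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that includes zero, as in the Adversarial row, means the observed lift cannot be distinguished from resampling noise at that confidence level.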

Full-pipeline numbers (rerank ON) are higher, but the adversarial gains are carried by a lexical reranker that is NOT part of this download:

| Metric (rerank ON) | Value |
|---|---|
| Human routing Recall@1 | 0.985 |
| Adversarial Recall@1 | 0.973 |
| Hard-negative pairwise accuracy | 0.994 |

⚠️ These artifacts are the embedding model only. The reranker that produces the rerank-ON numbers lives in the source repository and is not included here. With this model alone you get the rerank-OFF (embedding-only) results, e.g. adversarial Recall@1 ≈ 0.82, not 0.97. The rerank-ON numbers are measured on the routing eval fixtures, and the reranker's structural rules are tuned to those fixture patterns, so treat them as in-distribution diagnostics, not a guarantee of open-world generalization.

Metrics are pinned to the current ~5,897-skill corpus snapshot and must be re-derived after corpus changes. Per-pack/domain routing quality is not included here (experimental, unvalidated). Evidence reports (docs/reports/phase-1/) are in the source repository.
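Re-deriving Recall@1 after a corpus change needs only the query embeddings, the skill embeddings, and the index of the correct skill for each query. The sketch below assumes one gold skill per query and L2-normalized inputs; the function name and the synthetic data are illustrative, not the project's eval harness.

```python
import numpy as np

def recall_at_k(query_emb: np.ndarray, skill_emb: np.ndarray, gold_idx, k: int = 1) -> float:
    """Fraction of queries whose gold skill appears among the k highest-scoring skills."""
    scores = query_emb @ skill_emb.T                 # cosine, given normalized inputs
    topk = np.argsort(-scores, axis=1)[:, :k]        # best-first skill indices per query
    hits = (topk == np.asarray(gold_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic corpus: each query is a lightly perturbed copy of its gold skill vector.
rng = np.random.default_rng(0)
skills = rng.normal(size=(50, 768))
skills /= np.linalg.norm(skills, axis=1, keepdims=True)
gold = rng.integers(0, 50, size=20)
queries = skills[gold] + 0.01 * rng.normal(size=(20, 768))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

print(recall_at_k(queries, skills, gold, k=1))
```

With real data, `queries` and `skills` would be the outputs of `model.encode(..., normalize_embeddings=True)` on the eval queries and the current corpus.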

How these artifacts were produced

  • ONNX / OpenVINO: exported from the canonical safetensors with sentence-transformers (export_optimized_onnx_model for O3; export_dynamic_quantized_onnx_model for the int8 variants; backend="openvino" plus weight-only int8 quantization for OpenVINO, because static activation PTQ collapses this model).
  • GGUF: converted with llama.cpp convert_hf_to_gguf.py (F16), then llama-quantize for Q8_0 / Q4_K_M.
  • Every derived/quantized artifact is gated by the backend parity table above before release.

Training Data

Tuned on curated skill-routing data derived from a local skill corpus: positive skill-query pairs, hard negatives, human-style routing queries, and adversarial routing examples.

Limitations

The model is only as good as the skill corpus and routing data used with it. Use it as a retrieval/ranking component, not as the final authority for tool execution. Production systems should keep top-k candidates, apply reranking, and enforce policy, safety, and permission checks downstream. Medical, legal, financial, or safety-critical use is research-only; validate on your own cases first.
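The "keep top-k candidates, then check downstream" pattern can be sketched as follows. The `shortlist` helper, the toy 3-dim vectors, and the `allowed` permission set are all hypothetical; a real system would plug in its own reranker and policy layer after this step.

```python
import numpy as np

def shortlist(query_vec, skill_vecs, skill_ids, k=5, allowed=None):
    """Return up to k (skill_id, score) pairs, best first, optionally filtered by permission."""
    scores = skill_vecs @ query_vec                  # cosine, given normalized inputs
    order = np.argsort(-scores)[:k]                  # indices of the k best skills
    ranked = [(skill_ids[i], float(scores[i])) for i in order]
    if allowed is not None:
        # Permission check happens after retrieval; the embedding never decides access.
        ranked = [(sid, s) for sid, s in ranked if sid in allowed]
    return ranked

# Toy example with 3-dim unit vectors standing in for 768-dim embeddings.
skill_ids = ["browser", "email", "calendar"]
skill_vecs = np.eye(3)
query_vec = np.array([0.9, 0.4, 0.1])
query_vec = query_vec / np.linalg.norm(query_vec)

print(shortlist(query_vec, skill_vecs, skill_ids, k=2))
print(shortlist(query_vec, skill_vecs, skill_ids, k=3, allowed={"email", "calendar"}))
```

Returning scored candidates rather than a single winner leaves room for a reranker or a human-in-the-loop step to reject a route before any tool is executed.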

License

MIT. Fine-tuned from intfloat/e5-base-v2 (MIT).
