Aptivra-Base-110M
A compact (110M-parameter) sentence-embedding model for skill routing and semantic
retrieval, fine-tuned from intfloat/e5-base-v2.
It encodes a user request and a catalog of skill/tool documents into 768-dim vectors so an agent,
router, or MCP server can retrieve the most relevant candidates.
Aptivra-Base-110M is an embedding model, not a chat model.
| Use it for | Do not use it for |
|---|---|
| query / document embeddings | chat completion |
| skill routing | instruction following |
| semantic retrieval | text generation |
| vector search | (it produces vectors, not text) |
| candidate ranking | |
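As a rough illustration of the retrieval step described above, the sketch below ranks a small skill catalog against a query by dot product. It assumes the vectors are already unit-length; the skill names, the 3-dim toy vectors, and the helper `top_k_skills` are hypothetical stand-ins for real 768-dim embeddings.

```python
import heapq

def top_k_skills(query_vec, skill_vecs, k=3):
    """Return the k best (skill_id, score) pairs by dot product, best first."""
    scored = (
        (sum(q * s for q, s in zip(query_vec, vec)), skill_id)
        for skill_id, vec in skill_vecs.items()
    )
    return [(skill_id, score) for score, skill_id in heapq.nlargest(k, scored)]

# Toy unit-length vectors (hypothetical); real embeddings have 768 dimensions.
catalog = {
    "browser-automation": [1.0, 0.0, 0.0],
    "pdf-export": [0.0, 1.0, 0.0],
    "email-send": [0.6, 0.8, 0.0],
}
query = [0.8, 0.6, 0.0]

for skill_id, score in top_k_skills(query, catalog, k=2):
    print(f"{skill_id}: {score:.2f}")
```

Returning several candidates rather than a single winner leaves room for a downstream reranker or policy check to make the final choice.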
Shipped in multiple runtimes: PyTorch (safetensors), ONNX, OpenVINO, and GGUF (in the
companion repo raghunath1/Aptivra-Base-110M-GGUF),
all producing the same 768-dim embedding.
⚠️ Experimental: validate before production use
This is a research preview, not a validated/production-certified system. What is proven: the base router beats the
intfloat/e5-base-v2 baseline at the embedding level on the clean routing eval (Recall@1 0.820 → 0.958, Δ +0.138, 95% CI [0.102, 0.175]). What is NOT validated: per-pack routing quality (domain packs are experimental, unvalidated), high-stakes packs (medical/legal/finance/… are research-only), and the governance layer (not in the served path). A wrong route can cause downstream harm even though this model does not produce the final answer. Validate behavior and evidence for your use case, and apply downstream validation, policy, safety, and permission checks before acting on a route. Full disclaimer, evidence, and gate status are in the source repository.
Input format (important)
Unlike vanilla e5, this fine-tune was trained and evaluated on plain text, with no query: /
passage: prefix. Feed the raw request and the raw skill/document text. The model applies mean
pooling and returns L2-normalized vectors; compare them with cosine similarity (a dot
product, since they are normalized). Adding e5-style prefixes will not reproduce the validated
numbers below.
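To make the "cosine similarity is a dot product" remark concrete, here is a minimal sketch in plain Python. The 3-dim vectors and the helper names are illustrative assumptions, not model outputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors standing in for a query and a skill embedding.
query = l2_normalize([0.3, -1.2, 0.8])
skill = l2_normalize([0.1, -0.9, 1.1])

# Once both vectors are unit-length, the plain dot product is the cosine.
print(dot(query, skill), cosine(query, skill))
```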
Available formats
This repo (PyTorch + ONNX + OpenVINO):
| Path | Backend | Precision | Size | Best for |
|---|---|---|---|---|
| model.safetensors | PyTorch / sentence-transformers | fp32 | 438 MB | reference; training; GPU |
| onnx/model.onnx | ONNX Runtime | fp32 | 436 MB | portable CPU/GPU inference |
| onnx/model_O3.onnx | ONNX Runtime | fp32 (graph-opt O3) | 436 MB | fastest fp32 on CPU |
| onnx/model_qint8_avx2.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX2) |
| onnx/model_qint8_avx512.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX-512) |
| onnx/model_qint8_avx512_vnni.onnx | ONNX Runtime | int8 | 110 MB | x86 CPU (AVX-512 VNNI, e.g. Ice Lake+) |
| onnx/model_qint8_arm64.onnx | ONNX Runtime | int8 | 110 MB | ARM CPU (Apple Silicon, AWS Graviton) |
| openvino/openvino_model.xml (+.bin) | OpenVINO | fp32 | 436 MB | Intel CPU/iGPU/NPU |
| openvino/openvino_model_qint8.xml (+.bin) | OpenVINO | int8 (weight-only) | 110 MB | Intel CPU, smaller |
GGUF (llama.cpp) lives in the companion repo
raghunath1/Aptivra-Base-110M-GGUF:
F16, Q8_0, Q4_K_M.
Backend parity (Recall@1)
Identical 400-query routing eval, ~5,897-skill corpus, plain text, top-1 retrieval. Every
backend receives the same (512-token-truncated) input, so deltas are pure backend/quantization
effect. safetensors is the reference.
| Backend | Precision | Fidelity vs fp32 (mean cosine) | Recall@1 |
|---|---|---|---|
| safetensors (PyTorch) | fp32 | 1.00000 (reference) | 0.958 |
| ONNX (model.onnx, model_O3.onnx) | fp32 | 1.00000 | ≡ reference |
| ONNX int8 (*_qint8_*) | int8 | 0.98957 ¹ | ≈ reference |
| OpenVINO (openvino_model) | fp32 | 0.99986 | ≡ reference |
| OpenVINO int8 (openvino_model_qint8) | int8 (weight-only) | 0.99938 | ≡ reference |
| GGUF F16 | F16 | 0.99999 | ≡ reference |
| GGUF Q8_0 | Q8_0 | 0.99984 | ≡ reference |
| GGUF Q4_K_M | Q4_K_M | 0.98618 | ≈ reference |
Method: each backend encodes the identical plain-text routing eval; fidelity = mean cosine of its embeddings to the fp32 reference. When fidelity ≈ 1.0, Recall@1 equals the reference by construction; the int8 / Q4 rows perturb embeddings by ~1 to 1.4% (trading a little accuracy for size/speed).
¹ Measured on the arm64 int8 build (Apple Silicon). The avx2 / avx512 / avx512_vnni files use the same dynamic-int8 recipe for the respective x86 instruction sets.
OpenVINO int8 is weight-only quantization (weights int8, activations fp), chosen because static activation PTQ collapses this embedding model. Weight-only preserves fidelity (0.99938) at the int8 size (110 MB).
int8 / Q4 are size-and-speed tradeoffs. The table is the honest record of what each precision costs in retrieval accuracy; pick the smallest one whose Recall@1 you can live with.
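The fidelity column is described as the mean cosine between a backend's embeddings and the fp32 reference for the same inputs. A minimal sketch of that calculation follows. The function name `mean_cosine_fidelity` and the toy vectors are assumptions for illustration, not the project's evaluation code.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_cosine_fidelity(reference, candidate):
    """Average cosine between row i of the reference and row i of the candidate."""
    if len(reference) != len(candidate):
        raise ValueError("both backends must encode the same inputs, in the same order")
    return sum(cosine(r, c) for r, c in zip(reference, candidate)) / len(reference)

# Hypothetical toy embeddings: fp32 reference vs. a slightly perturbed backend.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
quantized = [[0.99, 0.05, 0.0], [0.02, 0.98, -0.03]]

print(f"fidelity: {mean_cosine_fidelity(reference, quantized):.5f}")
```

Running the same check on your own queries and skill texts is a cheap way to confirm that a quantized file behaves acceptably before swapping it in.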
Usage
Works the same on Windows, Linux, and macOS unless noted. Examples assume Python 3.9+.
1. Python: sentence-transformers (safetensors, recommended reference)
pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("raghunath1/Aptivra-Base-110M") # PyTorch
queries = ["set up a browser automation task"]
skills = ["Automate a web browser to click, type, and navigate pages."]
q = model.encode(queries, normalize_embeddings=True)
s = model.encode(skills, normalize_embeddings=True)
print((q @ s.T)) # cosine similarity
2. ONNX Runtime
Via sentence-transformers (picks the right CPU kernel automatically):
pip install "sentence-transformers[onnx]" # CPU
# pip install "sentence-transformers[onnx-gpu]" # NVIDIA GPU
from sentence_transformers import SentenceTransformer
# fp32:
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="onnx")
# int8: choose the file matching your CPU:
# x86 (most Intel/AMD): onnx/model_qint8_avx512_vnni.onnx (or _avx2 on older CPUs)
# ARM (Apple Silicon, Graviton): onnx/model_qint8_arm64.onnx
model = SentenceTransformer(
"raghunath1/Aptivra-Base-110M", backend="onnx",
model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx",
"provider": "CPUExecutionProvider"}, # on macOS, pin CPU to avoid CoreML EP
)
emb = model.encode(["semantic search query"], normalize_embeddings=True)
Raw onnxruntime (no sentence-transformers), with manual mean-pool + L2-norm:
import numpy as np, onnxruntime as ort
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("raghunath1/Aptivra-Base-110M")
sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
enc = tok(["semantic search query"], padding=True, truncation=True, max_length=512, return_tensors="np")
out = sess.run(None, {k: enc[k] for k in ("input_ids","attention_mask","token_type_ids") if k in enc})[0]
mask = enc["attention_mask"][..., None]
emb = (out * mask).sum(1) / np.clip(mask.sum(1), 1e-9, None) # mean pool
emb /= np.linalg.norm(emb, axis=1, keepdims=True) # L2 normalize
- Windows: onnxruntime ships prebuilt wheels; the avx512_vnni int8 file is fastest on recent Intel.
- Linux: same; on AWS Graviton / ARM servers use the arm64 int8 file.
- macOS (Apple Silicon): use the arm64 int8 file and pin CPUExecutionProvider (the CoreML EP can fail to build this graph).
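The platform notes above can be folded into a small selection helper. This is only a sketch: the function `pick_int8_onnx_file` and its flags are hypothetical, and AVX-512 support is passed in by the caller because the standard library offers no portable way to detect it. Falling back to the fp32 file on unrecognized architectures is also an assumption.

```python
import platform

def pick_int8_onnx_file(machine=None, has_avx512_vnni=False, has_avx512=False):
    """Map a CPU architecture string to one of the ONNX files listed in this repo."""
    machine = (machine or platform.machine()).lower()
    if machine in ("arm64", "aarch64"):
        return "onnx/model_qint8_arm64.onnx"
    if machine in ("x86_64", "amd64"):
        if has_avx512_vnni:
            return "onnx/model_qint8_avx512_vnni.onnx"
        if has_avx512:
            return "onnx/model_qint8_avx512.onnx"
        return "onnx/model_qint8_avx2.onnx"
    # Unknown architecture: assume the portable fp32 graph is the safe choice.
    return "onnx/model.onnx"

print(pick_int8_onnx_file())
```

The returned path could then be passed as the `file_name` entry in `model_kwargs`, as in the sentence-transformers example earlier in this section.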
3. Transformers.js (browser / Node.js)
import { pipeline } from '@huggingface/transformers';
const extractor = await pipeline('feature-extraction', 'raghunath1/Aptivra-Base-110M',
{ dtype: 'q8' }); // uses the ONNX int8 weights
const emb = await extractor(['semantic search query'],
{ pooling: 'mean', normalize: true });
Runs in-browser (WebAssembly/WebGPU) and in Node on Windows/Linux/macOS, with the same code.
4. OpenVINO (Intel CPU / iGPU / NPU)
pip install "sentence-transformers[openvino]"
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("raghunath1/Aptivra-Base-110M", backend="openvino")
# int8: model_kwargs={"file_name": "openvino_model_qint8.xml"}
emb = model.encode(["semantic search query"], normalize_embeddings=True)
Best on Intel hardware (Core/Xeon, Arc, NPU). The Python API is identical on Windows/Linux/macOS; the OpenVINO runtime wheel is installed automatically.
5. llama.cpp / GGUF
GGUF builds are in raghunath1/Aptivra-Base-110M-GGUF.
Use embedding mode with mean pooling + L2 normalize:
llama-embedding -m Aptivra-Base-110M-Q8_0.gguf -p "semantic search query" \
--pooling mean --embd-normalize 2
LM Studio / Ollama caveat: these tools are built around chat/completion models. This is an embedding model, so use it only through an embeddings endpoint (e.g.
llama-server's POST /v1/embeddings, or Ollama's /api/embeddings), not the chat UI. It will not generate text.
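For the embeddings-endpoint route, a client call might look like the sketch below. It assumes a llama-server instance started in embedding mode and listening on `http://localhost:8080`, and an OpenAI-style response in which each item under `data` carries an `index` and an `embedding`. The helper names are hypothetical; check the request and response shapes against your server version.

```python
import json
import urllib.request

def build_embeddings_request(texts, base_url="http://localhost:8080"):
    """Build a POST request for an OpenAI-style /v1/embeddings endpoint."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(payload):
    """Extract vectors from the assumed response shape, in input order."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# With a server running (not executed here):
# with urllib.request.urlopen(build_embeddings_request(["semantic search query"])) as resp:
#     vectors = parse_embeddings(json.load(resp))
```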
Evaluation
The proven claim (embedding-level lift over baseline). Measured with the reranker OFF
(isolating the embedding's own contribution) against intfloat/e5-base-v2, paired bootstrap,
on the current corpus:
| Eval set (rerank OFF) | baseline | Aptivra | Δ Recall@1 | 95% CI |
|---|---|---|---|---|
| Human routing | 0.820 | 0.958 | +0.138 | [0.102, 0.175] |
| Weak-pair (noisy) | 0.654 | 0.701 | +0.047 | [0.040, 0.054] |
| Adversarial | 0.784 | 0.817 | +0.033 | [-0.006, 0.071] (not significant) |
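The table reports a Recall@1 difference with a paired-bootstrap confidence interval. The sketch below shows one common way to compute such an interval (a percentile bootstrap over per-query hit differences). The resample count, the percentile method, the seed, and the synthetic hit lists are all assumptions; the source does not spell out the exact procedure.

```python
import random

def recall_at_1(hits):
    """hits[i] is 1 if the top-1 result for query i was correct, else 0."""
    return sum(hits) / len(hits)

def paired_bootstrap_ci(baseline_hits, model_hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the Recall@1 difference, resampling queries in pairs."""
    if len(baseline_hits) != len(model_hits):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    n = len(baseline_hits)
    diffs = [m - b for m, b in zip(model_hits, baseline_hits)]
    deltas = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()
    lower = deltas[int(n_resamples * alpha / 2)]
    upper = deltas[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Synthetic per-query outcomes (200 queries), not the real eval data.
baseline_hits = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] * 20
model_hits = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1] * 20

delta = recall_at_1(model_hits) - recall_at_1(baseline_hits)
lower, upper = paired_bootstrap_ci(baseline_hits, model_hits)
print(f"delta={delta:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

An interval that includes zero, as in the adversarial row, means the difference cannot be distinguished from no improvement at that confidence level.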
Full-pipeline numbers (rerank ON) are higher, but the adversarial gains are carried by a lexical reranker that is NOT part of this download:
| Evaluation (rerank ON) | Metric |
|---|---|
| Human routing Recall@1 | 0.985 |
| Adversarial Recall@1 | 0.973 |
| Hard-negative pairwise accuracy | 0.994 |
⚠️ These artifacts are the embedding model only. The reranker that produces the rerank-ON numbers lives in the source repository and is not included here. With this model alone you get the rerank-OFF (embedding-only) results, e.g. adversarial Recall@1 ≈ 0.82, not 0.97. The rerank-ON numbers are measured on the routing eval fixtures, and the reranker's structural rules are tuned to those fixture patterns; treat them as in-distribution diagnostics, not a guarantee of open-world generalization.
Metrics are pinned to the current ~5,897-skill corpus snapshot and must be re-derived after corpus
changes. Per-pack/domain routing quality is not included here (experimental, unvalidated).
Evidence reports (docs/reports/phase-1/) are in the source repository.
How these artifacts were produced
- ONNX / OpenVINO: exported from the canonical safetensors with sentence-transformers (export_optimized_onnx_model for O3; export_dynamic_quantized_onnx_model for the int8 variants; backend="openvino" + static int8 PTQ for OpenVINO).
- GGUF: converted with llama.cpp convert_hf_to_gguf.py (F16), then llama-quantize for Q8_0 / Q4_K_M.
- Every derived/quantized artifact is gated by the backend parity table above before release.
Training Data
Tuned on curated skill-routing data derived from a local skill corpus: positive skill-query pairs, hard negatives, human-style routing queries, and adversarial routing examples.
Limitations
The model is only as good as the skill corpus and routing data used with it. Use it as a retrieval/ranking component, not as the final authority for tool execution. Production systems should keep top-k candidates, apply reranking, and enforce policy, safety, and permission checks downstream. Medical, legal, financial, or safety-critical use is research-only; validate on your own cases first.
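One way to read the "keep top-k, then enforce checks" advice is as a gate between retrieval and execution. The sketch below is purely illustrative: `gate_candidates`, the allow-list, the skill names, and the `min_score` threshold of 0.5 are hypothetical and would need tuning and real policy logic in practice.

```python
def gate_candidates(candidates, allowed_skills, k=3, min_score=0.5):
    """Keep the top-k (skill_id, score) pairs, then drop any the caller may not use
    or whose similarity falls below a hypothetical confidence threshold."""
    top = sorted(candidates, key=lambda item: item[1], reverse=True)[:k]
    return [
        (skill_id, score)
        for skill_id, score in top
        if skill_id in allowed_skills and score >= min_score
    ]

# Hypothetical retriever output and per-user permissions.
candidates = [
    ("shell-exec", 0.91),
    ("browser-automation", 0.88),
    ("pdf-export", 0.42),
    ("email-send", 0.30),
]
allowed = {"browser-automation", "pdf-export", "email-send"}

approved = gate_candidates(candidates, allowed)
print(approved)
```

An empty result is a signal to fall back (for example, ask the user to clarify) rather than to execute the highest-scoring skill anyway.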
License
MIT. Fine-tuned from intfloat/e5-base-v2 (MIT).
