# intelli-embed-v2

The best-performing local embedding model for GraphRAG and memory-augmented AI applications.

Built for apps that store, retrieve, and deduplicate personal memories in graph databases, intelli-embed-v2 achieves 98% of Azure text-embedding-3-small quality while running entirely on-device at ~10 ms per embedding. No API calls, no rate limits, no data leaving your infrastructure.

| Metric | Value |
|---|---|
| Sep (SW-engineering, 20 pairs) | 0.484 (GOOD) |
| vs Azure text-embedding-3-large | 94% |
| vs Azure text-embedding-3-small | 98% |
| Inference p50 (INT8 ONNX, CPU) | ~10 ms |
| Embedding dimension | 1024 |
| Max sequence length | 512 tokens |
| Model size (fp32 safetensors) | 2.17 GB |
| Model size (INT8 ONNX) | 542 MB |

## Training

Fine-tuned from Snowflake/snowflake-arctic-embed-l-v2.0 using a three-phase curriculum that distills knowledge from Azure text-embedding-3-large across 200k real-world sentences.

The three phases (a minimal loss-setup sketch follows the table):

| Phase | Loss | Data | Duration |
|---|---|---|---|
| 1 | GISTEmbedLoss (mxbai as teacher) | 100k SW-engineering pairs | ~1.6 h |
| 2 | MSE distillation from azure-large embeddings | 200k sentences | ~9 min |
| 3 | Hard-negative MultipleNegativesRankingLoss | 7,107 mined triplets | ~77 s |
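
For orientation, here is how the loss setup could look in sentence-transformers. The loss classes match the table above and are real sentence-transformers APIs; the guide-model checkpoint and everything about datasets and trainer configuration are illustrative assumptions, not the actual recipe.

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# Phase 1: GIST contrastive loss; a guide ("teacher") model filters false
# in-batch negatives. The exact mxbai checkpoint is assumed here.
guide = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
phase1_loss = losses.GISTEmbedLoss(model, guide)

# Phase 2: MSE regression of student embeddings onto precomputed
# azure text-embedding-3-large vectors (labels = teacher embeddings).
phase2_loss = losses.MSELoss(model)

# Phase 3: in-batch contrastive loss over mined
# (anchor, positive, hard negative) triplets.
phase3_loss = losses.MultipleNegativesRankingLoss(model)
```

Each phase runs as its own training pass over the corresponding dataset; trainer configuration is omitted here.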

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v2")
embeddings = model.encode(["Hello world", "Another sentence"])
print(embeddings.shape)  # (2, 1024)
```
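
Since this is a plain sentence-transformers checkpoint, the built-in `similarity` helper (cosine by default) can score a query against candidate memories. A usage sketch with made-up strings, reusing `model` from above:

```python
query = model.encode(["Where does the user deploy services?"])
memories = model.encode([
    "User deploys everything to a self-hosted k3s cluster",
    "User prefers dark roast coffee",
])
print(model.similarity(query, memories))  # cosine scores, shape (1, 2)
```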

### With ONNX Runtime (INT8, recommended for CPU inference)

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("serhiiseletskyi/intelli-embed-v2")
session = ort.InferenceSession("onnx/model_quantized.onnx")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="np")
    out = session.run(None, {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})[0]
    # mean pool + L2 normalize
    mask = enc["attention_mask"][..., None].astype(np.float32)
    pooled = (out * mask).sum(1) / mask.sum(1)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

vecs = embed(["Hello world", "Another sentence"])
print(vecs.shape)  # (2, 1024)
```

### With onnxruntime-node (Node.js / TypeScript)

```typescript
import * as ort from "onnxruntime-node";
import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("serhiiseletskyi/intelli-embed-v2");
const session = await ort.InferenceSession.create("onnx/model_quantized.onnx");

async function embed(texts: string[]): Promise<number[][]> {
  const enc = await tokenizer(texts, { padding: true, truncation: true, max_length: 512 });
  const inputIds = new ort.Tensor("int64", enc.input_ids.data, enc.input_ids.dims);
  const attentionMask = new ort.Tensor("int64", enc.attention_mask.data, enc.attention_mask.dims);
  const result = await session.run({ input_ids: inputIds, attention_mask: attentionMask });
  // Mean-pool last_hidden_state over non-padding tokens, then L2-normalize.
  const hidden = result["last_hidden_state"];
  const [batch, seqLen, dim] = hidden.dims as number[];
  const data = hidden.data as Float32Array;
  const mask = attentionMask.data as BigInt64Array;
  const vectors: number[][] = [];
  for (let b = 0; b < batch; b++) {
    const sum = new Array<number>(dim).fill(0);
    let count = 0;
    for (let t = 0; t < seqLen; t++) {
      if (mask[b * seqLen + t] === 0n) continue;
      const off = (b * seqLen + t) * dim;
      for (let d = 0; d < dim; d++) sum[d] += data[off + d];
      count++;
    }
    const mean = sum.map((v) => v / count);
    const norm = Math.sqrt(mean.reduce((s, v) => s + v * v, 0));
    vectors.push(mean.map((v) => v / norm));
  }
  return vectors;
}
```

## Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 2.17 GB | Full fp32 model weights (sentence-transformers compatible) |
| onnx/model.onnx | 0.4 MB | ONNX proto (references external data file) |
| onnx/model.onnx_data | 2.16 GB | ONNX external weight data (fp32) |
| onnx/model_quantized.onnx | 542 MB | INT8 dynamic quantization (recommended for CPU) |
| tokenizer.json | 16 MB | Tokenizer (XLM-RoBERTa based) |
| 1_Pooling/config.json | – | Mean pooling config |
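
The ONNX snippets above assume `onnx/model_quantized.onnx` exists locally. One way to fetch it, sketched with the standard `huggingface_hub` download API:

```python
from huggingface_hub import hf_hub_download

# Downloads (and caches) the INT8 graph; pass the returned path to InferenceSession.
onnx_path = hf_hub_download(
    repo_id="serhiiseletskyi/intelli-embed-v2",
    filename="onnx/model_quantized.onnx",
)
```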

## Benchmark Results (run15, 2026-02-23)

Evaluated on a 6-suite benchmark including SW-engineering pairs, memory-domain pairs, dedup fitness, asymmetric retrieval, negation safety, and entity description retrieval.

| Provider | Sep | Grade | p50 (ms) |
|---|---|---|---|
| azure-large (cloud) | 0.515 | GOOD | ~110 |
| azure-small (cloud) | 0.511 | GOOD | ~80 |
| intelli-embed-v2 (INT8) | 0.484 | GOOD | ~10 |
| arctic-l-v2 (q8, base model) | 0.469 | GOOD | ~10 |
| intelli-ensemble | 0.450 | EXCELLENT | ~86 |

Sep = mean(PosSim) − mean(NegSim) on SW-engineering pairs; higher is better.
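
In code, Sep reduces to a difference of mean cosine similarities. A minimal sketch (the `sep` helper and its pair lists are illustrative, not the actual benchmark harness):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v2")

def sep(pos_pairs: list[tuple[str, str]], neg_pairs: list[tuple[str, str]]) -> float:
    """Sep = mean cosine similarity over positive pairs minus mean over negatives."""
    def mean_sim(pairs):
        left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
        right = model.encode([b for _, b in pairs], normalize_embeddings=True)
        return float((left * right).sum(axis=1).mean())  # row-wise cosine of unit vectors
    return mean_sim(pos_pairs) - mean_sim(neg_pairs)
```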

## OpenMemory Use-Case Metrics

| Metric | Value | Notes |
|---|---|---|
| memSep | 0.439 | EXCELLENT: personal memory discrimination |
| dedupGap | 0.102 | Near-dedup vs not-dedup cosine delta |
| asymSep | 0.240 | FAIR: short query → long memory retrieval |
| negGap | 0.026 | Negation safety (BM25 gate still recommended) |
| supSim | 0.672 | Supersede zone (~0.75–0.92 is ideal) |
| entSep | 0.491 | GOOD: entity description retrieval |
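
To make dedupGap and supSim concrete, here is one way a memory pipeline could act on cosine similarity. The cutoffs are illustrative assumptions derived from the supersede-zone note above, not shipped constants, and per the negGap row a BM25 gate should still run before this check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v2")

def memory_relation(new_memory: str, stored_memory: str) -> str:
    """Classify a new memory against a stored one by cosine similarity.
    Thresholds are illustrative starting points, not tuned constants."""
    a, b = model.encode([new_memory, stored_memory], normalize_embeddings=True)
    sim = float(a @ b)  # cosine, since both vectors are unit-normalized
    if sim >= 0.92:
        return "duplicate"   # above the supersede zone: drop or merge
    if sim >= 0.75:
        return "supersede"   # same fact, updated: replace the old node
    return "distinct"        # store as a new memory node
```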

## License

Apache 2.0, inherited from the base model Snowflake/snowflake-arctic-embed-l-v2.0.
