# intelli-embed-v2
The best-performing local embedding model for GraphRAG and memory-augmented AI applications.
Built for apps that store, retrieve, and deduplicate personal memories in graph databases. intelli-embed-v2 achieves 98% of Azure text-embedding-3-small quality while running entirely on-device at ~10 ms per embedding. No API calls, no rate limits, no data leaving your infrastructure.
| Metric | Value |
|---|---|
| Sep (SW-engineering, 20 pairs) | 0.484 (GOOD) |
| vs Azure text-embedding-3-large | 94% |
| vs Azure text-embedding-3-small | 98% |
| Inference p50 (INT8 ONNX, CPU) | ~10 ms |
| Embedding dimension | 1024 |
| Max sequence length | 512 tokens |
| Model size (fp32 safetensors) | 2.17 GB |
| Model size (INT8 ONNX) | 542 MB |
## Training

Fine-tuned from Snowflake/snowflake-arctic-embed-l-v2.0 using a three-phase curriculum that distills knowledge from Azure text-embedding-3-large across 200k real-world sentences.

The three phases:
| Phase | Loss | Data | Duration |
|---|---|---|---|
| 1 | GISTEmbedLoss (mxbai as teacher) | 100k SW-engineering pairs | ~1.6h |
| 2 | MSE distillation from azure-large embeddings | 200k sentences | ~9 min |
| 3 | Hard-negative MultipleNegativesRankingLoss | 7107 mined triplets | ~77s |
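Phase 2's objective is simple to picture: the student's embedding for each sentence is pulled toward the teacher's (azure-large) embedding under a mean-squared-error loss. A minimal numpy illustration of that loss with random stand-in vectors (the actual training uses sentence-transformers' `MSELoss`; these arrays are not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1024))                       # teacher embeddings (stand-ins)
student = teacher + rng.normal(scale=0.1, size=(4, 1024))  # imperfect student outputs

# MSE distillation loss: average squared gap between student and teacher vectors
mse = float(np.mean((student - teacher) ** 2))
print(f"{mse:.4f}")
```

Driving this loss toward zero makes the student reproduce the teacher's embedding space on the distillation corpus.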
## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v2")
embeddings = model.encode(["Hello world", "Another sentence"])
print(embeddings.shape)  # (2, 1024)
```
### With ONNX Runtime (INT8, recommended for CPU inference)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("serhiiseletskyi/intelli-embed-v2")
session = ort.InferenceSession("onnx/model_quantized.onnx")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="np")
    out = session.run(None, {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]})[0]
    # Mean pool over non-padding tokens, then L2-normalize
    mask = enc["attention_mask"][..., None].astype(np.float32)
    pooled = (out * mask).sum(1) / mask.sum(1)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

vecs = embed(["Hello world", "Another sentence"])
print(vecs.shape)  # (2, 1024)
```
### With onnxruntime-node (Node.js / TypeScript)

```typescript
import * as ort from "onnxruntime-node";
import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("serhiiseletskyi/intelli-embed-v2");
const session = await ort.InferenceSession.create("onnx/model_quantized.onnx");

async function embed(texts: string[]): Promise<number[][]> {
  const enc = await tokenizer(texts, { padding: true, truncation: true, max_length: 512 });
  const inputIds = new ort.Tensor("int64", enc.input_ids.data, enc.input_ids.dims);
  const attentionMask = new ort.Tensor("int64", enc.attention_mask.data, enc.attention_mask.dims);
  const result = await session.run({ input_ids: inputIds, attention_mask: attentionMask });

  // Mean-pool last_hidden_state over non-padding tokens, then L2-normalize
  const hidden = result.last_hidden_state;
  const [batch, seqLen, dim] = hidden.dims;
  const data = hidden.data as Float32Array;
  const mask = enc.attention_mask.data as BigInt64Array;
  const vectors: number[][] = [];
  for (let b = 0; b < batch; b++) {
    const vec = new Float64Array(dim);
    let count = 0;
    for (let t = 0; t < seqLen; t++) {
      if (mask[b * seqLen + t] === 0n) continue;
      count++;
      for (let d = 0; d < dim; d++) vec[d] += data[(b * seqLen + t) * dim + d];
    }
    let norm = 0;
    for (let d = 0; d < dim; d++) { vec[d] /= count; norm += vec[d] ** 2; }
    vectors.push(Array.from(vec, (x) => x / Math.sqrt(norm)));
  }
  return vectors;
}
```
## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 2.17 GB | Full fp32 model weights (sentence-transformers compatible) |
| `onnx/model.onnx` | 0.4 MB | ONNX proto (references external data file) |
| `onnx/model.onnx_data` | 2.16 GB | ONNX external weight data (fp32) |
| `onnx/model_quantized.onnx` | 542 MB | INT8 dynamic quantization (recommended for CPU) |
| `tokenizer.json` | 16 MB | Tokenizer (XLM-RoBERTa based) |
| `1_Pooling/config.json` | – | Mean pooling config |
## Benchmark Results (run15, 2026-02-23)

Evaluated on a 6-suite benchmark covering SW-engineering pairs, memory-domain pairs, dedup fitness, asymmetric retrieval, negation safety, and entity description retrieval.
| Provider | Sep | Grade | p50ms |
|---|---|---|---|
| azure-large (cloud) | 0.515 | GOOD | ~110 |
| azure-small (cloud) | 0.511 | GOOD | ~80 |
| intelli-embed-v2 (INT8) | 0.484 | GOOD | ~10 |
| arctic-l-v2 (q8, base model) | 0.469 | GOOD | ~10 |
| intelli-ensemble | 0.450 | EXCELLENT | ~86 |
Sep = mean(PosSim) − mean(NegSim) on SW-engineering pairs; higher is better.
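The Sep score is straightforward to reproduce for your own pairs; a minimal sketch with toy 3-d vectors (the pair lists below are illustrative, not the benchmark data):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def separation(pos_pairs, neg_pairs):
    # Sep = mean cosine over positive pairs minus mean cosine over negative pairs
    pos = np.mean([cosine(a, b) for a, b in pos_pairs])
    neg = np.mean([cosine(a, b) for a, b in neg_pairs])
    return float(pos - neg)

# Toy illustration with hand-made 3-d "embeddings"
pos_pairs = [(np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0]))]
neg_pairs = [(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))]
print(round(separation(pos_pairs, neg_pairs), 3))  # 0.994
```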
## OpenMemory Use-Case Metrics

| Metric | Value | Notes |
|---|---|---|
| memSep | 0.439 | EXCELLENT; personal memory discrimination |
| dedupGap | 0.102 | Near-dedup vs not-dedup cosine delta |
| asymSep | 0.240 | FAIR; short query → long memory retrieval |
| negGap | 0.026 | Negation safety (BM25 gate still recommended) |
| supSim | 0.672 | Supersede zone (~0.75–0.92 is ideal) |
| entSep | 0.491 | GOOD; entity description retrieval |
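A sketch of how numbers like these can translate into routing decisions in a memory store: cosine similarity against an existing memory, with dedup and supersede cut-offs (0.92 and 0.75) borrowed from the supSim ideal zone above. The function name and thresholds are illustrative, not part of the model or any library; tune them on your own data.

```python
import numpy as np

def classify_memory(new_vec, existing_vec, dedup_t=0.92, supersede_t=0.75):
    # Cosine similarity between the incoming memory and an existing one
    sim = float(np.dot(new_vec, existing_vec) /
                (np.linalg.norm(new_vec) * np.linalg.norm(existing_vec)))
    if sim >= dedup_t:
        return "duplicate"   # near-identical: drop or merge
    if sim >= supersede_t:
        return "supersede"   # same topic, updated fact: replace the old memory
    return "new"             # unrelated: insert as a fresh node

print(classify_memory(np.array([1.0, 0.0]), np.array([1.0, 0.05])))  # duplicate
```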
## License

Apache 2.0, inherited from the base model Snowflake/snowflake-arctic-embed-l-v2.0.