Ziv Embedder — Fast (ONNX)

ONNX export of sentence-transformers/all-MiniLM-L6-v2, built for Ziv — a lightweight local code intelligence tool.

Why ONNX?

	Standard	This model
Runtime	PyTorch + sentence-transformers	onnxruntime only
Install size	~800MB	~92MB
Inference	Same	Same

No PyTorch. No sentence-transformers. Just onnxruntime.

Usage with Ziv

ziv init --model fast
ziv start

Usage standalone

Install dependencies

pip install onnxruntime tokenizers huggingface_hub hf_transfer numpy

Download model

from huggingface_hub import snapshot_download

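# local_dir_use_symlinks is deprecated and ignored by recent huggingface_hub
# releases, so it is omitted here; files are written directly into local_dir.
snapshot_download(
    repo_id="ziv-ai/embedder-fast-onnx",
    repo_type="model",
    local_dir=".ziv/models/embedder-fast-onnx",
)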

Run inference

import os
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

model_dir = ".ziv/models/embedder-fast-onnx"

tokenizer = Tokenizer.from_file(os.path.join(model_dir, "tokenizer.json"))
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_token="[PAD]", length=None)

session = ort.InferenceSession(
    os.path.join(model_dir, "model.onnx"),
    providers=["CPUExecutionProvider"]
)

texts = ["How does authentication work?", "Explain the login flow"]
encoded = tokenizer.encode_batch(texts)

input_ids      = np.array([e.ids            for e in encoded], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encoded], dtype=np.int64)
token_type_ids = np.array([e.type_ids       for e in encoded], dtype=np.int64)

outputs = session.run(None, {
    "input_ids":      input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})

# Mean pooling
token_embeddings = outputs[0]
mask = attention_mask[..., None].astype(np.float32)
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clip(min=1e-9)

# L2 normalize
norms = np.linalg.norm(embeddings, axis=1, keepdims=True).clip(min=1e-12)
embeddings = embeddings / norms

print(embeddings.shape)           # (2, 384)
print(embeddings @ embeddings.T)  # cosine similarity matrix (2x2)
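
Because the embeddings are L2-normalized, a dot product is already a cosine similarity, so ranking documents against a query reduces to one matrix-vector product. The following is a minimal sketch of that idea, not part of Ziv itself: `top_k` is a hypothetical helper name, and the toy 3-dimensional vectors stand in for real 384-dimensional model output.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows of doc_matrix closest to query_vec.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for real embeddings (illustrative values only).
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
], dtype=np.float32)
query = np.array([0.8, 0.6, 0.0], dtype=np.float32)

results = top_k(query, docs, k=2)
print(results)
```

With real data, `docs` would be the `embeddings` array computed above for the indexed texts, and `query` a single row produced the same way for the search string.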

Model Details

Property	Value
Base model	sentence-transformers/all-MiniLM-L6-v2
Embedding dimensions	384
Max sequence length	256
Pooling strategy	Mean pooling
Normalization	L2
ONNX opset	18
Model size	~92MB

Files

File	Description
`model.onnx`	ONNX model weights and graph
`tokenizer.json`	Tokenizer vocabulary and rules
`tokenizer_config.json`	Tokenizer settings
`1_Pooling/config.json`	Pooling strategy config
`2_Normalize/`	Signals L2 normalization is applied
`config.json`	Model architecture config
`modules.json`	Sentence Transformers pipeline order
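
Of the files in the table above, the standalone inference snippet reads only `model.onnx` and `tokenizer.json`. A quick pre-flight check after downloading can catch an incomplete snapshot before onnxruntime raises a less obvious error. This is a sketch only: `missing_files` is a hypothetical helper, and the directory path is the one assumed in the download step.

```python
from pathlib import Path

# The two files the standalone inference code actually opens.
REQUIRED_FOR_INFERENCE = ["model.onnx", "tokenizer.json"]

def missing_files(model_dir, required=REQUIRED_FOR_INFERENCE):
    """Return the names in `required` that do not exist under model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).exists()]

missing = missing_files(".ziv/models/embedder-fast-onnx")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All files needed for inference are present.")
```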

License

This model is released under the Apache 2.0 License, the same license as the original sentence-transformers/all-MiniLM-L6-v2 model.

Original model by Microsoft and the sentence-transformers team.

Citation

If you use this in your work, please cite the original model:

@article{wang2020minilm,
  title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
  author={Wang, Wenhui and Wei, Furu and Dong, Li and Bao, Hangbo and Yang, Nan and Zhou, Ming},
  journal={arXiv preprint arXiv:2002.10957},
  year={2020}
}
