Ziv Embedder - Fast (ONNX)

ONNX export of sentence-transformers/all-MiniLM-L6-v2, built for Ziv, a lightweight local code intelligence tool.

Why ONNX?

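|              | Standard                        | This model       |
|--------------|---------------------------------|------------------|
| Runtime      | PyTorch + sentence-transformers | onnxruntime only |
| Install size | ~800MB                          | ~92MB            |
| Inference    | Same                            | Same             |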

No PyTorch. No sentence-transformers. Just onnxruntime.

Usage with Ziv

ziv init --model fast
ziv start

Usage standalone

Install dependencies

pip install onnxruntime tokenizers huggingface_hub hf_transfer numpy

Download model

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ziv-ai/embedder-fast-onnx",
    repo_type="model",
    local_dir=".ziv/models/embedder-fast-onnx",
    local_dir_use_symlinks=False,
)

Run inference

import os
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

model_dir = ".ziv/models/embedder-fast-onnx"

tokenizer = Tokenizer.from_file(os.path.join(model_dir, "tokenizer.json"))
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(pad_token="[PAD]", length=None)

session = ort.InferenceSession(
    os.path.join(model_dir, "model.onnx"),
    providers=["CPUExecutionProvider"]
)

texts = ["How does authentication work?", "Explain the login flow"]
encoded = tokenizer.encode_batch(texts)

input_ids      = np.array([e.ids            for e in encoded], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encoded], dtype=np.int64)
token_type_ids = np.array([e.type_ids       for e in encoded], dtype=np.int64)

outputs = session.run(None, {
    "input_ids":      input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})

# Mean pooling
token_embeddings = outputs[0]
mask = attention_mask[..., None].astype(np.float32)
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clip(min=1e-9)

# L2 normalize
norms = np.linalg.norm(embeddings, axis=1, keepdims=True).clip(min=1e-12)
embeddings = embeddings / norms

print(embeddings.shape)           # (2, 384)
print(embeddings @ embeddings.T)  # cosine similarity matrix (2x2)
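
The pooling and normalization steps above can be checked by hand on a tiny input. The following is a dependency-free sketch, not the code Ziv itself runs: the helper names `mean_pool` and `l2_normalize` and the token values are made up for illustration. It shows that a padded position (mask value 0) does not affect the pooled vector.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1; padded positions are skipped."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding row (the NumPy version clips instead)
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length, guarding against a zero norm."""
    norm = max(math.sqrt(sum(v * v for v in vector)), 1e-12)
    return [v / norm for v in vector]

# One "sentence": two real tokens and one padding token (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)  # average of the first two rows only
unit = l2_normalize(pooled)       # same direction, length 1
```

Here `pooled` is `[2.0, 3.0]`: the padding row is ignored, which is the role of `attention_mask` in the NumPy version.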

Model Details

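| Property             | Value                                  |
|----------------------|----------------------------------------|
| Base model           | sentence-transformers/all-MiniLM-L6-v2 |
| Embedding dimensions | 384                                    |
| Max sequence length  | 256                                    |
| Pooling strategy     | Mean pooling                           |
| Normalization        | L2                                     |
| ONNX opset           | 18                                     |
| Model size           | ~92MB                                  |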
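
Because the embeddings are L2-normalized, the dot product of two vectors equals their cosine similarity, so ranking documents against a query needs no extra normalization. The sketch below illustrates that idea with toy 2-dimensional unit vectors standing in for real 384-dimensional embeddings. The function name `rank_by_similarity` is hypothetical and not part of Ziv or any library.

```python
def rank_by_similarity(query, corpus, top_k=3):
    """Return (index, score) pairs for the top_k corpus vectors, best first.

    Assumes every vector is already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = [
        (index, sum(q * c for q, c in zip(query, vector)))
        for index, vector in enumerate(corpus)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy unit vectors (hypothetical values, not real model output).
query = [1.0, 0.0]
corpus = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranked = rank_by_similarity(query, corpus, top_k=2)
```

With these values the third vector ranks first (identical to the query) and the second vector follows with a score of 0.6.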

Files

| File                  | Description                          |
|-----------------------|--------------------------------------|
| model.onnx            | ONNX model weights and graph         |
| tokenizer.json        | Tokenizer vocabulary and rules       |
| tokenizer_config.json | Tokenizer settings                   |
| 1_Pooling/config.json | Pooling strategy config              |
| 2_Normalize/          | Signals L2 normalization is applied  |
| config.json           | Model architecture config            |
| modules.json          | Sentence Transformers pipeline order |
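
After downloading, it can be useful to confirm that the files listed here are actually present before creating an inference session. This is a minimal sketch using only the standard library. The helper name `missing_files` is hypothetical, and `2_Normalize/` is left out of the check on the assumption that, being a directory entry, it may contain no files.

```python
from pathlib import Path

# File names copied from the list above. "2_Normalize/" is skipped because
# it is a directory and may be empty (an assumption, not confirmed by the source).
EXPECTED_FILES = [
    "model.onnx",
    "tokenizer.json",
    "tokenizer_config.json",
    "1_Pooling/config.json",
    "config.json",
    "modules.json",
]

def missing_files(model_dir):
    """Return the expected file names that do not exist under model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# Uses the same local directory as the download example.
missing = missing_files(".ziv/models/embedder-fast-onnx")
```

An empty list means every expected file was found; anything else names what to re-download.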

License

This model is released under the Apache 2.0 License, the same license as the original sentence-transformers/all-MiniLM-L6-v2 model.

Original model by Microsoft and the sentence-transformers team.

Citation

If you use this in your work, please cite the original model:

@article{wang2020minilm,
  title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
  author={Wang, Wenhui and Wei, Furu and Dong, Li and Bao, Hangbo and Yang, Nan and Zhou, Ming},
  journal={arXiv preprint arXiv:2002.10957},
  year={2020}
}