# all-MiniLM-L6-v2 ONNX (with attention outputs)
A custom ONNX export of sentence-transformers/all-MiniLM-L6-v2
that includes attention weights as graph outputs, packaged as a single file
for in-browser inference via Transformers.js.
Used by forwardpass.dev — an interactive visual guide to LLMs — to power live sentence embeddings, PCA-based semantic similarity visualization, and real attention heatmaps.
## Why this export
Standard ONNX exports of sentence-transformers strip out attention weights because they aren't needed for inference. This export preserves them as named graph outputs so the model can be used for educational visualizations of how bidirectional (encoder) attention works.
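The exported attention tensors are ordinary scaled-dot-product softmax weights. As a rough illustration of what each `[seq, seq]` slice contains (a minimal numpy sketch, not the model's actual code), one head's weight matrix can be computed like this; note the absence of a causal mask, which is what makes encoder attention bidirectional:

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights for one head.

    Bidirectional (encoder) attention: every token attends to every
    token, so no causal mask is applied before the softmax.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # [seq, seq]
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)      # each row sums to 1

rng = np.random.default_rng(0)
seq, d = 5, 32
w = attention_weights(rng.normal(size=(seq, d)), rng.normal(size=(seq, d)))
print(w.shape)                         # (5, 5)
print(np.allclose(w.sum(axis=-1), 1))  # True
```

Row `i` of such a matrix is exactly what an attention heatmap visualizes: how much token `i` attends to every other token.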
## Model details
- Base model: sentence-transformers/all-MiniLM-L6-v2 (22M parameters)
- Architecture: BERT encoder (bidirectional, no causal mask)
- Layers: 6 transformer blocks
- Heads: 12 attention heads per layer
- Hidden size: 384
- Vocabulary: 30,522 tokens (WordPiece)
- Precision: fp32 (no quantization)
- File size: ~87 MB
- Format: Single-file ONNX (no external data)
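The 22M parameter figure can be sanity-checked from the numbers above, plus two standard BERT defaults not listed here (FFN intermediate size of 4× the hidden size, and 512 max position embeddings), which are assumptions in this back-of-the-envelope tally:

```python
hidden, layers = 384, 6
vocab, max_pos, ffn = 30_522, 512, 4 * 384  # ffn/max_pos: standard BERT defaults

# Embeddings: word + position + token-type tables, plus one LayerNorm
emb = (vocab + max_pos + 2) * hidden + 2 * hidden

# Per encoder layer: Q/K/V/output projections (+ biases), FFN, two LayerNorms
attn = 4 * (hidden * hidden + hidden)
mlp = hidden * ffn + ffn + ffn * hidden + hidden
norms = 2 * 2 * hidden
per_layer = attn + mlp + norms

total = emb + layers * per_layer
print(f"{total / 1e6:.1f}M")  # ~22.6M
```

At fp32 (4 bytes per parameter) this also lines up with the ~87 MB file size.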
## ONNX inputs / outputs

**Inputs**

| Name | Shape | Description |
|---|---|---|
| `input_ids` | `[batch, seq]` | Token IDs |
| `attention_mask` | `[batch, seq]` | 1 for real tokens, 0 for padding |
| `token_type_ids` | `[batch, seq]` | All zeros for single-segment input |

**Outputs**

| Name | Shape | Description |
|---|---|---|
| `last_hidden_state` | `[batch, seq, 384]` | Per-token contextual embeddings |
| `attentions.0` ... `attentions.5` | `[batch, 12, seq, seq]` | Attention weights per layer |
## Usage
### Transformers.js
```js
import { AutoTokenizer, AutoModel } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("dbernsohn/all-MiniLM-L6-v2-onnx");
const model = await AutoModel.from_pretrained("dbernsohn/all-MiniLM-L6-v2-onnx", {
  dtype: "fp32",
});

const inputs = tokenizer("This is a test sentence", { padding: true });
const output = await model(inputs);

// output.last_hidden_state: [1, seq, 384]
// output["attentions.0"] ... output["attentions.5"]: [1, 12, seq, seq]

// Mean-pool to get a sentence embedding (384-dim).
// Assumes batch size 1 with no padding; for padded batches,
// weight each position by attention_mask instead.
const hidden = output.last_hidden_state;
const [, seqLen, hiddenDim] = hidden.dims;
const data = hidden.data;

const meanEmbedding = new Array(hiddenDim).fill(0);
for (let t = 0; t < seqLen; t++) {
  for (let d = 0; d < hiddenDim; d++) {
    meanEmbedding[d] += data[t * hiddenDim + d];
  }
}
for (let d = 0; d < hiddenDim; d++) meanEmbedding[d] /= seqLen;
```
### Python (onnxruntime)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbernsohn/all-MiniLM-L6-v2-onnx")
session = ort.InferenceSession("onnx/model.onnx")

inputs = tokenizer("This is a test sentence", return_tensors="np", padding=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    "token_type_ids": inputs.get("token_type_ids", np.zeros_like(inputs["input_ids"])).astype(np.int64),
})

last_hidden_state = outputs[0]  # [1, seq, 384]
attentions = outputs[1:]        # 6 tensors of [1, 12, seq, seq]

# Mean-pool to sentence embedding (assumes a single unpadded sequence)
sentence_embedding = last_hidden_state.mean(axis=1)  # [1, 384]
```
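The plain `mean(axis=1)` above is fine for a single unpadded sentence, but with a padded batch the padding positions would dilute the average. A mask-weighted variant handles both cases; this sketch uses dummy numpy arrays standing in for the real model outputs:

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, counting only real (non-padding) tokens."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # [batch, seq, 1]
    summed = (last_hidden_state * mask).sum(axis=1)                   # [batch, hidden]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # real tokens per row
    return summed / counts

# Dummy stand-ins for model outputs: batch of 2, seq len 4, hidden 384
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 384)).astype(np.float32)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # second sequence has 2 padding tokens

emb = masked_mean_pool(hidden, mask)
print(emb.shape)  # (2, 384)
# The second row averages only the two real positions:
print(np.allclose(emb[1], hidden[1, :2].mean(axis=0)))  # True
```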
## Reproducing this export

Install dependencies:

```bash
pip install torch transformers "optimum[onnxruntime]" onnx onnxscript sentence-transformers
```

Then run the script below (also available at `scripts/export_model.py` in the forwardpass.dev repo):
```python
import shutil
from pathlib import Path

import torch
import onnx
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
OUTPUT_DIR = Path("exported-model")

if OUTPUT_DIR.exists():
    shutil.rmtree(OUTPUT_DIR)
OUTPUT_DIR.mkdir()
(OUTPUT_DIR / "onnx").mkdir()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, attn_implementation="eager")
model.eval()

# Wrapper that returns hidden states + attentions as separate outputs
class AttentionWrapper(torch.nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.base(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            output_attentions=True,
        )
        return (out.last_hidden_state,) + out.attentions

wrapper = AttentionWrapper(model)
wrapper.eval()

dummy = tokenizer("Hello world", return_tensors="pt")
input_ids = dummy["input_ids"]
attention_mask = dummy["attention_mask"]
token_type_ids = dummy.get("token_type_ids", torch.zeros_like(input_ids))

with torch.no_grad():
    test_out = wrapper(input_ids, attention_mask, token_type_ids)
num_attn_layers = len(test_out) - 1

output_names = ["last_hidden_state"]
dynamic_axes = {
    "input_ids": {0: "batch", 1: "seq"},
    "attention_mask": {0: "batch", 1: "seq"},
    "token_type_ids": {0: "batch", 1: "seq"},
    "last_hidden_state": {0: "batch", 1: "seq"},
}
for i in range(num_attn_layers):
    name = f"attentions.{i}"
    output_names.append(name)
    dynamic_axes[name] = {0: "batch", 2: "seq", 3: "seq"}

onnx_path = OUTPUT_DIR / "onnx" / "model.onnx"
torch.onnx.export(
    wrapper,
    (input_ids, attention_mask, token_type_ids),
    str(onnx_path),
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=output_names,
    dynamic_axes=dynamic_axes,
    opset_version=17,
    do_constant_folding=True,
    dynamo=False,
)

# Merge external data into a single file
data_file = onnx_path.parent / "model.onnx.data"
if data_file.exists():
    m = onnx.load(str(onnx_path), load_external_data=True)
    onnx.save_model(m, str(onnx_path), save_as_external_data=False)
    data_file.unlink()

# Save tokenizer + config
tokenizer.save_pretrained(str(OUTPUT_DIR))
model.config.save_pretrained(str(OUTPUT_DIR))
```
## Credits
- Base model: sentence-transformers/all-MiniLM-L6-v2 by the sentence-transformers team
- Original BERT: Google Research
- MiniLM paper: *MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers*
- Export & hosting: Dor Bernsohn for forwardpass.dev
## License
Apache 2.0 (inherited from the base model).