# PhraseBERT ONNX
ONNX export of whaleloops/phrase-bert for lightweight inference using ONNX Runtime — no PyTorch or Transformers required.
## Model Details
- Base model: whaleloops/phrase-bert (BERT-base, 12 layers, 768 hidden dim)
- Pooling: Mean pooling (attention-mask weighted)
- Format: ONNX (the graph's expected inputs and outputs can be checked with the sketch below)
- Size: ~416 MB
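To double-check the exported graph before wiring up inference, ONNX Runtime can list the model's input and output signatures. A minimal sketch, assuming `model.onnx` has already been downloaded locally (e.g. via `snapshot_download` as in the Usage section below):

```python
import onnxruntime as ort

# Assumes model.onnx is available locally (see the Usage section for downloading)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Inputs: input_ids, attention_mask, token_type_ids (int64, [batch, seq_len])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Output: token embeddings of shape [batch, seq_len, 768]
for out in session.get_outputs():
    print(out.name, out.shape, out.type)
```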
## Usage
Install the dependencies (no PyTorch or Transformers needed; `huggingface_hub` is only used to download the model files):

```bash
pip install onnxruntime tokenizers numpy huggingface_hub
```
Download the model and run inference:

```python
from huggingface_hub import snapshot_download
# Download the model
model_dir = snapshot_download("langminer/phrase-bert-onnx")
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer
# Load model and tokenizer
session = ort.InferenceSession(f"{model_dir}/model.onnx", providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)
# Encode phrases
phrases = ["play an active role", "participate actively", "machine learning"]
encodings = tokenizer.encode_batch(phrases)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.array([e.type_ids for e in encodings], dtype=np.int64)
# Run inference
outputs = session.run(None, {
"input_ids": input_ids,
"attention_mask": attention_mask,
"token_type_ids": token_type_ids,
})
token_embeddings = outputs[0] # (batch, seq_len, 768)
# Mean pooling
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = np.sum(token_embeddings * mask, axis=1) / np.sum(mask, axis=1)
print(embeddings.shape)  # (3, 768)
```
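Phrase embeddings are typically compared with cosine similarity. A minimal sketch continuing from the `embeddings` array computed above:

```python
# Cosine similarity between the pooled phrase embeddings from the example above
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
similarity = normalized @ normalized.T

# The paraphrase pair should score higher than the unrelated pair
print(similarity[0, 1])  # "play an active role" vs. "participate actively"
print(similarity[0, 2])  # "play an active role" vs. "machine learning"
```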
## Citation
```bibtex
@inproceedings{wang2021phrase,
  title={Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration},
  author={Wang, Shufan and Thompson, Laure and Iyyer, Mohit},
  booktitle={EMNLP},
  year={2021}
}
```