---
license: gemma
base_model:
- google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: transformers.js
tags:
- text-embeddings-inference
---

# embeddinggemma-300m-ONNX-uint8

Update Sep. 20, 2025: I removed the `last_hidden_state` output from the model and left only the `sentence_embedding` output.
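
If you want to reproduce that step yourself, dropping a graph output with the `onnx` package looks roughly like this (a minimal sketch, not the exact script used here; paths are placeholders):

```python
# Minimal sketch: keep only the sentence_embedding graph output.
# Paths are placeholders.
import onnx

model = onnx.load("model.onnx")
keep = [o for o in model.graph.output if o.name != "last_hidden_state"]
del model.graph.output[:]
model.graph.output.extend(keep)
onnx.save(model, "model_single_output.onnx")
```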

This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.

This model is compatible with Qdrant; I haven't tested which other vector DBs it works with.
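
For example, here's a minimal sketch of a Qdrant collection that stores these vectors natively as uint8 (this assumes a recent Qdrant and qdrant-client with `Datatype.UINT8` support; the URL and collection name are placeholders):

```python
from qdrant_client import QdrantClient, models

# URL and collection name are placeholders
client = QdrantClient(url="http://localhost:6333")

# Qdrant (1.9+) can store uint8 vectors natively, so the model's output
# can be upserted as-is with no float conversion
client.create_collection(
    collection_name="embeddinggemma_uint8",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,
    ),
)
```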

For calibration data I used my own multilingual dataset of around 1.5M tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5M tokens through the model and logged the highest/lowest values seen, which gave a range of -0.19112960994243622 to 0.22116543352603912.
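
The logging pass looks roughly like this (a sketch, not the exact script used here: `calibration_texts` stands in for the dataset above, and the repo/paths are assumptions):

```python
# Sketch of the calibration pass over the base f32-output model.
# calibration_texts is a placeholder for the dataset linked above.
import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("onnx-community/embeddinggemma-300m-ONNX")
session = rt.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

lo, hi = float("inf"), float("-inf")
for text in calibration_texts:
    inputs = tokenizer(text, truncation=True, return_tensors="np")
    (emb,) = session.run(["sentence_embedding"], dict(inputs))
    lo = min(lo, float(emb.min()))
    hi = max(hi, float(emb.max()))
print(lo, hi)
```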

So I hacked on the `sentence_embedding` output of the ONNX model and added a QuantizeLinear node based on the range -0.22116543352603912 to 0.22116543352603912 to keep it symmetric. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric will have to do.
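
The graph surgery can be done with the `onnx` helper API, roughly like this (a sketch of the idea, not the exact script used here; paths are placeholders):

```python
# Sketch of the graph surgery (not the exact script used here).
# Assumes the f32 model produces a tensor named "sentence_embedding".
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

ABS_MAX = 0.22116543352603912  # widest absolute value seen during calibration

model = onnx.load("model_quantized.onnx")
graph = model.graph

# Symmetric uint8 mapping: scale = abs_max / 127, zero point pinned to 128
graph.initializer.extend([
    numpy_helper.from_array(np.array(ABS_MAX / 127.0, dtype=np.float32), "emb_scale"),
    numpy_helper.from_array(np.array(128, dtype=np.uint8), "emb_zero_point"),
])

# Rename the f32 tensor so the quantized result can reuse the original output name
for node in graph.node:
    for i, name in enumerate(node.output):
        if name == "sentence_embedding":
            node.output[i] = "sentence_embedding_f32"

graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=["sentence_embedding_f32", "emb_scale", "emb_zero_point"],
    outputs=["sentence_embedding"],
    name="quantize_sentence_embedding",
))

# The graph output keeps its name and shape; only the element type changes
for out in graph.output:
    if out.name == "sentence_embedding":
        out.type.tensor_type.elem_type = TensorProto.UINT8

onnx.checker.check_model(model)
onnx.save(model, "model.onnx")
```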

Note: this model is no longer compatible with SentenceTransformer (or at least I wasn't able to figure it out right away); it messes with the uint8 output.

# Benchmarks

For benchmarking with MTEB I dequantize the uint8 output back to the f32 that MTEB expects (`f32 = (uint8 - 128) * scale`), as shown in the example code below.

These retrieval benchmarks are a little wild. All the benchmarks used the `task: search result | query: ...` prompt format. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.

![]()

![]()

# Example Benchmark Code

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer


class CustomModel:
    def __init__(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession("C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx", providers=["CPUExecutionProvider"])
        # Scale used by the QuantizeLinear node: abs_max / 127 (symmetric uint8, zero point 128)
        self.scale = 0.22116543352603912 / 127.0

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        quantized = np.array(quantized)
        # Invert the symmetric uint8 quantization: f32 = (uint8 - 128) * scale
        dequant = (quantized.astype(np.float32) - 128) * scale
        # session.run() wraps the batch in a list, so drop the leading axis
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        # EmbeddingGemma expects queries to carry a task prompt prefix
        if prompt_type == PromptType.query:
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)
```

# Example FastEmbed Usage

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Register this repo as a custom model; pooling and normalization are
# already baked into the ONNX graph, so both are disabled here
TextEmbedding.add_custom_model(
    model="embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/embeddinggemma-300m-ONNX-uint8"),
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed("test"))
print(embeddings)
```