---
license: gemma
base_model:
- google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: transformers.js
tags:
- text-embeddings-inference
---

# embeddinggemma-300m-ONNX-uint8

Update Sep. 20, 2025: I removed the `last_hidden_state` output from the model and left only the `sentence_embedding` output.
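
If you want to reproduce that step yourself, dropping a graph output with the `onnx` package looks roughly like this (a minimal sketch, not the exact script used here; paths are placeholders):

```python
# Minimal sketch: keep only the sentence_embedding graph output.
# Paths are placeholders.
import onnx

model = onnx.load("model.onnx")
keep = [o for o in model.graph.output if o.name != "last_hidden_state"]
del model.graph.output[:]
model.graph.output.extend(keep)
onnx.save(model, "model_single_output.onnx")
```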

This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.

This model is compatible with Qdrant; I haven't tested which other vector DBs it works with.
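
For example, here's a minimal sketch of a Qdrant collection that stores these vectors natively as uint8 (this assumes a recent Qdrant and qdrant-client with `Datatype.UINT8` support; the URL and collection name are placeholders):

```python
from qdrant_client import QdrantClient, models

# URL and collection name are placeholders
client = QdrantClient(url="http://localhost:6333")

# Qdrant (1.9+) can store uint8 vectors natively, so the model's output
# can be upserted as-is with no float conversion
client.create_collection(
    collection_name="embeddinggemma_uint8",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,
    ),
)
```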

For calibration data I used my own multilingual dataset of around 1.5M tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5M tokens through the model and logged the highest/lowest values seen, which gave a range of -0.19112960994243622 to 0.22116543352603912.
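
The logging pass looks roughly like this (a sketch, not the exact script used here: `calibration_texts` stands in for the dataset above, and the repo/paths are assumptions):

```python
# Sketch of the calibration pass over the base f32-output model.
# calibration_texts is a placeholder for the dataset linked above.
import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("onnx-community/embeddinggemma-300m-ONNX")
session = rt.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

lo, hi = float("inf"), float("-inf")
for text in calibration_texts:
    inputs = tokenizer(text, truncation=True, return_tensors="np")
    (emb,) = session.run(["sentence_embedding"], dict(inputs))
    lo = min(lo, float(emb.min()))
    hi = max(hi, float(emb.max()))
print(lo, hi)
```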

So I hacked on the `sentence_embedding` output of the ONNX model and added a QuantizeLinear node based on the range -0.22116543352603912 to 0.22116543352603912 to keep it symmetric. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric will have to do.
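
The graph surgery can be done with the `onnx` helper API, roughly like this (a sketch of the idea, not the exact script used here; paths are placeholders):

```python
# Sketch of the graph surgery (not the exact script used here).
# Assumes the f32 model produces a tensor named "sentence_embedding".
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

ABS_MAX = 0.22116543352603912  # widest absolute value seen during calibration

model = onnx.load("model_quantized.onnx")
graph = model.graph

# Symmetric uint8 mapping: scale = abs_max / 127, zero point pinned to 128
graph.initializer.extend([
    numpy_helper.from_array(np.array(ABS_MAX / 127.0, dtype=np.float32), "emb_scale"),
    numpy_helper.from_array(np.array(128, dtype=np.uint8), "emb_zero_point"),
])

# Rename the f32 tensor so the quantized result can reuse the original output name
for node in graph.node:
    for i, name in enumerate(node.output):
        if name == "sentence_embedding":
            node.output[i] = "sentence_embedding_f32"

graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=["sentence_embedding_f32", "emb_scale", "emb_zero_point"],
    outputs=["sentence_embedding"],
    name="quantize_sentence_embedding",
))

# The graph output keeps its name and shape; only the element type changes
for out in graph.output:
    if out.name == "sentence_embedding":
        out.type.tensor_type.elem_type = TensorProto.UINT8

onnx.checker.check_model(model)
onnx.save(model, "model.onnx")
```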

Note: this model is no longer compatible with SentenceTransformer (or at least I wasn't able to figure it out right away); it messes with the uint8 output.

# Benchmarks

For benchmarking with MTEB I dequantize the uint8 output back to the f32 that MTEB expects (`f32 = (uint8 - 128) * scale`), as shown in the example code below.

These retrieval benchmarks are a little wild. All the benchmarks used the `task: search result | query: ...` prompt format. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.

![]()

![]()

# Example Benchmark Code

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer


class CustomModel:
    def __init__(self) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession("C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx", providers=["CPUExecutionProvider"])
        # Scale used by the QuantizeLinear node: abs_max / 127 (symmetric uint8, zero point 128)
        self.scale = 0.22116543352603912 / 127.0

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        quantized = np.array(quantized)
        # Invert the symmetric uint8 quantization: f32 = (uint8 - 128) * scale
        dequant = (quantized.astype(np.float32) - 128) * scale
        # session.run() wraps the batch in a list, so drop the leading axis
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        # EmbeddingGemma expects queries to carry a task prompt prefix
        if prompt_type == PromptType.query:
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)
```

# Example FastEmbed Usage

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Register this repo as a custom model; pooling and normalization are
# already baked into the ONNX graph, so both are disabled here
TextEmbedding.add_custom_model(
    model="embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/embeddinggemma-300m-ONNX-uint8"),
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed("test"))
print(embeddings)
```