---
license: gemma
base_model:
- google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: transformers.js
tags:
- text-embeddings-inference
---

# embeddinggemma-300m-ONNX-uint8

Update Sep. 20, 2025: I removed the `last_hidden_state` output from the model and left only the `sentence_embedding` one.

This is based on https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX/blob/main/onnx/model_quantized.onnx, but it outputs a uint8 tensor instead of an f32 one.

This model is compatible with Qdrant; I haven't checked which other vector DBs accept uint8 embeddings.

For calibration data I used my own multilingual dataset of around 1.5M tokens: https://github.com/electroglyph/dataset_build

I ran all 1.5M tokens through the model and logged the highest and lowest embedding values seen, which gave a range of -0.19112960994243622 to 0.22116543352603912.
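The calibration pass boils down to tracking a running min/max over every embedding the model produces. A minimal sketch of that loop, with a hypothetical `embed_batch` function standing in for the real f32 model:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_batch(texts: list[str]) -> np.ndarray:
    # Hypothetical stand-in for the real f32 model: returns (batch, dim) embeddings
    return rng.uniform(-0.2, 0.22, size=(len(texts), 768)).astype(np.float32)

# Track a running min/max across the whole calibration set
lo, hi = np.inf, -np.inf
for batch in [["doc one", "doc two"], ["doc three"]]:
    emb = embed_batch(batch)
    lo = min(lo, float(emb.min()))
    hi = max(hi, float(emb.max()))

max_abs = max(abs(lo), abs(hi))  # symmetric bound used to derive the scale
print(lo, hi, max_abs)
```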

So I hacked on the `sentence_embedding` output of the ONNX model and added a QuantizeLinear node based on the range -0.22116543352603912 to 0.22116543352603912 to keep the quantization symmetric. It would be cool if Qdrant let me specify my own zero point for a little more accuracy, but symmetric will have to do.
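For reference, a QuantizeLinear node with these parameters maps each f32 value to uint8 as `clip(round(x / scale) + zero_point, 0, 255)`, with scale = 0.22116543352603912 / 127 and zero_point = 128. A numpy emulation of the round trip:

```python
import numpy as np

MAX_ABS = 0.22116543352603912
SCALE = MAX_ABS / 127.0   # symmetric: 127 positive steps reach +MAX_ABS
ZERO_POINT = 128          # uint8 midpoint, so 0.0 quantizes to exactly 128

def quantize(x: np.ndarray) -> np.ndarray:
    # Same math as ONNX QuantizeLinear (round half to even, then saturate)
    return np.clip(np.round(x / SCALE) + ZERO_POINT, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - ZERO_POINT) * SCALE

x = np.array([-MAX_ABS, -0.1, 0.0, 0.1, MAX_ABS], dtype=np.float32)
q = quantize(x)
print(q)  # 0.0 maps to 128; -MAX_ABS and +MAX_ABS map to 1 and 255
print(np.abs(dequantize(q) - x).max())  # worst-case error is under SCALE / 2
```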

Note: this model is no longer compatible with SentenceTransformer, or at least I wasn't able to figure it out right away; SentenceTransformer's post-processing mangles the uint8 output.

# Benchmarks

For benchmarking with MTEB I dequantize the uint8 output to the f32 that MTEB expects.

These retrieval benchmarks are a little wild. All benchmarks used the `task: search result | query: ...` prompt format. I have no idea why this model benchmarks better than the base model on most retrieval tasks, but I'll take it.

![mteb retrieval results](./mteb_results_by_task.png)

![mteb totals](./mteb_total_scores.png)

# Example Benchmark Code

```python
import mteb
from mteb.encoder_interface import PromptType
import numpy as np
import onnxruntime as rt
from transformers import AutoTokenizer

class CustomModel:
    def __init__(self) -> None:
        # Adjust these paths to your local copy of the model
        self.tokenizer = AutoTokenizer.from_pretrained("C:/LLM/embeddinggemma-300m-ONNX-uint8")
        self.session = rt.InferenceSession("C:/LLM/embeddinggemma-300m-ONNX-uint8/onnx/model.onnx", providers=["CPUExecutionProvider"])
        # Symmetric quantization scale baked into the model: max_abs / 127
        self.scale = 0.22116543352603912 / 127.0

    def dequantize(self, quantized: list | np.ndarray, scale: float) -> np.ndarray:
        # Invert QuantizeLinear: subtract the zero point (128), then rescale
        quantized = np.array(quantized)
        dequant = (quantized.astype(np.float32) - 128) * scale
        # session.run wraps the output batch in a list, so drop the leading axis
        if dequant.ndim == 3 and dequant.shape[0] == 1:
            return np.squeeze(dequant, axis=0)
        return dequant

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        # Queries need the same prompt format used for the benchmarks above
        if prompt_type == PromptType.query:
            sentences = [f"task: search result | query: {s}" for s in sentences]
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="np")
        q = self.session.run(["sentence_embedding"], dict(inputs))
        return self.dequantize(q, self.scale)


model = CustomModel()
benchmark = mteb.get_benchmark("NanoBEIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, corpus_chunk_size=128)
for r in results:
    print(r)
```

# Example FastEmbed Usage

```python
from fastembed import TextEmbedding
from fastembed.common.model_description import PoolingType, ModelSource

# Pooling and normalization are disabled because the ONNX graph already
# outputs a finished sentence embedding (as uint8)
TextEmbedding.add_custom_model(
    model="embeddinggemma-300m-ONNX-uint8",
    pooling=PoolingType.DISABLED,
    normalization=False,
    sources=ModelSource(hf="electroglyph/embeddinggemma-300m-ONNX-uint8"),
    dim=768,
    model_file="onnx/model.onnx",
)

model = TextEmbedding(model_name="embeddinggemma-300m-ONNX-uint8")
embeddings = list(model.embed("test"))
print(embeddings)
```
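If you want to compare embeddings outside a vector DB, remember that cosine similarity on the raw uint8 values is skewed by the zero-point offset, so dequantize first. A sketch, with toy uint8 vectors standing in for real model output (the scale and zero point are the ones baked into the model):

```python
import numpy as np

SCALE = 0.22116543352603912 / 127.0
ZERO_POINT = 128

def cosine_sim_uint8(a: np.ndarray, b: np.ndarray) -> float:
    # Dequantize both vectors, then take the normalized dot product
    af = (a.astype(np.float32) - ZERO_POINT) * SCALE
    bf = (b.astype(np.float32) - ZERO_POINT) * SCALE
    return float(af @ bf / (np.linalg.norm(af) * np.linalg.norm(bf)))

# Toy vectors in place of real model output
a = np.array([200, 50, 128], dtype=np.uint8)
b = np.array([190, 60, 130], dtype=np.uint8)
print(cosine_sim_uint8(a, b))
```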