---
tags:
- mrl
- multilingual
---

# vectorizer.banana

This model is a vectorizer developed by Sinequa. It produces an embedding vector for a given passage or query. The passage vectors are stored in our vector index, and the query vector is used at query time to look up relevant passages in the index.

Model name: `vectorizer.banana`

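The retrieval flow described above can be sketched in a few lines of pure Python. This is a minimal illustration, not Sinequa's actual index implementation: the passage identifiers, toy 4-dimensional vectors, and `search` helper are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Passage vectors produced at indexing time (toy 4-dimensional examples).
index = {
    "passage_1": [0.1, 0.9, 0.0, 0.2],
    "passage_2": [0.8, 0.1, 0.3, 0.0],
}

def search(query_vector, index, top_k=1):
    # Rank passages by cosine similarity to the query vector.
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [passage_id for passage_id, _ in ranked[:top_k]]

print(search([0.7, 0.2, 0.3, 0.1], index))  # closest to passage_2
```

In production the same principle applies, only with the model's high-dimensional embeddings and an optimized vector index instead of a brute-force scan.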
## Supported Languages

Since this model is a distilled version of the [BGE-M3](https://huggingface.co/BAAI/bge-m3) model, it can theoretically handle 100+ languages.

## Scores

We measured the performance difference relative to the original [BGE-M3](https://huggingface.co/BAAI/bge-m3) on MS MARCO EN. Scores on well-known benchmarks (BEIR, MIRACL, MTEB, etc.) can be found directly in the BGE-M3 model card, in the "Dense" row. For other datasets, we expect performance to drop at roughly the same scale as observed on MS MARCO EN.

| Model                                    | Performance Relative to BGE-M3 |
|:-----------------------------------------|:------------------------------:|
| vectorizer.banana (1024 dimensions)      | 99.3%                          |
| vectorizer.banana (768 dimensions)       | 98.8%                          |
| vectorizer.banana (512 dimensions)       | 98.0%                          |
| **vectorizer.banana (256 dimensions\*)** | 95.7%                          |

\* *The default dimension within Sinequa*

## Inference Times

| GPU        | Quantization type | Batch size 1 | Batch size 32 |
|:-----------|:------------------|-------------:|--------------:|
| NVIDIA A10 | FP16              | 4.5 ms       | 43 ms         |
| NVIDIA T4  | FP16              | 2.5 ms       | 35 ms         |

## GPU Memory Usage

| Quantization type | Memory   |
|:------------------|---------:|
| FP16              | 1450 MiB |

Note that GPU memory usage only includes how much GPU memory the model itself consumes on an NVIDIA T4 GPU with a batch size of 32. It does not include the fixed amount of memory consumed by the ONNX Runtime upon initialization, which can range from roughly 0.5 to 1 GiB depending on the GPU used.

## Requirements

- Minimal Sinequa version: 11.11.0.2306
- [CUDA compute capability](https://developer.nvidia.com/cuda-gpus): above 5.0 (above 6.0 for FP16 use)

## Model Details

### Configuration

Note that this model is packaged with a default MRL cutoff of 256 dimensions. To use the full 1024 dimensions, or any other value, the `mrl-cutoff` parameter needs to be set.

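Conceptually, an MRL cutoff keeps only the first N dimensions of the full embedding and re-normalizes the result, so cosine similarities remain comparable across cutoff values. A minimal sketch of that idea (the `apply_mrl_cutoff` helper is illustrative, not the product's configuration mechanism):

```python
import math

def apply_mrl_cutoff(embedding, cutoff=256):
    # Keep only the first `cutoff` dimensions, then L2-normalize so that
    # cosine similarity stays meaningful on the truncated vector.
    truncated = embedding[:cutoff]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

full = [0.01 * (i + 1) for i in range(1024)]   # stand-in for a 1024-d embedding
reduced = apply_mrl_cutoff(full, cutoff=256)
print(len(reduced))  # 256
```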
### Training

This model used [BGE-M3](https://huggingface.co/BAAI/bge-m3), a compact and capable multilingual embedding model, as the backbone for distillation.

The original model has 24 layers, which were reduced to 5 layers. To obtain a low-dimensional output space (256 dimensions compared to the original 1024), [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) was used at training time.
To obtain a low dimensional output space (256 compared to the original 1024), [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) was used at training time. |