gte-multilingual-reranker-base-GGUF

!!! Experimental: supported only by gpustack/llama-box v0.0.72+ !!!

Model creator: Alibaba-NLP
Original model: gte-multilingual-reranker-base
GGUF quantization: based on llama.cpp f4d2b, as patched by llama-box

gte-multilingual-reranker-base

The gte-multilingual-reranker-base model is the first reranker model in the GTE family of models, featuring several key attributes:

High Performance: Achieves state-of-the-art (SOTA) results in multilingual retrieval tasks and multi-task representation model evaluations when compared to reranker models of similar size.
Training Architecture: Built on an encoder-only transformer architecture, resulting in a smaller model size. Unlike previous models based on decoder-only LLM architectures (e.g., gte-qwen2-1.5b-instruct), this model has lower hardware requirements for inference and offers a 10x increase in inference speed.
Long Context: Supports text lengths up to 8192 tokens.
Multilingual Capability: Supports over 70 languages.

Model Information

Model Size: 306M
Max Input Tokens: 8192

Usage

For acceleration, it is recommended to install xformers and enable unpadding; see enable-unpadding-and-xformers.
How to use it offline: new-impl/discussions/2

Using Hugging Face Transformers (transformers>=4.36.0)

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name_or_path = "Alibaba-NLP/gte-multilingual-reranker-base"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, trust_remote_code=True,
    torch_dtype=torch.float16
)
model.eval()

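# Each pair is [query, candidate document].
# "中国的首都在哪儿" means "Where is the capital of China?"; "北京" means "Beijing".
pairs = [["中国的首都在哪儿", "北京"], ["what is the capital of China?", "北京"], ["how to implement quick sort in python?", "Introduction of quick sort"]]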
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

# tensor([1.2315, 0.5923, 0.3041])
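
The scores above are raw logits, one per query-document pair, where a higher value means the pair was judged more relevant. In a reranking pipeline they would typically be used to reorder the candidates retrieved for a single query. The sketch below illustrates that step in plain Python. The helper names `rank_by_score` and `sigmoid`, the candidate strings, and the logit values are all hypothetical and not part of the model's API. Squashing logits through a sigmoid is shown only as an optional, order-preserving convenience; the source does not state that the scores are calibrated probabilities.

```python
import math


def rank_by_score(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


def sigmoid(x):
    """Map a raw logit into (0, 1); monotonic, so the ranking is unchanged."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical candidates for one query, with made-up logits standing in
# for the values the model would return for each [query, document] pair.
candidates = ["doc A", "doc B", "doc C"]
logits = [0.30, 1.23, 0.59]

ranked = rank_by_score(candidates, logits)
for document, logit in ranked:
    print(f"{document}: logit={logit:.2f}, sigmoid={sigmoid(logit):.3f}")
```

In practice the `logits` list would come from `scores.tolist()` in the snippet above, computed over pairs that all share the same query.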

Evaluation

Results of reranking based on multiple text retrieval datasets

More detailed experimental results can be found in the paper.

Cloud API Services

In addition to the open-source GTE series models, GTE series models are also available as commercial API services on Alibaba Cloud.

Embedding Models: Three versions of the text embedding models are available: text-embedding-v1/v2/v3, with v3 being the latest API service.
ReRank Models: The gte-rerank model service is available.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

Citation

If you find our paper or models helpful, please consider citing:

@misc{zhang2024mgtegeneralizedlongcontexttext,
      title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval}, 
      author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
      year={2024},
      eprint={2407.19669},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.19669}, 
}