How to use BAAI/bge-m3 with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [4, 4]
```
OOM occurs when converting the model to TorchScript. I have a question about this issue.
Thank you for releasing such a great model.
I am an AI engineer in Korea and plan to use this model for Korean embeddings because of its good performance.
I'm trying to serve the model with Triton Inference Server (a sketch of the config I plan to use is at the end of this post).
An OOM occurs while converting the PyTorch model to TorchScript; it seems that more than 40GB of GPU memory is required.
The tokenizer's max_length is 8192, and padding is also set to max_length.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to("cuda")

class BGEM3(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids, attention_mask=attention_mask)
        # Mean pooling: zero out padded positions, then average over the sequence.
        last_hidden = outputs.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
        embedding = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        embedding = F.normalize(embedding, p=2, dim=1)
        return embedding

# Dummy input long enough to be truncated/padded to the full 8192-token context.
sentences = ["안녕 하세요." * 10000]  # "Hello." repeated
dummy_input = tokenizer(sentences, max_length=8192, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
dummy_input_ids = dummy_input["input_ids"]
dummy_attention_mask = dummy_input["attention_mask"]

with torch.no_grad():
    torch_model = BGEM3(model)
    torch_model.eval()
    trace_model = torch.jit.trace_module(
        mod=torch_model,
        inputs={"forward": (dummy_input_ids, dummy_attention_mask)},
        check_trace=False,
    )
    trace_model.save("model.pt")
```
Has anyone experienced this kind of memory issue?
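For reference, here is a sketch of the Triton config.pbtxt I plan to pair with the exported model.pt. The INPUT__0/OUTPUT__0 names follow the naming convention Triton's TorchScript backend expects, and the 1024-dim output assumes bge-m3's dense embedding size; treat the model name, batch size, and dims as placeholders rather than a verified config.

```
name: "bge_m3"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"   # input_ids
    data_type: TYPE_INT64
    dims: [ 8192 ]
  },
  {
    name: "INPUT__1"   # attention_mask
    data_type: TYPE_INT64
    dims: [ 8192 ]
  }
]
output [
  {
    name: "OUTPUT__0"  # normalized embedding
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
```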
Hello, I tested your code on one A800 GPU. The results show that it needs only 18.8GB, so 40GB of memory is enough.
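If anyone wants to reproduce the measurement, PyTorch's CUDA memory statistics report the peak allocation directly. A minimal sketch, wrapped around the tracing code from the question:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the tracing code from the question here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.1f} GB")
```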
What's more, there is an issue with your code: the pooling method it implements is mean pooling, but the pooling method of bge-m3 is CLS pooling, not mean pooling.
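For reference, a minimal sketch of the forward pass with CLS pooling, i.e. taking the hidden state of the first token instead of the masked mean (a drop-in replacement for the forward method in your wrapper class):

```python
import torch.nn.functional as F

def forward(self, input_ids, attention_mask):
    outputs = self.model(input_ids, attention_mask=attention_mask)
    # CLS pooling: the embedding is the hidden state of the first token.
    embedding = outputs.last_hidden_state[:, 0]
    embedding = F.normalize(embedding, p=2, dim=1)
    return embedding
```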