How to use BAAI/bge-m3 with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [4, 4]
```
OOM occurs when converting the model to TorchScript. I have a question about this issue.
Thank you for releasing such a great model.
I am an AI engineer in Korea and plan to use this model for Korean embeddings because of its good performance.
I'm trying to serve the model with Triton Inference Server (a sketch of the config I plan to use is at the end of this post).
An OOM occurs while converting the PyTorch model to TorchScript; it seems that more than 40GB of GPU memory is required.
The tokenizer's max_length is 8192, and padding is also set to max_length.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3").to("cuda")

class BGEM3(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids, attention_mask=attention_mask)
        # Mean pooling: zero out padded positions, then average over the sequence.
        last_hidden = outputs.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
        embedding = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        embedding = F.normalize(embedding, p=2, dim=1)
        return embedding

# Dummy input long enough to be truncated/padded to the full 8192-token context.
sentences = ["안녕 하세요." * 10000]  # "Hello." repeated
dummy_input = tokenizer(sentences, max_length=8192, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
dummy_input_ids = dummy_input["input_ids"]
dummy_attention_mask = dummy_input["attention_mask"]

with torch.no_grad():
    torch_model = BGEM3(model)
    torch_model.eval()
    trace_model = torch.jit.trace_module(
        mod=torch_model,
        inputs={"forward": (dummy_input_ids, dummy_attention_mask)},
        check_trace=False,
    )
    trace_model.save("model.pt")
```
Has anyone experienced this kind of memory issue?
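For reference, here is a sketch of the Triton config.pbtxt I plan to pair with the exported model.pt. The INPUT__0/OUTPUT__0 names follow the naming convention Triton's TorchScript backend expects, and the 1024-dim output assumes bge-m3's dense embedding size; treat the model name, batch size, and dims as placeholders rather than a verified config.

```
name: "bge_m3"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"   # input_ids
    data_type: TYPE_INT64
    dims: [ 8192 ]
  },
  {
    name: "INPUT__1"   # attention_mask
    data_type: TYPE_INT64
    dims: [ 8192 ]
  }
]
output [
  {
    name: "OUTPUT__0"  # normalized embedding
    data_type: TYPE_FP32
    dims: [ 1024 ]
  }
]
```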
Hello, I tested your code on one A800 GPU. The results show that it needs only 18.8GB, so 40GB of memory is enough.
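If anyone wants to reproduce the measurement, PyTorch's CUDA memory statistics report the peak allocation directly. A minimal sketch, wrapped around the tracing code from the question:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the tracing code from the question here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.1f} GB")
```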
What's more, there is an issue with your code: the pooling method it implements is mean pooling, but the pooling method of bge-m3 is CLS pooling, not mean pooling.
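For reference, a minimal sketch of the forward pass with CLS pooling, i.e. taking the hidden state of the first token instead of the masked mean (a drop-in replacement for the forward method in your wrapper class):

```python
import torch.nn.functional as F

def forward(self, input_ids, attention_mask):
    outputs = self.model(input_ids, attention_mask=attention_mask)
    # CLS pooling: the embedding is the hidden state of the first token.
    embedding = outputs.last_hidden_state[:, 0]
    embedding = F.normalize(embedding, p=2, dim=1)
    return embedding
```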