Latency observed in Embedding computation

by RajaRamKankipati - opened Apr 13, 2023

Discussion

RajaRamKankipati

Apr 13, 2023

edited Apr 13, 2023

Hi Team,

I am implementing MPNet embeddings for long documents (more than 512 tokens) with the following approach:

1. Get all the tokens from the tokenizer without truncation.
2. Split the tokens into chunks of 512.
3. Pass the chunks to the model as a single batch.

# Tokenize the whole document without truncation
encoded_input = tokenizer(
    document,
    max_length=None,
    padding=True,
    truncation=False,
    return_tensors="pt",
).to(device)

# Split the tokens into chunks of 512 (helper not shown here)
encoded_input = pre_processing_encoded_input(encoded_input, size=512)

# Compute token embeddings
with torch.no_grad():
    model_output = self.model(**encoded_input)
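
The body of pre_processing_encoded_input is not shown in this post. As a rough illustration only, the chunking step could look something like the sketch below. It works on plain Python lists so it runs without a model. The function name chunk_token_ids and the default pad_id=1 are assumptions made for the example; the real padding id should be read from tokenizer.pad_token_id.

```python
from typing import Dict, List


def chunk_token_ids(
    token_ids: List[int], size: int = 512, pad_id: int = 1
) -> Dict[str, List[List[int]]]:
    """Split one long token-id sequence into rows of `size`, padding the last row.

    Hypothetical stand-in for the helper in the post. `pad_id=1` is an
    assumption; use tokenizer.pad_token_id for the actual value.
    """
    if size <= 0:
        raise ValueError("size must be positive")

    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for start in range(0, len(token_ids), size):
        chunk = token_ids[start:start + size]
        n_pad = size - len(chunk)
        # Real tokens get mask 1, padding positions get mask 0.
        input_ids.append(chunk + [pad_id] * n_pad)
        attention_mask.append([1] * len(chunk) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Example: 1000 token ids become two rows of 512, the second one padded.
batch = chunk_token_ids(list(range(1000)), size=512, pad_id=1)
```

Each value can then be wrapped with torch.tensor(...) and moved to the device before being passed to the model. Note that this simple split does not re-add special tokens at the start and end of every chunk.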

With a single encoded_input of 512 tokens, the model takes around 230 ms to compute the embedding. With an input of shape (2, 512) it takes around 2000 ms, and the latency appears to keep increasing exponentially as more chunks are added. Is there any way to achieve low latency with this model for long documents?
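
To make timings like these easier to compare across batch shapes, a small harness along the following lines could be used. This is only a sketch and not necessarily how the numbers above were measured. The name median_latency_ms and the warm-up and repeat counts are assumptions for the example. On a GPU, the timed function would also need to call torch.cuda.synchronize() before returning, since CUDA kernels run asynchronously.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(
    fn: Callable[[], object], warmup: int = 2, repeats: int = 5
) -> float:
    """Return the median wall-clock time of fn() in milliseconds.

    Warm-up calls are executed first and are not timed.
    """
    for _ in range(warmup):
        fn()

    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example with a dummy workload; in practice fn would wrap the model call,
# e.g. lambda: model(**encoded_input) inside torch.no_grad().
calls: List[int] = []
latency = median_latency_ms(lambda: calls.append(sum(range(10_000))))
```

Running this once for an input of shape (1, 512) and once for (2, 512) would show whether the jump is reproducible after warm-up.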
