Padding token for batched embedding in Transformers?

#24

by ChrisCrass - opened Aug 2, 2024


Wondering if there are any best practices or special considerations for embedding batches of documents with this model. In my own testing I have found that adding extra items to a batch (if doing so causes any padding) can change the resulting embedding of a document compared to embedding it alone in a single-document batch.

The tokenizer in the Transformers approach always appends an eos token, but it doesn't add a bos token (which is the same token as the eos token), and it also uses the eos/bos token as the padding token... Is that by design?
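To illustrate why a shared eos/padding token matters, here is a minimal sketch in plain NumPy. It assumes right padding and an embedding taken from the last real token of each sequence; neither assumption is confirmed in this thread, and `EOS_ID`, `PAD_ID`, and `last_token_index` are hypothetical names used only for illustration. The point is that token ids alone cannot distinguish the real eos from padding, so the attention mask has to be used.

```python
import numpy as np

EOS_ID = 2        # hypothetical id, for illustration only
PAD_ID = EOS_ID   # the pad token reuses the eos token, as described above

def last_token_index(attention_mask):
    """Index of the last real token in each row of a right-padded batch."""
    return attention_mask.sum(axis=1) - 1

# Two right-padded sequences; each real sequence ends with an eos token.
input_ids = np.array([
    [11, 12, 13, EOS_ID],
    [21, EOS_ID, PAD_ID, PAD_ID],
])
attention_mask = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
])

# Comparing ids against PAD_ID also flags the real eos token as padding.
naive_real = input_ids != PAD_ID
# The attention mask marks exactly the real tokens, eos included.
mask_real = attention_mask.astype(bool)

# Position of the final real token (the eos) for each sequence.
last_idx = last_token_index(attention_mask)
```

With left padding the last real token would instead sit in the final column of every row, so the index computation above would not apply.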

Any tips would be much appreciated.

thenlper

Alibaba-NLP org Aug 14, 2024

Whether batch inference uses padding tokens depends on whether the tokenizer's padding parameter is set to true or false. We recommend not using padding.
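One possible way to batch without padding is to group documents whose tokenized lengths are equal, so every batch is already rectangular. The sketch below is only an illustration of that idea, not a method described in this thread; `group_by_length` is a hypothetical helper, and the token ids are made up.

```python
from collections import defaultdict

def group_by_length(token_id_lists):
    """Group sequences of equal length so each batch needs no padding.

    Returns a list of (original_indices, sequences) pairs, ordered by length.
    """
    buckets = defaultdict(list)
    for idx, ids in enumerate(token_id_lists):
        buckets[len(ids)].append(idx)
    return [
        (indices, [token_id_lists[i] for i in indices])
        for _, indices in sorted(buckets.items())
    ]

# Made-up token id lists standing in for tokenized documents.
docs = [[5, 6, 2], [7, 2], [8, 9, 2], [3, 2]]
batches = group_by_length(docs)
```

The original indices are kept so the embeddings can be put back in document order afterwards. Embedding one document per batch achieves the same thing at the cost of throughput.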
